From: Alan Frindell Date: Fri, 15 May 2015 17:42:48 +0000 (-0700) Subject: Fix race issue in EventBase::runInEventBaseThreadAndWait (t6779663) X-Git-Tag: v0.39.0~10 X-Git-Url: http://demsky.eecs.uci.edu/git/?a=commitdiff_plain;h=1ae1c1ff3f344c6c1cec8f4b78c725e8e4b9a134;p=folly.git Fix race issue in EventBase::runInEventBaseThreadAndWait (t6779663) Summary: Huge thanks to @afrind for debugging this issue with me and found the root cause. As per the comment from @afrind for the diff D1823407, there is a tricky race issue. The main thread could have left and reset the condition_variable from its stack but the EventBase thread tries to access it afterwards due to race and could be blocked indefinitely. This caused the server-side IO threads not able to pick up the incoming connections for the proxygen case. The fix is to use a simpler struct barrier and get a hold of the shared_ptr instead of the same object in a safer way. Test Plan: The original issue reproes very easily in HDFS XDC encryption case. Servers easily enter into bad state and we got high volume of timeouts from the client. With the fix, this does not happen anymore after the fix being deployed at 11:15PM. Here is the Scuba log before and after the fix: https://fburl.com/109969805 And here is the correspond Scuba diagram for successful calls in the same test: https://fburl.com/109971729 The throughput improved a lot after the fix. Reviewed By: davejwatson@fb.com Subscribers: doug, folly-diffs@, yfeldblum, chalfant, afrind, brettp, dougw, fma FB internal diff: D2071646 Signature: t1:2071646:1431709609:10fb033536f9e4fb428dea8ba68f6a9a051616c0 --- diff --git a/folly/io/async/EventBase.cpp b/folly/io/async/EventBase.cpp index 1c3ad397..daf4049b 100644 --- a/folly/io/async/EventBase.cpp +++ b/folly/io/async/EventBase.cpp @@ -599,8 +599,10 @@ bool EventBase::runInEventBaseThreadAndWait(const Cob& fn) { SCOPE_EXIT { std::unique_lock l(m); ready = true; - l.unlock(); cv.notify_one(); + // We cannot release the lock before notify_one, because a spurious + // wakeup in the waiting thread may lead to cv and m going out of scope + // prematurely. }; fn(); });