various perf improvements
Summary: Three strategies
1. Optimistic locking
2. Acquire-release memory ordering instead of full sequential consistency
3. Some low-hanging branch miss optimizations
Please review carefully; the dogscience is strong with this one
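For reviewers, a minimal sketch of how the three strategies can combine on a small state machine. It is not the code in this diff: `TinyFsm`, `State`, `SKETCH_UNLIKELY`, and `runExample` are hypothetical names, and the sketch assumes a mutex-guarded transition over an atomic state word. The state is checked before taking the lock (optimistic), loads and stores use acquire/release rather than the default sequentially consistent ordering, and the failure path is marked as unlikely.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Branch hint; falls back to a no-op on compilers without __builtin_expect.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical two-state machine for illustration only.
enum class State { Start, Done };

class TinyFsm {
 public:
  // Move from `from` to `to`, running `action` under the lock.
  // Returns false if the machine was not in `from`.
  template <class F>
  bool updateState(State from, State to, F&& action) {
    // Optimistic pre-check: skip the lock when the transition cannot succeed.
    if (SKETCH_UNLIKELY(state_.load(std::memory_order_acquire) != from)) {
      return false;
    }
    std::lock_guard<std::mutex> guard(mutex_);
    // Re-check under the lock: another thread may have won the race.
    if (state_.load(std::memory_order_acquire) != from) {
      return false;
    }
    action();
    // Release store: a thread that later acquire-loads `to` also sees
    // everything `action` wrote.
    state_.store(to, std::memory_order_release);
    return true;
  }

  State state() const { return state_.load(std::memory_order_acquire); }

 private:
  std::mutex mutex_;
  std::atomic<State> state_{State::Start};
};

// Usage: the first transition wins; the second is rejected by the pre-check.
inline int runExample() {
  TinyFsm fsm;
  int payload = 0;
  bool first = fsm.updateState(State::Start, State::Done, [&] { payload = 42; });
  bool second = fsm.updateState(State::Start, State::Done, [&] { payload = 7; });
  assert(first && !second);
  assert(fsm.state() == State::Done);
  return payload;
}
```

The pre-check only avoids taking the lock on a transition that is already known to fail. Correctness still rests on the re-check under the lock.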
```
Before:
============================================================================
folly/futures/test/Benchmark.cpp                relative  time/iter  iters/s
============================================================================
constantFuture                                             127.99ns    7.81M
promiseAndFuture                                  94.89%   134.89ns    7.41M
withThen                                          28.40%   450.63ns    2.22M
----------------------------------------------------------------------------
oneThen                                                    446.68ns    2.24M
twoThens                                          58.35%   765.55ns    1.31M
fourThens                                         31.87%     1.40us  713.41K
hundredThens                                       1.61%    27.78us   35.99K
----------------------------------------------------------------------------
no_contention                                                4.63ms   216.00
contention                                        80.79%     5.73ms   174.52
----------------------------------------------------------------------------
throwAndCatch                                               10.91us   91.64K
throwAndCatchWrapped                             127.14%     8.58us  116.51K
throwWrappedAndCatch                             178.22%     6.12us  163.32K
throwWrappedAndCatchWrapped                      793.75%     1.37us  727.38K
----------------------------------------------------------------------------
throwAndCatchContended                                        1.35s  741.33m
throwAndCatchWrappedContended                    139.18%   969.23ms     1.03
throwWrappedAndCatchContended                    169.51%   795.76ms     1.26
throwWrappedAndCatchWrappedContended           17742.23%     7.60ms   131.53
----------------------------------------------------------------------------
complexUnit                                                127.50us    7.84K
complexBlob4                                     100.14%   127.32us    7.85K
complexBlob8                                     100.16%   127.30us    7.86K
complexBlob64                                     96.45%   132.19us    7.57K
complexBlob128                                    92.83%   137.35us    7.28K
complexBlob256                                    87.79%   145.23us    6.89K
complexBlob512                                    81.64%   156.18us    6.40K
complexBlob1024                                   72.54%   175.76us    5.69K
complexBlob2048                                   58.52%   217.89us    4.59K
complexBlob4096                                   32.54%   391.78us    2.55K
============================================================================
After:
============================================================================
folly/futures/test/Benchmark.cpp                relative  time/iter  iters/s
============================================================================
constantFuture                                              85.28ns   11.73M
promiseAndFuture                                  88.63%    96.22ns   10.39M
withThen                                          30.46%   279.99ns    3.57M
----------------------------------------------------------------------------
oneThen                                                    231.18ns    4.33M
twoThens                                          60.57%   381.70ns    2.62M
fourThens                                         33.52%   689.71ns    1.45M
hundredThens                                       1.49%    15.48us   64.58K
----------------------------------------------------------------------------
no_contention                                                3.84ms   260.19
contention                                        88.29%     4.35ms   229.73
----------------------------------------------------------------------------
throwAndCatch                                               10.63us   94.06K
throwAndCatchWrapped                             127.17%     8.36us  119.61K
throwWrappedAndCatch                             179.83%     5.91us  169.15K
throwWrappedAndCatchWrapped                     1014.48%     1.05us  954.19K
----------------------------------------------------------------------------
throwAndCatchContended                                        1.34s  749.03m
throwAndCatchWrappedContended                    140.66%   949.16ms     1.05
throwWrappedAndCatchContended                    164.87%   809.77ms     1.23
throwWrappedAndCatchWrappedContended           49406.39%     2.70ms   370.07
----------------------------------------------------------------------------
complexUnit                                                 86.83us   11.52K
complexBlob4                                      97.42%    89.12us   11.22K
complexBlob8                                      96.63%    89.85us   11.13K
complexBlob64                                     92.53%    93.84us   10.66K
complexBlob128                                    90.85%    95.57us   10.46K
complexBlob256                                    82.56%   105.17us    9.51K
complexBlob512                                    74.13%   117.12us    8.54K
complexBlob1024                                   63.67%   136.37us    7.33K
complexBlob2048                                   50.25%   172.79us    5.79K
complexBlob4096                                   26.63%   326.05us    3.07K
============================================================================
```
Reviewed By: @djwatson
Differential Revision: D2139822