improve io::Cursor read() performance for small sizeof(T)
Summary:
gcc 4.8.2 fails to unroll the loop in `pullAtMost()`, so it does not
replace the `memcpy` call with a simple load when `len` is small.
This makes `read()` slower than necessary for small `sizeof(T)`.
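For illustration only, here is a minimal sketch of the kind of fast path that avoids the problem. It is not the code from this diff: `Chunk` and `SketchCursor` are hypothetical stand-ins for the buffer chain and `io::Cursor`, and the sketch assumes the fix is to copy a compile-time-constant `sizeof(T)` bytes whenever the current buffer holds the whole value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for one contiguous buffer in a chain.
struct Chunk {
  const std::uint8_t* data;
  std::size_t length;
};

// Hypothetical, simplified cursor; not the real io::Cursor.
class SketchCursor {
 public:
  explicit SketchCursor(std::vector<Chunk> chunks)
      : chunks_(std::move(chunks)) {}

  // General path: len is a runtime value, so each memcpy stays a real
  // call unless the compiler unrolls and specializes the loop.
  std::size_t pullAtMost(void* buf, std::size_t len) {
    auto* out = static_cast<std::uint8_t*>(buf);
    std::size_t copied = 0;
    while (copied < len && idx_ < chunks_.size()) {
      const Chunk& c = chunks_[idx_];
      const std::size_t avail = c.length - offset_;
      if (avail == 0) {  // current chunk exhausted, move to the next one
        ++idx_;
        offset_ = 0;
        continue;
      }
      const std::size_t n = std::min(avail, len - copied);
      std::memcpy(out + copied, c.data + offset_, n);
      copied += n;
      offset_ += n;
    }
    return copied;
  }

  // Fast path: when the whole value sits in the current chunk, copy a
  // constant sizeof(T) bytes, which compilers typically lower to a load.
  template <class T>
  bool read(T& val) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "read() only supports trivially copyable types");
    if (idx_ < chunks_.size()) {
      assert(offset_ <= chunks_[idx_].length);
      if (chunks_[idx_].length - offset_ >= sizeof(T)) {
        std::memcpy(&val, chunks_[idx_].data + offset_, sizeof(T));
        offset_ += sizeof(T);
        return true;
      }
    }
    // Slow path: the value spans chunks or the data has run out.
    return pullAtMost(&val, sizeof(T)) == sizeof(T);
  }

 private:
  std::vector<Chunk> chunks_;
  std::size_t idx_ = 0;     // index of the current chunk
  std::size_t offset_ = 0;  // read position within the current chunk
};
```

In this sketch a failed `read()` still consumes whatever bytes the slow path managed to copy; a real cursor would need to decide how to report or roll back a short read.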
Test Plan:
fbconfig -r folly/io/test thrift/lib/cpp2/test && fbmake runtests_opt -j32
Ran the unicorn-specific Thrift deserialization benchmark from
D1724070 and verified a 50% improvement in `SearchRequest`
deserialization performance.
`thrift/lib/cpp2/test/ProtocolBench` results:
```
                                                  |---- before ----|   |---- after -----|
=========================================================================================
thrift/lib/cpp2/test/ProtocolBench.cpp  relative  time/iter  iters/s   time/iter  iters/s
=========================================================================================
BinaryProtocol_read_Empty                           21.72ns   46.04M     17.58ns   56.89M
BinaryProtocol_read_SmallInt                        43.03ns   23.24M     23.64ns   42.30M
BinaryProtocol_read_BigInt                          43.72ns   22.87M     22.03ns   45.38M
BinaryProtocol_read_SmallString                     88.57ns   11.29M     47.01ns   21.27M
BinaryProtocol_read_BigString                      365.76ns    2.73M    323.58ns    3.09M
BinaryProtocol_read_BigBinary                      207.78ns    4.81M    169.09ns    5.91M
BinaryProtocol_read_LargeBinary                    187.81ns    5.32M    172.09ns    5.81M
BinaryProtocol_read_Mixed                          161.18ns    6.20M     68.41ns   14.62M
BinaryProtocol_read_SmallListInt                   177.32ns    5.64M     96.91ns   10.32M
BinaryProtocol_read_BigListInt                      77.03us   12.98K     15.88us   62.97K
BinaryProtocol_read_BigListMixed                     1.79ms   557.79    923.99us    1.08K
BinaryProtocol_read_LargeListMixed                 195.01ms     5.13    103.78ms     9.64
=========================================================================================
```
Reviewed By: soren@fb.com
Subscribers: alandau, bmatheny, mshneer, trunkagent, njormrod, folly-diffs@
FB internal diff: D1724111
Tasks: 5770136
Signature: t1:1724111:1417977810:b7d643d0c819a0bbac77fa0048206153929e50a8