Improve fast path of Cursor
Summary:
This change simplifies the fast path by reducing it to the bare minimum (i.e. check length, load data) and removes the indirection to IOBuf.
Additionally, it adds a `skipNoAdvance` method to provide a 1-instruction skip.
Disassembly of `read<signed char>` was over 35 instructions (just the hot path). With this change it's down to 8.
Disassembly after:
Dump of assembler code for function folly::io::detail::CursorBase<folly::io::Cursor, folly::IOBuf const>::read<unsigned char>():
0x000000000041f0f0 <+0>: mov 0x18(%rdi),%rax
0x000000000041f0f4 <+4>: lea 0x1(%rax),%rcx
0x000000000041f0f8 <+8>: cmp 0x10(%rdi),%rcx
0x000000000041f0fc <+12>: ja 0x41f105 <folly::io::detail::CursorBase<folly::io::Cursor, folly::IOBuf const>::read<unsigned char>()+21>
0x000000000041f0fe <+14>: mov (%rax),%al
0x000000000041f100 <+16>: mov %rcx,0x18(%rdi)
0x000000000041f104 <+20>: retq
0x000000000041f105 <+21>: jmpq 0x41f110 <folly::io::detail::CursorBase<folly::io::Cursor, folly::IOBuf const>::readSlow<unsigned char>()>
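For readers who prefer source over assembly, the shape of the fast path above (compare against a cached end pointer, load, store the advanced position, otherwise tail-call the slow path) can be sketched roughly as below. This is a simplified illustration, not the actual folly code: `SketchCursor` and the member names `crtPos_` / `crtEnd_` are hypothetical, the sketch covers a single contiguous buffer only, and its `readSlow` simply throws where a real chained-buffer cursor would move to the next buffer. The behaviour shown for `skipNoAdvance` (pointer bump, with the caller guaranteeing the bytes are available) is likewise an assumption based on the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>

// Hypothetical, simplified cursor over ONE contiguous buffer.
// It keeps the current position and the end of the buffer cached in the
// cursor itself, so the hot path never has to dereference a buffer object.
class SketchCursor {
 public:
  SketchCursor(const std::uint8_t* data, std::size_t len)
      : crtPos_(data), crtEnd_(data + len) {}

  // Bytes left in the current buffer.
  std::size_t length() const {
    return static_cast<std::size_t>(crtEnd_ - crtPos_);
  }

  template <class T>
  T read() {
    static_assert(std::is_trivially_copyable<T>::value,
                  "read<T> requires a trivially copyable type");
    // Fast path: one length check, one load, one pointer update.
    if (length() >= sizeof(T)) {
      T val;
      std::memcpy(&val, crtPos_, sizeof(T));
      crtPos_ += sizeof(T);
      return val;
    }
    // Everything else is kept out of line.
    return readSlow<T>();
  }

  // Assumed semantics: skip n bytes inside the current buffer without any
  // buffer-advance logic. The caller must guarantee n <= length(); this is
  // only checked in debug builds.
  void skipNoAdvance(std::size_t n) {
    assert(n <= length());
    crtPos_ += n;
  }

 private:
  template <class T>
  T readSlow() {
    // A chained-buffer cursor would try the next buffer here. This sketch
    // has only one buffer, so running out of data is reported as an error.
    throw std::out_of_range("SketchCursor: not enough data");
  }

  const std::uint8_t* crtPos_;  // current read position
  const std::uint8_t* crtEnd_;  // one past the last readable byte
};
```

With a 4-byte buffer, for example, `read<std::uint8_t>()` followed by `skipNoAdvance(2)` leaves `length() == 1`, and a read past the end falls through to `readSlow`.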
With this diff, Thrift deserialization becomes ~20% faster (on prod workloads).
Thrift benchmark:
Before:
============================================================================
thrift/lib/cpp2/test/ProtocolBench.cpp relative time/iter iters/s
============================================================================
BinaryProtocol_read_Empty 12.98ns 77.03M
BinaryProtocol_read_SmallInt 20.94ns 47.76M
BinaryProtocol_read_BigInt 20.86ns 47.93M
BinaryProtocol_read_SmallString 34.64ns 28.86M
BinaryProtocol_read_BigString 185.53ns 5.39M
BinaryProtocol_read_BigBinary 67.34ns 14.85M
BinaryProtocol_read_LargeBinary 62.23ns 16.07M
BinaryProtocol_read_Mixed 58.74ns 17.03M
BinaryProtocol_read_SmallListInt 89.99ns 11.11M
BinaryProtocol_read_BigListInt 39.92us 25.05K
BinaryProtocol_read_BigListMixed 616.20us 1.62K
BinaryProtocol_read_LargeListMixed 83.49ms 11.98
CompactProtocol_read_Empty 11.28ns 88.67M
CompactProtocol_read_SmallInt 19.15ns 52.22M
CompactProtocol_read_BigInt 26.14ns 38.25M
CompactProtocol_read_SmallString 31.04ns 32.22M
CompactProtocol_read_BigString 184.55ns 5.42M
CompactProtocol_read_BigBinary 69.73ns 14.34M
CompactProtocol_read_LargeBinary 64.39ns 15.53M
CompactProtocol_read_Mixed 58.73ns 17.03M
CompactProtocol_read_SmallListInt 76.50ns 13.07M
CompactProtocol_read_BigListInt 25.93us 38.56K
CompactProtocol_read_BigListMixed 623.15us 1.60K
CompactProtocol_read_LargeListMixed 80.57ms 12.41
============================================================================
After:
============================================================================
thrift/lib/cpp2/test/ProtocolBench.cpp relative time/iter iters/s
============================================================================
BinaryProtocol_read_Empty 10.40ns 96.17M
BinaryProtocol_read_SmallInt 15.14ns 66.03M
BinaryProtocol_read_BigInt 15.19ns 65.84M
BinaryProtocol_read_SmallString 25.19ns 39.70M
BinaryProtocol_read_BigString 172.85ns 5.79M
BinaryProtocol_read_BigBinary 56.88ns 17.58M
BinaryProtocol_read_LargeBinary 56.77ns 17.61M
BinaryProtocol_read_Mixed 43.98ns 22.74M
BinaryProtocol_read_SmallListInt 58.19ns 17.19M
BinaryProtocol_read_BigListInt 19.75us 50.63K
BinaryProtocol_read_BigListMixed 440.20us 2.27K
BinaryProtocol_read_LargeListMixed 56.94ms 17.56
CompactProtocol_read_Empty 9.35ns 106.93M
CompactProtocol_read_SmallInt 13.07ns 76.49M
CompactProtocol_read_BigInt 18.23ns 54.87M
CompactProtocol_read_SmallString 25.61ns 39.05M
CompactProtocol_read_BigString 174.46ns 5.73M
CompactProtocol_read_BigBinary 59.77ns 16.73M
CompactProtocol_read_LargeBinary 60.81ns 16.44M
CompactProtocol_read_Mixed 42.70ns 23.42M
CompactProtocol_read_SmallListInt 66.89ns 14.95M
CompactProtocol_read_BigListInt 25.08us 39.87K
CompactProtocol_read_BigListMixed 427.93us 2.34K
CompactProtocol_read_LargeListMixed 56.11ms 17.82
============================================================================
Reviewed By: yfeldblum
Differential Revision: D6635325
fbshipit-source-id: 393fc1005689042977c03f37f5a898ebe7814d44