[X86] Prefer blendps over insertps codegen for one special case
With this patch, for this one exact case, we'll generate:
blendps %xmm0, %xmm1, $1
instead of:
insertps %xmm0, %xmm1, $0
If there's a memory operand available for load folding and we're
optimizing for size, we'll still generate the insertps.
The detailed performance data motivation for this may be found in D7866;
in summary, blendps has 2-3x throughput vs. insertps on widely used chips.
Differential Revision: http://reviews.llvm.org/D8332
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@232850
91177308-0d34-0410-b5e6-
96231b3b80d8