[AArch64] Avoid going through GPRs for across-vector instructions.
author     Ahmed Bougacha <ahmed.bougacha@gmail.com>
           Tue, 10 Mar 2015 20:45:38 +0000 (20:45 +0000)
committer  Ahmed Bougacha <ahmed.bougacha@gmail.com>
           Tue, 10 Mar 2015 20:45:38 +0000 (20:45 +0000)
commit     4a3cd42601e1488298c280beb51f419a8b78b01b
tree       7b9925352a5f444ba7a75f6de3ce57e91813f992
parent     4cd59eb6293cf7df79add91da23c121d16e08ba9

This adds a new node type for each across-vector-lanes intrinsic.
For instance, for addv, we have AArch64ISD::UADDV, such that:
  (v4i32 (uaddv ...))
is the same as
  (v4i32 (scalar_to_vector (i32 (int_aarch64_neon_uaddv ...))))
that is,
  (v4i32 (INSERT_SUBREG (v4i32 (IMPLICIT_DEF)),
           (i32 (int_aarch64_neon_uaddv ...)), ssub))
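
For reference, a rough sketch of the enumerators this implies in
AArch64ISelLowering.h; only UADDV is named above, and the signed and
min/max variants are inferred from the touched tests, so treat the exact
list as an assumption:

    namespace AArch64ISD {
    enum NodeType {
      // ...
      // Across-lanes reductions: only the bottom result lane is defined;
      // the remaining lanes are undef (hence the scalar_to_vector above).
      SADDV,   // int_aarch64_neon_saddv
      UADDV,   // int_aarch64_neon_uaddv
      SMINV,   // int_aarch64_neon_sminv
      UMINV,   // int_aarch64_neon_uminv
      SMAXV,   // int_aarch64_neon_smaxv
      UMAXV,   // int_aarch64_neon_umaxv
      // ...
    };
    }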

In a combine, we transform all such across-vector-lanes intrinsics to:

  (i32 (extract_vector_elt (uaddv ...), 0))
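
A minimal sketch of what that combine could look like; the helper name
and the exact plumbing into the intrinsic combine are illustrative
assumptions, and the calls use the SelectionDAG API as it stood at the
time:

    // Sketch in the style of the DAG combines in AArch64ISelLowering.cpp.
    #include "llvm/CodeGen/SelectionDAG.h"
    using namespace llvm;

    static SDValue combineAcrossLanesIntrinsic(unsigned Opc, SDNode *N,
                                               SelectionDAG &DAG) {
      // Rebuild the intrinsic as the new across-lanes node, and make the
      // lane-0 extract explicit so the lane-aware patterns can see it.
      SDLoc dl(N);
      return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, N->getValueType(0),
                         DAG.getNode(Opc, dl,
                                     N->getOperand(1).getValueType(),
                                     N->getOperand(1)),
                         DAG.getConstant(0, MVT::i64));
    }

    // Hooked up per intrinsic, e.g.:
    //   case Intrinsic::aarch64_neon_uaddv:
    //     return combineAcrossLanesIntrinsic(AArch64ISD::UADDV, N, DAG);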

This has one big advantage: by making the extract_vector_elt explicit, we
enable the existing patterns for lane-aware instructions to fire.
This lets us avoid needlessly going through the GPRs.  Consider:

    uint32x4_t test_mul(uint32x4_t a, uint32x4_t b) {
        return vmulq_n_u32(a, vaddvq_u32(b));
    }

We now generate:
    addv.4s  s1, v1
    mul.4s   v0, v0, v1[0]
instead of the previous:
    addv.4s  s1, v1
    fmov     w8, s1
    dup.4s   v1, w8
    mul.4s   v0, v1, v0

rdar://20044838

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@231840 91177308-0d34-0410-b5e6-96231b3b80d8
lib/Target/AArch64/AArch64ISelLowering.cpp
lib/Target/AArch64/AArch64ISelLowering.h
lib/Target/AArch64/AArch64InstrInfo.td
test/CodeGen/AArch64/arm64-smaxv.ll
test/CodeGen/AArch64/arm64-sminv.ll
test/CodeGen/AArch64/arm64-umaxv.ll
test/CodeGen/AArch64/arm64-uminv.ll
test/CodeGen/AArch64/arm64-vaddv.ll