X-Git-Url: http://demsky.eecs.uci.edu/git/?a=blobdiff_plain;f=lib%2FTarget%2FARM%2FREADME.txt;h=8ba9a27e95c89251cc61e1d6a260fb644fde8cc0;hb=3a96122c4ae4e7727ba976a9f658626c18997689;hp=000e8e6450a82d092340bf16d3dfb7928e8e891a;hpb=a8e2989ece6dc46df59b0768184028257f913843;p=oota-llvm.git

diff --git a/lib/Target/ARM/README.txt b/lib/Target/ARM/README.txt
index 000e8e6450a..8ba9a27e95c 100644
--- a/lib/Target/ARM/README.txt
+++ b/lib/Target/ARM/README.txt
@@ -7,84 +7,77 @@ Reimplement 'select' in terms of 'SEL'.
 * We would really like to support UXTAB16, but we need to prove that the
   add doesn't need to overflow between the two 16-bit chunks.
 
-* implement predication support
 * Implement pre/post increment support. (e.g. PR935)
-* Coalesce stack slots!
 * Implement smarter constant generation for binops with large immediates.
 
-* Consider materializing FP constants like 0.0f and 1.0f using integer
-  immediate instructions then copy to FPU. Slower than load into FPU?
+A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX
 
-//===---------------------------------------------------------------------===//
+Interesting optimization for PIC codegen on arm-linux:
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129
 
-The constant island pass is extremely naive. If a constant pool entry is
-out of range, it *always* splits a block and inserts a copy of the cp
-entry inline. It should:
+//===---------------------------------------------------------------------===//
 
-1. Check to see if there is already a copy of this constant nearby. If so,
-   reuse it.
-2. Instead of always splitting blocks to insert the constant, insert it in
-   nearby 'water'.
-3. Constant island references should be ref counted. If a constant reference
-   is out-of-range, and the last reference to a constant is relocated, the
-   dead constant should be removed.
+Crazy idea: Consider code that uses lots of 8-bit or 16-bit values. By the
+time regalloc happens, these values are now in a 32-bit register, usually with
+the top bits known to be sign or zero extended. If spilled, we should be able
+to spill these to an 8-bit or 16-bit stack slot, zero or sign extending as part
+of the reload.
 
-This pass has all the framework needed to implement this, but it hasn't
-been done.
+Doing this reduces the size of the stack frame (important for Thumb etc.), and
+also increases the likelihood that we will be able to reload multiple values
+from the stack with a single load.
 
 //===---------------------------------------------------------------------===//
 
-We need to start generating predicated instructions. The .td files have a way
-to express this now (see the PPC conditional return instruction), but the
-branch folding pass (or a new if-cvt pass) should start producing these, at
-least in the trivial case.
+The constant island pass is in good shape. Some cleanups might be desirable,
+but there is unlikely to be much improvement in the generated code.
+
+1. There may be some advantage to trying to be smarter about the initial
+placement, rather than putting everything at the end.
 
-Among the obvious wins, doing so can eliminate the need to custom expand
-copysign (i.e. we won't need to custom expand it to get the conditional
-negate).
+2. There might be some compile-time efficiency to be had by representing
+consecutive islands as a single block rather than multiple blocks.
+
+3. Use a priority queue to sort constant pool users in inverse order of
+   position, so we always process the one closest to the end of the function
+   first. This may simply CreateNewWater.
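+
+Item 3 is roughly the following idea (an illustrative sketch only; CPUser,
+Offset, and CPIndex are made-up stand-ins, not the actual data structures
+used by ARMConstantIslandPass):
+
+#include <cstdio>
+#include <queue>
+#include <vector>
+
+struct CPUser {
+  unsigned Offset;   // byte offset of the instruction referencing the pool
+  unsigned CPIndex;  // which constant pool entry it references
+};
+
+// Max-heap on Offset: the user closest to the end of the function comes
+// out first.
+struct ByOffset {
+  bool operator()(const CPUser &A, const CPUser &B) const {
+    return A.Offset < B.Offset;
+  }
+};
+
+int main() {
+  std::priority_queue<CPUser, std::vector<CPUser>, ByOffset> Q;
+  Q.push({4096, 0});
+  Q.push({128, 1});
+  Q.push({8192, 2});
+  while (!Q.empty()) {   // visits offsets 8192, 4096, 128 in that order
+    std::printf("offset %u -> cp#%u\n", Q.top().Offset, Q.top().CPIndex);
+    Q.pop();
+  }
+}
+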
 //===---------------------------------------------------------------------===//
 
-Implement long long "X-3" with instructions that fold the immediate in. These
-were disabled due to badness with the ARM carry flag on subtracts.
+Eliminate copysign custom expansion. We are still generating crappy code with
+default expansion + if-conversion.
 
 //===---------------------------------------------------------------------===//
 
-We currently compile abs:
-int foo(int p) { return p < 0 ? -p : p; }
+Eliminate one instruction from:
 
-into:
+define i32 @_Z6slow4bii(i32 %x, i32 %y) {
+  %tmp = icmp sgt i32 %x, %y
+  %retval = select i1 %tmp, i32 %x, i32 %y
+  ret i32 %retval
+}
 
-_foo:
-    rsb r1, r0, #0
-    cmn r0, #1
+__Z6slow4bii:
+    cmp r0, r1
     movgt r1, r0
     mov r0, r1
     bx lr
+=>
 
-This is very, uh, literal. This could be a 3 operation sequence:
-  t = (p sra 31);
-  res = (p xor t)-t
-
-Which would be better. This occurs in png decode.
+__Z6slow4bii:
+    cmp r0, r1
+    movle r0, r1
+    bx lr
 
 //===---------------------------------------------------------------------===//
 
-More load / store optimizations:
-1) Look past instructions without side-effects (not load, store, branch, etc.)
-   when forming the list of loads / stores to optimize.
-
-2) Smarter register allocation?
-We are probably missing some opportunities to use ldm / stm. Consider:
-
-ldr r5, [r0]
-ldr r4, [r0, #4]
+Implement long long "X-3" with instructions that fold the immediate in. These
+were disabled due to badness with the ARM carry flag on subtracts.
 
-This cannot be merged into a ldm. Perhaps we will need to do the transformation
-before register allocation. Then teach the register allocator to allocate a
-chunk of consecutive registers.
+//===---------------------------------------------------------------------===//
 
-3) Better representation for block transfer? This is from Olden/power:
+More load / store optimizations:
+1) Better representation for block transfer? This is from Olden/power:
 
     fldd d0, [r4]
     fstd d0, [r4, #+32]
@@ -98,7 +91,7 @@ chunk of consecutive registers.
 If we can spare the registers, it would be better to use fldm and fstm here.
 Need major register allocator enhancement though.
 
-4) Can we recognize the relative position of constantpool entries? i.e. Treat
+2) Can we recognize the relative position of constantpool entries? i.e. Treat
 
     ldr r0, LCPI17_3
     ldr r1, LCPI17_4
@@ -122,11 +115,28 @@ L6:
     .long -858993459
     .long 1074318540
 
-5) Can we make use of ldrd and strd? Instead of generating ldm / stm, use
-ldrd/strd instead if there are only two destination registers that form an
-odd/even pair. However, we probably would pay a penalty if the address is not
-aligned on 8-byte boundary. This requires more information on load / store
-nodes (and MI's?) then we currently carry.
+3) struct copies appear to be done field by field
+instead of by words, at least sometimes:
+
+struct foo { int x; short s; char c1; char c2; };
+void cpy(struct foo*a, struct foo*b) { *a = *b; }
+
+llvm code (-O2)
+    ldrb r3, [r1, #+6]
+    ldr r2, [r1]
+    ldrb r12, [r1, #+7]
+    ldrh r1, [r1, #+4]
+    str r2, [r0]
+    strh r1, [r0, #+4]
+    strb r3, [r0, #+6]
+    strb r12, [r0, #+7]
+gcc code (-O2)
+    ldmia r1, {r1-r2}
+    stmia r0, {r1-r2}
+
+In this benchmark, poor handling of aggregate copies has shown up as
+having a large effect on size, and possibly speed as well (we don't have
+a good way to measure on ARM).
 //===---------------------------------------------------------------------===//
 
@@ -138,24 +148,19 @@ double bar(double x) {
 }
 
 _bar:
-    sub sp, sp, #16
-    str r4, [sp, #+12]
-    str r5, [sp, #+8]
-    str lr, [sp, #+4]
-    mov r4, r0
-    mov r5, r1
-    ldr r0, LCPI2_0
-    bl _foo
-    fmsr f0, r0
-    fcvtsd d0, f0
-    fmdrr d1, r4, r5
-    faddd d0, d0, d1
-    fmrrd r0, r1, d0
-    ldr lr, [sp, #+4]
-    ldr r5, [sp, #+8]
-    ldr r4, [sp, #+12]
-    add sp, sp, #16
-    bx lr
+    stmfd sp!, {r4, r5, r7, lr}
+    add r7, sp, #8
+    mov r4, r0
+    mov r5, r1
+    fldd d0, LCPI1_0
+    fmrrd r0, r1, d0
+    bl _foo
+    fmdrr d0, r4, r5
+    fmsr s2, r0
+    fsitod d1, s2
+    faddd d0, d1, d0
+    fmrrd r0, r1, d0
+    ldmfd sp!, {r4, r5, r7, pc}
 
 Ignore the prologue and epilogue stuff for a second. Note
     mov r4, r0
 
@@ -270,56 +275,6 @@ See McCat/18-imp/ComputeBoundingBoxes for an example.
 
 //===---------------------------------------------------------------------===//
 
-We need register scavenging. Currently, the 'ip' register is reserved in case
-frame indexes are too big. This means that we generate extra code for stuff
-like this:
-
-void foo(unsigned x, unsigned y, unsigned z, unsigned *a, unsigned *b, unsigned *c) {
-  short Rconst = (short) (16384.0f * 1.40200 + 0.5 );
-  *a = x * Rconst;
-  *b = y * Rconst;
-  *c = z * Rconst;
-}
-
-we compile it to:
-
-_foo:
-***     stmfd sp!, {r4, r7}
-***     add r7, sp, #4
-        mov r4, #186
-        orr r4, r4, #89, 24 @ 22784
-        mul r0, r0, r4
-        str r0, [r3]
-        mul r0, r1, r4
-        ldr r1, [sp, #+8]
-        str r0, [r1]
-        mul r0, r2, r4
-        ldr r1, [sp, #+12]
-        str r0, [r1]
-***     sub sp, r7, #4
-***     ldmfd sp!, {r4, r7}
-        bx lr
-
-GCC produces:
-
-_foo:
-        ldr ip, L4
-        mul r0, ip, r0
-        mul r1, ip, r1
-        str r0, [r3, #0]
-        ldr r3, [sp, #0]
-        mul r2, ip, r2
-        str r1, [r3, #0]
-        ldr r3, [sp, #4]
-        str r2, [r3, #0]
-        bx lr
-L4:
-        .long 22970
-
-This is apparently all because we couldn't use ip here.
-
-//===---------------------------------------------------------------------===//
-
 Pre-/post- indexed load / stores:
 
 1) We should not make the pre/post- indexed load/store transform if the base ptr
@@ -351,21 +306,7 @@ time.
 
 4) Once we added support for multiple result patterns, write indexed loads
    patterns instead of C++ instruction selection code.
 
-5) Use FLDM / FSTM to emulate indexed FP load / store.
-
-//===---------------------------------------------------------------------===//
-
-We should add i64 support to take advantage of the 64-bit load / stores.
-We can add a pseudo i64 register class containing pseudo registers that are
-register pairs. All other ops (e.g. add, sub) would be expanded as usual.
-
-We need to add pseudo instructions (i.e. gethi / getlo) to extract i32 registers
-from the i64 register. These are single moves which can be eliminated if the
-destination register is a sub-register of the source. We should implement proper
-subreg support in the register allocator to coalesce these away.
-
-There are other minor issues such as multiple instructions for a spill / restore
-/ move.
+5) Use VLDM / VSTM to emulate indexed FP load / store.
 
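+A reminder of the kind of source this list is about (illustrative only, not
+taken from any particular benchmark): a simple FP reduction loop, where
+folding the pointer update into the load (or batching the accesses with
+VLDM) would pay off.
+
+double sum(const double *p, int n) {
+  double s = 0.0;
+  for (int i = 0; i != n; ++i)
+    s += *p++;   // ideally a post-incremented FP load on each iteration
+  return s;
+}
+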
 //===---------------------------------------------------------------------===//
 
@@ -437,3 +378,306 @@ http://www.inf.u-szeged.hu/gcc-arm/
 http://citeseer.ist.psu.edu/debus04linktime.html
 
 //===---------------------------------------------------------------------===//
+
+gcc generates smaller code for this function at -O2 or -Os:
+
+void foo(signed char* p) {
+  if (*p == 3)
+    bar();
+  else if (*p == 4)
+    baz();
+  else if (*p == 5)
+    quux();
+}
+
+llvm decides it's a good idea to turn the repeated if...else into a
+binary tree, as if it were a switch; the resulting code requires -1
+compare-and-branches when *p<=2 or *p==5, the same number if *p==4
+or *p>6, and +1 if *p==3. So it should be a speed win
+(on balance). However, the revised code is larger, with 4 conditional
+branches instead of 3.
+
+More seriously, there is a byte->word extend before
+each comparison, where there should be only one, and the condition codes
+are not remembered when the same two values are compared twice.
+
+//===---------------------------------------------------------------------===//
+
+More LSR enhancements possible:
+
+1. Teach LSR about pre- and post- indexed ops to allow the iv increment to be
+   merged into a load / store.
+2. Allow iv reuse even when a type conversion is required. For example, i8
+   and i32 load / store addressing modes are identical.
+
+
+//===---------------------------------------------------------------------===//
+
+This:
+
+int foo(int a, int b, int c, int d) {
+  long long acc = (long long)a * (long long)b;
+  acc += (long long)c * (long long)d;
+  return (int)(acc >> 32);
+}
+
+Should compile to use SMLAL (Signed Multiply Accumulate Long) which multiplies
+two signed 32-bit values to produce a 64-bit value, and accumulates this with
+a 64-bit value.
+
+We currently get this with both v4 and v6:
+
+_foo:
+        smull r1, r0, r1, r0
+        smull r3, r2, r3, r2
+        adds r3, r3, r1
+        adc r0, r2, r0
+        bx lr
+
+//===---------------------------------------------------------------------===//
+
+This:
+        #include <utility>
+        std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
+        { return std::make_pair(a + b, a + b < a); }
+        bool no_overflow(unsigned a, unsigned b)
+        { return !full_add(a, b).second; }
+
+Should compile to:
+
+_Z8full_addjj:
+        adds    r2, r1, r2
+        movcc   r1, #0
+        movcs   r1, #1
+        str     r2, [r0, #0]
+        strb    r1, [r0, #4]
+        mov     pc, lr
+
+_Z11no_overflowjj:
+        cmn     r0, r1
+        movcs   r0, #0
+        movcc   r0, #1
+        mov     pc, lr
+
+not:
+
+__Z8full_addjj:
+        add r3, r2, r1
+        str r3, [r0]
+        mov r2, #1
+        mov r12, #0
+        cmp r3, r1
+        movlo r12, r2
+        str r12, [r0, #+4]
+        bx lr
+__Z11no_overflowjj:
+        add r3, r1, r0
+        mov r2, #1
+        mov r1, #0
+        cmp r3, r0
+        movhs r1, r2
+        mov r0, r1
+        bx lr
+
+//===---------------------------------------------------------------------===//
+
+Some of the NEON intrinsics may be appropriate for more general use, either
+as target-independent intrinsics or perhaps elsewhere in the ARM backend.
+Some of them may also be lowered to target-independent SDNodes, and perhaps
+some new SDNodes could be added.
+
+For example, maximum, minimum, and absolute value operations are well-defined
+and standard operations, both for vector and scalar types.
+
+The current NEON-specific intrinsics for count leading zeros and count one
+bits could perhaps be replaced by the target-independent ctlz and ctpop
+intrinsics. It may also make sense to add a target-independent "ctls"
+intrinsic for "count leading sign bits". Likewise, the backend could use
+the target-independent SDNodes for these operations.
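+
+For reference, "count leading sign bits" can be expressed in terms of clz.
+A sketch (it uses the GCC/Clang __builtin_clz, not an existing LLVM
+intrinsic, and assumes an arithmetic right shift of signed values, as ARM
+compilers provide); the counting rule is the usual one, i.e. the number of
+consecutive bits below the sign bit that match it, which is what NEON's
+VCLS counts per element:
+
+#include <cstdint>
+
+static unsigned clz32(uint32_t x) {
+  return x ? __builtin_clz(x) : 32;  // define clz(0) == 32 for the formula
+}
+
+// Number of bits below the sign bit that are equal to it.
+unsigned cls32(int32_t x) {
+  // XORing with the sign-extended sign bit turns leading sign bits into
+  // leading zeros, so cls(x) == clz(x ^ (x >> 31)) - 1.
+  return clz32(static_cast<uint32_t>(x ^ (x >> 31))) - 1;
+}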
+
+ARMv6 has scalar saturating and halving adds and subtracts. The same
+intrinsics could possibly be used for both NEON's vector implementations of
+those operations and the ARMv6 scalar versions.
+
+//===---------------------------------------------------------------------===//
+
+ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
+ARMInstrInfo::commuteInstruction() to support it.
+
+//===---------------------------------------------------------------------===//
+
+Split out LDR (literal) from the normal ARM LDR instruction. Also consider
+splitting LDR into imm12 and so_reg forms. This allows us to clean up some
+code, e.g. ARMLoadStoreOptimizer does not need to look at LDR (literal) and
+LDR (so_reg), while ARMConstantIslandPass only needs to worry about
+LDR (literal).
+
+//===---------------------------------------------------------------------===//
+
+Constant island pass should make use of full range SoImm values for LEApcrel.
+Be careful though as the last attempt caused infinite looping on lencod.
+
+//===---------------------------------------------------------------------===//
+
+Predication issue. This function:
+
+extern unsigned array[ 128 ];
+int foo( int x ) {
+  int y;
+  y = array[ x & 127 ];
+  if ( x & 128 )
+    y = 123456789 & ( y >> 2 );
+  else
+    y = 123456789 & y;
+  return y;
+}
+
+compiles to:
+
+_foo:
+        and r1, r0, #127
+        ldr r2, LCPI1_0
+        ldr r2, [r2]
+        ldr r1, [r2, +r1, lsl #2]
+        mov r2, r1, lsr #2
+        tst r0, #128
+        moveq r2, r1
+        ldr r0, LCPI1_1
+        and r0, r2, r0
+        bx lr
+
+It would be better to do something like this, to fold the shift into the
+conditional move:
+
+        and r1, r0, #127
+        ldr r2, LCPI1_0
+        ldr r2, [r2]
+        ldr r1, [r2, +r1, lsl #2]
+        tst r0, #128
+        movne r1, r1, lsr #2
+        ldr r0, LCPI1_1
+        and r0, r1, r0
+        bx lr
+
+It saves an instruction and a register.
+
+//===---------------------------------------------------------------------===//
+
+It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
+with the same bottom half.
+
+//===---------------------------------------------------------------------===//
+
+Robert Muth started working on an alternate jump table implementation that
+does not put the tables in-line in the text. This is more like the llvm
+default jump table implementation. This might be useful sometime. Several
+revisions of patches are on the mailing list, beginning at:
+http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html
+
+//===---------------------------------------------------------------------===//
+
+Make use of the "rbit" instruction.
+
+//===---------------------------------------------------------------------===//
+
+Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
+to licm and cse the unnecessary load from cp#1.
+
+//===---------------------------------------------------------------------===//
+
+The CMN instruction sets the flags like an ADD instruction, while CMP sets
+them like a subtract. Therefore to be able to use CMN for comparisons other
+than the Z bit, we'll need additional logic to reverse the conditionals
+associated with the comparison. Perhaps a pseudo-instruction for the comparison,
+with a post-codegen pass to clean up and handle the condition codes?
+See PR5694 for testcase.
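+
+A concrete (hypothetical) case where this matters is a signed comparison
+against a negative constant that is only encodable in negated form:
+
+int keep(int x) {
+  return x > -5;   // "cmp x, #-5" is not encodable; "cmn x, #5" is
+}
+
+Here cmn sets the flags from x + 5, so the backend has to make sure the
+condition code it picks is still correct for the original x > -5 comparison,
+which is exactly the bookkeeping described above.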
+ +//===---------------------------------------------------------------------===// + +Given the following on armv5: +int test1(int A, int B) { + return (A&-8388481)|(B&8388480); +} + +We currently generate: + ldr r2, .LCPI0_0 + and r0, r0, r2 + ldr r2, .LCPI0_1 + and r1, r1, r2 + orr r0, r1, r0 + bx lr + +We should be able to replace the second ldr+and with a bic (i.e. reuse the +constant which was already loaded). Not sure what's necessary to do that. + +//===---------------------------------------------------------------------===// + +The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal: + +int a(int x) { return __builtin_bswap32(x); } + +a: + mov r1, #255, 24 + mov r2, #255, 16 + and r1, r1, r0, lsr #8 + and r2, r2, r0, lsl #8 + orr r1, r1, r0, lsr #24 + orr r0, r2, r0, lsl #24 + orr r0, r0, r1 + bx lr + +Something like the following would be better (fewer instructions/registers): + eor r1, r0, r0, ror #16 + bic r1, r1, #0xff0000 + mov r1, r1, lsr #8 + eor r0, r1, r0, ror #8 + bx lr + +A custom Thumb version would also be a slight improvement over the generic +version. + +//===---------------------------------------------------------------------===// + +Consider the following simple C code: + +void foo(unsigned char *a, unsigned char *b, int *c) { + if ((*a | *b) == 0) *c = 0; +} + +currently llvm-gcc generates something like this (nice branchless code I'd say): + + ldrb r0, [r0] + ldrb r1, [r1] + orr r0, r1, r0 + tst r0, #255 + moveq r0, #0 + streq r0, [r2] + bx lr + +Note that both "tst" and "moveq" are redundant. + +//===---------------------------------------------------------------------===// + +When loading immediate constants with movt/movw, if there are multiple +constants needed with the same low 16 bits, and those values are not live at +the same time, it would be possible to use a single movw instruction, followed +by multiple movt instructions to rewrite the high bits to different values. +For example: + + volatile store i32 -1, i32* inttoptr (i32 1342210076 to i32*), align 4, + !tbaa +!0 + volatile store i32 -1, i32* inttoptr (i32 1342341148 to i32*), align 4, + !tbaa +!0 + +is compiled and optimized to: + + movw r0, #32796 + mov.w r1, #-1 + movt r0, #20480 + str r1, [r0] + movw r0, #32796 @ <= this MOVW is not needed, value is there already + movt r0, #20482 + str r1, [r0] + +//===---------------------------------------------------------------------===//