add doesn't need to overflow between the two 16-bit chunks.
* Implement pre/post increment support. (e.g. PR935)
-* Coalesce stack slots!
* Implement smarter constant generation for binops with large immediates.
-* Consider materializing FP constants like 0.0f and 1.0f using integer
- immediate instructions then copy to FPU. Slower than load into FPU?
+A few ARMv6T2 ops should be pattern matched: BFI, SBFX, and UBFX
+
+Interesting optimization for PIC codegen on arm-linux:
+http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43129
//===---------------------------------------------------------------------===//
//===---------------------------------------------------------------------===//
-We currently compile abs:
-int foo(int p) { return p < 0 ? -p : p; }
-
-into:
-
-_foo:
- rsb r1, r0, #0
- cmn r0, #1
- movgt r1, r0
- mov r0, r1
- bx lr
-
-This is very, uh, literal. This could be a 3 operation sequence:
- t = (p sra 31);
- res = (p xor t)-t
-
-Which would be better. This occurs in png decode.
-
-//===---------------------------------------------------------------------===//
-
More load / store optimizations:
1) Better representation for block transfer? This is from Olden/power:
4) Once we added support for multiple result patterns, write indexed loads
patterns instead of C++ instruction selection code.
-5) Use FLDM / FSTM to emulate indexed FP load / store.
+5) Use VLDM / VSTM to emulate indexed FP load / store.
//===---------------------------------------------------------------------===//
//===---------------------------------------------------------------------===//
-More register scavenging work:
-
-1. Use the register scavenger to track frame index materialized into registers
- (those that do not fit in addressing modes) to allow reuse in the same BB.
-2. Finish scavenging for Thumb.
-
-//===---------------------------------------------------------------------===//
-
More LSR enhancements possible:
1. Teach LSR about pre- and post- indexed ops to allow iv increment be merged
ARM::MOVCCr is commutable (by flipping the condition). But we need to implement
ARMInstrInfo::commuteInstruction() to support it.
+
+//===---------------------------------------------------------------------===//
+
+Split out LDR (literal) from normal ARM LDR instruction. Also consider spliting
+LDR into imm12 and so_reg forms. This allows us to clean up some code. e.g.
+ARMLoadStoreOptimizer does not need to look at LDR (literal) and LDR (so_reg)
+while ARMConstantIslandPass only need to worry about LDR (literal).
+
+//===---------------------------------------------------------------------===//
+
+Constant island pass should make use of full range SoImm values for LEApcrel.
+Be careful though as the last attempt caused infinite looping on lencod.
+
+//===---------------------------------------------------------------------===//
+
+Predication issue. This function:
+
+extern unsigned array[ 128 ];
+int foo( int x ) {
+ int y;
+ y = array[ x & 127 ];
+ if ( x & 128 )
+ y = 123456789 & ( y >> 2 );
+ else
+ y = 123456789 & y;
+ return y;
+}
+
+compiles to:
+
+_foo:
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ mov r2, r1, lsr #2
+ tst r0, #128
+ moveq r2, r1
+ ldr r0, LCPI1_1
+ and r0, r2, r0
+ bx lr
+
+It would be better to do something like this, to fold the shift into the
+conditional move:
+
+ and r1, r0, #127
+ ldr r2, LCPI1_0
+ ldr r2, [r2]
+ ldr r1, [r2, +r1, lsl #2]
+ tst r0, #128
+ movne r1, r1, lsr #2
+ ldr r0, LCPI1_1
+ and r0, r1, r0
+ bx lr
+
+it saves an instruction and a register.
+
+//===---------------------------------------------------------------------===//
+
+It might be profitable to cse MOVi16 if there are lots of 32-bit immediates
+with the same bottom half.
+
+//===---------------------------------------------------------------------===//
+
+Robert Muth started working on an alternate jump table implementation that
+does not put the tables in-line in the text. This is more like the llvm
+default jump table implementation. This might be useful sometime. Several
+revisions of patches are on the mailing list, beginning at:
+http://lists.cs.uiuc.edu/pipermail/llvmdev/2009-June/022763.html
+
+//===---------------------------------------------------------------------===//
+
+Make use of the "rbit" instruction.
+
+//===---------------------------------------------------------------------===//
+
+Take a look at test/CodeGen/Thumb2/machine-licm.ll. ARM should be taught how
+to licm and cse the unnecessary load from cp#1.
+
+//===---------------------------------------------------------------------===//
+
+The CMN instruction sets the flags like an ADD instruction, while CMP sets
+them like a subtract. Therefore to be able to use CMN for comparisons other
+than the Z bit, we'll need additional logic to reverse the conditionals
+associated with the comparison. Perhaps a pseudo-instruction for the comparison,
+with a post-codegen pass to clean up and handle the condition codes?
+See PR5694 for testcase.
+
+//===---------------------------------------------------------------------===//
+
+Given the following on armv5:
+int test1(int A, int B) {
+ return (A&-8388481)|(B&8388480);
+}
+
+We currently generate:
+ ldr r2, .LCPI0_0
+ and r0, r0, r2
+ ldr r2, .LCPI0_1
+ and r1, r1, r2
+ orr r0, r1, r0
+ bx lr
+
+We should be able to replace the second ldr+and with a bic (i.e. reuse the
+constant which was already loaded). Not sure what's necessary to do that.
+
+//===---------------------------------------------------------------------===//
+
+The code generated for bswap on armv4/5 (CPUs without rev) is less than ideal:
+
+int a(int x) { return __builtin_bswap32(x); }
+
+a:
+ mov r1, #255, 24
+ mov r2, #255, 16
+ and r1, r1, r0, lsr #8
+ and r2, r2, r0, lsl #8
+ orr r1, r1, r0, lsr #24
+ orr r0, r2, r0, lsl #24
+ orr r0, r0, r1
+ bx lr
+
+Something like the following would be better (fewer instructions/registers):
+ eor r1, r0, r0, ror #16
+ bic r1, r1, #0xff0000
+ mov r1, r1, lsr #8
+ eor r0, r1, r0, ror #8
+ bx lr
+
+A custom Thumb version would also be a slight improvement over the generic
+version.
+
+//===---------------------------------------------------------------------===//
+
+Consider the following simple C code:
+
+void foo(unsigned char *a, unsigned char *b, int *c) {
+ if ((*a | *b) == 0) *c = 0;
+}
+
+currently llvm-gcc generates something like this (nice branchless code I'd say):
+
+ ldrb r0, [r0]
+ ldrb r1, [r1]
+ orr r0, r1, r0
+ tst r0, #255
+ moveq r0, #0
+ streq r0, [r2]
+ bx lr
+
+Note that both "tst" and "moveq" are redundant.
+
+//===---------------------------------------------------------------------===//
+