//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

* implement do-loop -> bdnz transform
* __builtin_return_address not supported on PPC

===-------------------------------------------------------------------------===
Support 'update' load/store instructions.  These are cracked on the G5, but are
still a codesize win.

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
        %Y = getelementptr long* %X, int 4
        %A = load long* %Y
        store long %A, long* %dest
        ret long* %Y
}
with -sched=list-burr, I get:
===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.
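The do-loop -> bdnz idea can be sketched in C (hypothetical example, not taken
from viterbi): rewrite a counted loop so each iteration ends in one pure
decrement-and-branch, which is exactly what CTR plus bdnz implements.

```c
#include <assert.h>

/* Counted-loop shape that maps onto bdnz: the induction variable is a
   pure count-down, so the loop back edge is a single decrement-and-
   branch-if-nonzero, i.e. one CTR-driven bdnz on PPC. */
int sum_1_to_n(int n) {
  int s = 0;
  for (int c = n; c > 0; --c)   /* the "bdnz" step: decrement, branch if != 0 */
    s += c;
  return s;
}
```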
===-------------------------------------------------------------------------===

This is effectively a simple form of predication.

===-------------------------------------------------------------------------===
Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
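A hand-applied C sketch of the squishing transformation, with made-up global
names: three separate globals each need their own GOT entry in PIC mode, while
one struct needs a single base address that can be CSE'd.

```c
#include <assert.h>

/* Before (hypothetical globals): each PIC access to sep_a/sep_b/sep_c
   needs its own GOT entry and indirection. */
int sep_a, sep_b, sep_c;
int sum_separate(void) { return sep_a + sep_b + sep_c; }

/* After squishing: one global struct, one GOT entry; the base address
   is computed once and members are reached at constant offsets. */
struct { int a, b, c; } squished;
int sum_squished(void) { return squished.a + squished.b + squished.c; }
```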
===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implementing divide as multiply by reciprocal when it has
more than one use.  Itanium will want this too.
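The refinement step in question, sketched in C: given r roughly equal to 1/x,
each Newton-Raphson iteration r' = r*(2 - x*r) squares the relative error,
roughly doubling the number of correct bits. `crude_estimate` is a stand-in
for a hardware estimate instruction such as PPC's fres.

```c
#include <assert.h>
#include <math.h>

/* Stand-in for a hardware reciprocal estimate (hypothetical: real fres
   gives roughly 5 correct bits; here we fake a ~1%-off seed). */
static double crude_estimate(double x) {
  return (1.0 / x) * 1.01;
}

double refined_recip(double x) {
  double r = crude_estimate(x);
  r = r * (2.0 - x * r);          /* relative error ~1e-2 -> ~1e-4 */
  r = r * (2.0 - x * r);          /* ~1e-4 -> ~1e-8 */
  r = r * (2.0 - x * r);          /* ~1e-8 -> full double precision */
  return r;
}
```

With three steps from a 1%-accurate seed, the result is accurate to the last
few ulps, so `x / y` with multiple uses can become `x * refined_recip(y)`.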
===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy.  We make this currently:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the 2-addr instr.
===-------------------------------------------------------------------------===

Compile offsets from allocas:

        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.
===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with

===-------------------------------------------------------------------------===
  return b * 3;  // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
        rlwinm r2, r2, 0, 31, 31
LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.
===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31

===-------------------------------------------------------------------------===
Fold add and sub with constant into non-extern, non-weak addresses so this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {

        lbz r2, lo16(_a+3)(r2)

===-------------------------------------------------------------------------===
We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {

===-------------------------------------------------------------------------===
int test(unsigned *P) { return *P >> 24; }

===-------------------------------------------------------------------------===
On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.
===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {

Further, they should compile to something better than:

        bgt cr0, LBB2_2 ; entry

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
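One branch-free way to get the abs-of-difference (a sketch, not what the
backend emits today): smear the sign of the difference into a mask with an
arithmetic shift, then use it to conditionally negate.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free |a - b|: the sign bit of the difference, replicated
   across all 32 bits, selects between d and -d via xor-and-subtract.
   Assumes arithmetic right shift on negative values (true on PPC and
   all mainstream compilers); on PPC this is subf/srawi/xor/subf. */
void f_branchfree(int32_t a, int32_t b, int32_t *P) {
  int32_t d = (int32_t)((uint32_t)a - (uint32_t)b);  /* a-b, wrap without UB */
  int32_t m = d >> 31;                               /* 0 if d >= 0, else -1 */
  *P = (d ^ m) - m;                                  /* conditional negate */
}
```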
===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;
  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
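The strength reduction would look like this in source terms (a sketch; names
are made up): carry a running quotient/remainder pair with the invariant
q == t / X and r == t % X, so the loop body contains no divide at all.

```c
#include <assert.h>

/* Strength-reduced form of the loop above: replace the per-iteration
   t / X and t % X with a running pair (q, r) that is bumped and
   wrapped at the bottom of the loop. */
void foo_reduced(int N, int ***W, int **TK, int X) {
  int q = 0, r = 0;                  /* invariant: q == t / X, r == t % X */
  for (int t = 0; t < N; ++t) {
    for (int i = 0; i < 4; ++i)
      W[q][i][r] = TK[i][t];
    if (++r == X) { r = 0; ++q; }    /* maintain the invariant */
  }
}

/* Self-check: confirms (q, r) tracks (t / X, t % X) for all t < N. */
int check_invariant(int N, int X) {
  int q = 0, r = 0;
  for (int t = 0; t < N; ++t) {
    if (q != t / X || r != t % X) return 0;
    if (++r == X) { r = 0; ++q; }
  }
  return 1;
}
```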
===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
win only.

===-------------------------------------------------------------------------===
We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned int code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}
===-------------------------------------------------------------------------===

Complete the signed i32 to FP conversion code using 64-bit registers
transformation, good for PI.  See PPCISelLowering.cpp, this comment:

        // FIXME: disable this lowered code.  This generates 64-bit register values,
        // and we don't model the fact that the top part is clobbered by calls.  We
        // need to flag these together so that the value isn't live across a call.
        //setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.
===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
        %tmp = bitcast %struct.B* %b to i32*            ; <uint*> [#uses=1]
        %tmp = load i32* %tmp                           ; <uint> [#uses=1]
        %tmp3 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=1]
        %tmp4 = load i32* %tmp3                         ; <uint> [#uses=1]
        %tmp8 = bitcast %struct.B* %b to i32*           ; <uint*> [#uses=2]
        %tmp9 = load i32* %tmp8                         ; <uint> [#uses=1]
        %tmp4.mask17 = shl i32 %tmp4, i8 1              ; <uint> [#uses=1]
        %tmp1415 = and i32 %tmp4.mask17, 2147483648     ; <uint> [#uses=1]
        %tmp.masked = and i32 %tmp, 2147483648          ; <uint> [#uses=1]
        %tmp11 = or i32 %tmp1415, %tmp.masked           ; <uint> [#uses=1]
        %tmp12 = and i32 %tmp9, 2147483647              ; <uint> [#uses=1]
        %tmp13 = or i32 %tmp12, %tmp11                  ; <uint> [#uses=1]
        store i32 %tmp13, i32* %tmp8
        ret void
}

        rlwimi r2, r4, 0, 0, 0

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

        rlwinm r4, r2, 1, 0, 0
===-------------------------------------------------------------------------===

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

        rlwinm r3, r3, 16, 0, 31

        rlwinm r3,r3,16,24,31

===-------------------------------------------------------------------------===
Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs. The class of LR would be
marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it.
===-------------------------------------------------------------------------===

  return X ? 524288 : 0;

        beq cr0, LBB1_2 ;entry

This sort of thing occurs a lot due to globalopt.
===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
        %B = call i32 @llvm.bswap.i32(i32 %A)
        ret i32 %B
}

to:

        rlwinm r2, r3, 24, 16, 23
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31

it would be more efficient to produce:

        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,16,23
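For reference, the shift-and-mask form of the same operation in C, i.e. what
the rlwinm/rlwimi sequences above are implementing:

```c
#include <assert.h>
#include <stdint.h>

/* Portable 32-bit byte swap, equivalent to @llvm.bswap.i32: each of
   the four bytes moves to the mirrored position. */
uint32_t bswap32(uint32_t x) {
  return (x >> 24) |
         ((x >> 8) & 0x0000FF00u) |
         ((x << 8) & 0x00FF0000u) |
         (x << 24);
}
```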
===-------------------------------------------------------------------------===

test/CodeGen/PowerPC/2007-03-24-cntlzd.ll compiles to:

__ZNK4llvm5APInt17countLeadingZerosEv:
        or r2, r2, r2       <<-- silly.

The dead or is a 'truncate' from 64 to 32 bits.

===-------------------------------------------------------------------------===