//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===

Support 'update' load/store instructions.  These are cracked on the G5, but are

With preinc enabled, this:

long *%test4(long *%X, long *%dest) {
  %Y = getelementptr long* %X, int 4
  store long %A, long* %dest

with -sched=list-burr, I get:
===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        bne cr0, LBB1_83        ;bb420.i

The CBE manages to produce:

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

This is effectively a simple form of predication.
===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining estimate instructions to the
correct accuracy, and implement divide as multiply-by-reciprocal when the
reciprocal has more than one use.  Itanium will want this too.
===-------------------------------------------------------------------------===

int %f1(int %a, int %b) {
  %tmp.1 = and int %a, 15         ; <int> [#uses=1]
  %tmp.3 = and int %b, 240        ; <int> [#uses=1]
  %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]

without a copy.  We currently generate:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two-addr instr.
===-------------------------------------------------------------------------===

Compile offsets from allocas:

        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.
===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes' worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
===-------------------------------------------------------------------------===

Fix the Darwin FP-In-Integer-Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
===-------------------------------------------------------------------------===

  return b * 3;  // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
        rlwinm r2, r2, 0, 31, 31
LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.
===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31
===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

void bar(int b) { a = b; }
void foo(unsigned char *c) {

        lbz r2, lo16(_a+3)(r2)
===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {
===-------------------------------------------------------------------------===

int test(unsigned *P) { return *P >> 24; }

===-------------------------------------------------------------------------===
On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions.  The isel should generate
the "two address" form of the instructions.  When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress.  At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.
===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {

Further, they should compile to something better than:

        bgt cr0, LBB2_2 ; entry

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

We generate relatively atrocious code for this loop compared to gcc.

We could also strength-reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use.  Since LWA is cracked anyway, this would be a codesize
===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {

  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
===-------------------------------------------------------------------------===

Complete the "signed i32 to FP conversion using 64-bit registers"
transformation; good for PI.  See PPCISelLowering.cpp, this comment:

// FIXME: disable this lowered code.  This generates 64-bit register values,
// and we don't model the fact that the top part is clobbered by calls.  We
// need to flag these together so that the value isn't live across a call.
//setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code.  It
sounds like we need to get the 64-bit register classes going.
===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
  %tmp = bitcast %struct.B* %b to i32*          ; <uint*> [#uses=1]
  %tmp = load i32* %tmp                         ; <uint> [#uses=1]
  %tmp3 = bitcast %struct.B* %b to i32*         ; <uint*> [#uses=1]
  %tmp4 = load i32* %tmp3                       ; <uint> [#uses=1]
  %tmp8 = bitcast %struct.B* %b to i32*         ; <uint*> [#uses=2]
  %tmp9 = load i32* %tmp8                       ; <uint> [#uses=1]
  %tmp4.mask17 = shl i32 %tmp4, i8 1            ; <uint> [#uses=1]
  %tmp1415 = and i32 %tmp4.mask17, 2147483648   ; <uint> [#uses=1]
  %tmp.masked = and i32 %tmp, 2147483648        ; <uint> [#uses=1]
  %tmp11 = or i32 %tmp1415, %tmp.masked         ; <uint> [#uses=1]
  %tmp12 = and i32 %tmp9, 2147483647            ; <uint> [#uses=1]
  %tmp13 = or i32 %tmp12, %tmp11                ; <uint> [#uses=1]
  store i32 %tmp13, i32* %tmp8

        rlwimi r2, r4, 0, 0, 0

We could collapse a bunch of those ORs and ANDs and generate the following

        rlwinm r4, r2, 1, 0, 0
===-------------------------------------------------------------------------===

unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

        rlwinm r3, r3, 16, 0, 31

        rlwinm r3,r3,16,24,31
===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.  Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
===-------------------------------------------------------------------------===

  return X ? 524288 : 0;

        beq cr0, LBB1_2 ;entry

This sort of thing occurs a lot due to globalopt.
===-------------------------------------------------------------------------===

We currently compile 32-bit bswap:

declare i32 @llvm.bswap.i32(i32 %A)
define i32 @test(i32 %A) {
  %B = call i32 @llvm.bswap.i32(i32 %A)

        rlwinm r2, r3, 24, 16, 23
        rlwimi r2, r3, 8, 24, 31
        rlwimi r4, r3, 8, 8, 15
        rlwimi r4, r2, 0, 16, 31

it would be more efficient to produce:

        rlwinm r3,r3,8,0xffffffff
        rlwimi r3,r0,24,16,23

===-------------------------------------------------------------------------===