//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

* implement do-loop -> bdnz transform

===-------------------------------------------------------------------------===
We only produce the rlwnm instruction for rotate instructions. We should
at least match stuff like:

unsigned rot_and(unsigned X, int Y) {
  unsigned T = (X << Y) | (X >> (32 - Y));
  T &= 127;   // the rlwinm 0,25,31 below is this 'and' (mask of the low 7 bits)
  return T;
}

        rlwnm r2, r3, r4, 0, 31
        rlwinm r3, r2, 0, 25, 31

... which is the basic pattern that should be written in the instr. It may
also be useful for stuff like:

long long foo2(long long X, int C) {

which currently produces:

        rlwinm r2, r5, 0, 27, 25
===-------------------------------------------------------------------------===

Support 'update' load/store instructions. These are cracked on the G5, but are
still a codesize win.

===-------------------------------------------------------------------------===
Teach the .td file to pattern match PPC::BR_COND to the appropriate bc variant,
so we don't have to always run the branch selector for small functions.
===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
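As a rough C-level illustration of the transformation (all names here are
hypothetical, not from any actual pass):

```c
/* Before (hypothetical): three separate globals, each access needing its
   own PIC/GOT address computation. */
int counter, limit, flags;

/* After: one aggregate, one base address; each field becomes a constant
   offset from that base, so the address computation can be CSE'd. */
struct Globals {
    int counter;
    int limit;
    int flags;
} G = {1, 2, 3};

int sum_globals(void) {
    /* All three loads share the single address of G. */
    return G.counter + G.limit + G.flags;
}
```

The win is that the address of G is computed once and reused, instead of one
PIC sequence per scalar global.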
===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining estimate instructions to the
correct accuracy, and implement divide as multiply-by-reciprocal when the
divisor has more than one use. Itanium will want this too.
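For reference, the standard Newton-Raphson step for a reciprocal estimate
r ~= 1/d is r' = r * (2 - d*r); the relative error roughly squares each step,
so a low-precision hardware estimate (fres on PPC) needs only a couple of
steps for full float accuracy. A minimal C sketch, starting from a
caller-supplied estimate rather than the real estimate instruction:

```c
/* One Newton-Raphson refinement step for a reciprocal estimate r ~= 1/d:
   r' = r * (2 - d*r).  The relative error roughly squares per step. */
float refine_recip(float d, float r) {
    return r * (2.0f - d * r);
}

/* Divide as multiply-by-reciprocal: refine a crude estimate twice, then
   multiply.  A real lowering would start from the fres estimate. */
float div_via_recip(float n, float d, float estimate) {
    float r = refine_recip(d, refine_recip(d, estimate));
    return n * r;
}
```

With an estimate of 0.2 for 1/4, one step gives 0.24 and two steps 0.2496,
illustrating the quadratic convergence.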
===-------------------------------------------------------------------------===

Compile this:

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240        ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1  ; <int> [#uses=1]
        ret int %tmp.4
}

without a copy. We make this currently:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.
===-------------------------------------------------------------------------===

Compile offsets from allocas:

%X = alloca { int, int }
%Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.
===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code. LLVM is turning it into < 1 because of the RHS.
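One branch-free formulation, as a sketch: with a 32-bit int and an arithmetic
right shift (implementation-defined in C, but exactly what the backend itself
can emit as srawi), a >> 31 is all ones when a is negative and zero otherwise,
so it can mask the result directly:

```c
/* Branch-free (a < 0) ? a : 0, assuming 32-bit int with arithmetic
   (sign-propagating) right shift: a >> 31 is -1 when a < 0, else 0. */
int test3_branchfree(int a, int b) {
    (void)b;                 /* b is unused, as in the original */
    return a & (a >> 31);
}
```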
===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

loops like this:

    for (...)  bar();

Have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

    fp = &bar_stub;
    for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.
===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof. on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like

        bne cr0, LBB_compare_4

        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31

        bne cr0, LBB_compare_4  ; loopexit
FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5          ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6          ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12         ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13         ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14         ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15         ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130         ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139     ; <bool> [#uses=1]
        %bothcond124 = and bool %bothcond123, %tmp.144  ; <bool> [#uses=1]
        %bothcond125 = and bool %bothcond124, %tmp.149  ; <bool> [#uses=1]
        %bothcond126 = and bool %bothcond125, %tmp.154  ; <bool> [#uses=1]
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.
===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done
with just fastcc.

===-------------------------------------------------------------------------===
        return b * 3;   // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31

        bgt cr0, LBB1_2 ; UnifiedReturnBlock

        rlwinm r2, r2, 0, 31, 31

LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.
===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296
        ret bool %tmp
}

into "if x.high == 0", not:

noticed in 2005-05-11-Popcount-ffs-fls.c.
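In C terms, the desired lowering just tests the high word (a sketch, with a
64-bit unsigned long long standing in for ulong):

```c
/* setlt ulong %x, 4294967296 asks whether x < 2^32, which is exactly
   "the high 32 bits of x are zero". */
int high_word_is_zero(unsigned long long x) {
    return (unsigned int)(x >> 32) == 0;
}
```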
===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31
===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

void bar(int b) { a = b; }
void foo(unsigned char *c) {

        lbz r2, lo16(_a+3)(r2)

===-------------------------------------------------------------------------===
359 We generate really bad code for this:
361 int f(signed char *a, _Bool b, _Bool c) {
367 ===-------------------------------------------------------------------------===
370 int test(unsigned *P) { return *P >> 24; }
385 ===-------------------------------------------------------------------------===
On the G5, logical CR operations are more expensive in their three-address
form: ops that read/write the same register are half as expensive as those
that read from two registers that are different from their destination.

We should model this with two separate instructions. The isel should generate
the "two address" form of the instructions. When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.
===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

        bgt cr0, LBB2_2 ; entry

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf
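Since t advances by one each iteration, the div and rem in the loop above can
be strength-reduced to two counters maintained incrementally. A hand-written
sketch of the transformation (with hypothetical output arrays standing in for
the W/TK accesses):

```c
/* Computes q_out[t] = t / X and r_out[t] = t % X for t in [0, N) without
   any divide or rem inside the loop: r counts up to X and carries into q. */
void div_rem_strength_reduced(int N, int X, int q_out[], int r_out[]) {
    int q = 0, r = 0;
    for (int t = 0; t < N; ++t) {
        q_out[t] = q;
        r_out[t] = r;
        if (++r == X) {   /* wrap the remainder, bump the quotient */
            r = 0;
            ++q;
        }
    }
}
```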
===-------------------------------------------------------------------------===

float foo(float X) { return (int)(X); }

We could use a target dag combine to turn the lwz/extsw into an lwa when the
lwz has a single use. Since LWA is cracked anyway, this would be a codesize
win only.
===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned int code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}
Complete the 'signed i32 to FP conversion using 64-bit registers'
transformation, good for PI. See PPCISelLowering.cpp, this comment:

// FIXME: disable this lowered code. This generates 64-bit register values,
// and we don't model the fact that the top part is clobbered by calls. We
// need to flag these together so that the value isn't live across a call.
//setOperationAction(ISD::SINT_TO_FP, MVT::i32, Custom);

Also, if the registers are spilled to the stack, we have to ensure that all
64 bits of them are saved/restored, otherwise we will miscompile the code. It
sounds like we need to get the 64-bit register classes going.
===-------------------------------------------------------------------------===

%struct.B = type { ubyte, [3 x ubyte] }

void %foo(%struct.B* %b) {
        %tmp = cast %struct.B* %b to uint*      ; <uint*> [#uses=1]
        %tmp = load uint* %tmp                  ; <uint> [#uses=1]
        %tmp3 = cast %struct.B* %b to uint*     ; <uint*> [#uses=1]
        %tmp4 = load uint* %tmp3                ; <uint> [#uses=1]
        %tmp8 = cast %struct.B* %b to uint*     ; <uint*> [#uses=2]
        %tmp9 = load uint* %tmp8                ; <uint> [#uses=1]
        %tmp4.mask17 = shl uint %tmp4, ubyte 1  ; <uint> [#uses=1]
        %tmp1415 = and uint %tmp4.mask17, 2147483648    ; <uint> [#uses=1]
        %tmp.masked = and uint %tmp, 2147483648         ; <uint> [#uses=1]
        %tmp11 = or uint %tmp1415, %tmp.masked          ; <uint> [#uses=1]
        %tmp12 = and uint %tmp9, 2147483647             ; <uint> [#uses=1]
        %tmp13 = or uint %tmp12, %tmp11                 ; <uint> [#uses=1]
        store uint %tmp13, uint* %tmp8
        ret void
}

        rlwimi r2, r4, 0, 0, 0

We could collapse a bunch of those ORs and ANDs and generate the following:

        rlwinm r4, r2, 1, 0, 0
===-------------------------------------------------------------------------===

On PPC64, this results in a truncate followed by a truncstore. These should
be combined into a single (truncating) store:

void foo(unsigned long H) { G = H; }

===-------------------------------------------------------------------------===
unsigned test6(unsigned x) {
  return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}

        rlwinm r3, r3, 16, 0, 31

        rlwinm r3, r3, 16, 24, 31