* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin
* use stfiwx in float->int

* Fold add and sub with constant into non-extern, non-weak addresses so this:
        lis r2, ha16(l2__ZTV4Cell)
        la r2, lo16(l2__ZTV4Cell)(r2)

        lis r2, ha16(l2__ZTV4Cell+8)
        la r2, lo16(l2__ZTV4Cell+8)(r2)

* Teach LLVM how to codegen this:
unsigned short foo(float a) { return a; }

        rlwinm r3, r2, 0, 16, 31

* Support 'update' load/store instructions.  These are cracked on the G5, but
  are still a codesize win.

* should hint to the branch select pass that it doesn't need to print the
  second unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674     ; loopentry.24
        b .LBBl42__2E_expand_function_8_42      ; NewDefault
        b .LBBl42__2E_expand_function_8_42      ; NewDefault

===-------------------------------------------------------------------------===

        if (X == 0x12345678) bar();

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

we generate:

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)

        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
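
A minimal C sketch of the lumping idea, not the code generator change itself.
The pool array X_pool and the function names are hypothetical; the point is
only that one base address plus small immediate offsets reaches every
constant.

```c
/* Hypothetical per-function constant pool: one object, one base address. */
const double X_pool[4] = { 1.23, 4.512, 2.34, 14.38 };

double X_lumped(double Y) {
    const double *base = X_pool;   /* materialized once */
    /* Each constant is now at a fixed byte offset from base: 0, 8, 16, 24. */
    return (Y * base[0] + base[1]) * base[2] + base[3];
}

/* The function from the note above, for comparison. */
double X_reference(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
```

With this layout each load can use a displacement off the same register, so
only one address materialization is needed per function.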

===-------------------------------------------------------------------------===

Implement the Newton-Raphson method for refining the results of estimate
instructions to the correct accuracy, and implement divide as multiply by
reciprocal when it has more than one use.  Itanium will want this too.
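
As a rough illustration of the refinement step, assuming the caller supplies a
crude starting estimate (standing in for a hardware estimate instruction), a
reciprocal can be improved like this.  The function names are hypothetical.

```c
/* One Newton-Raphson step for 1/d: x' = x * (2 - d*x).
   The relative error is roughly squared by each step. */
double refine_recip(double d, double x) {
    return x * (2.0 - d * x);
}

/* Refine a caller-supplied estimate of 1/d for a fixed number of steps. */
double recip(double d, double estimate, int steps) {
    double x = estimate;
    for (int i = 0; i < steps; i++)
        x = refine_recip(d, x);
    return x;
}

/* Divide as multiply by a reciprocal that can be shared by several uses. */
double div_by_recip(double n, double r) {
    return n * r;
}
```

Note that n * (1/d) is not guaranteed to round the same way as n / d, so a
real implementation has to decide when that difference is acceptable.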

===-------------------------------------------------------------------------===

#define  ARRAY_LENGTH  16

        unsigned int field0 : 6;
        unsigned int field1 : 6;
        unsigned int field2 : 6;
        unsigned int field3 : 6;
        unsigned int field4 : 3;
        unsigned int field5 : 4;
        unsigned int field6 : 1;

        unsigned int field6 : 1;
        unsigned int field5 : 4;
        unsigned int field4 : 3;
        unsigned int field3 : 6;
        unsigned int field2 : 6;
        unsigned int field1 : 6;
        unsigned int field0 : 6;

typedef struct program_t {
        union bitfield    array[ARRAY_LENGTH];

void AdjustBitfields(program* prog, unsigned int fmt1)
        unsigned int shift = 0;
        unsigned int texCount = 0;

        for (i = 0; i < 8; i++)
                prog->array[i].bitfields.field0 = texCount;
                prog->array[i].bitfields.field1 = texCount + 1;
                prog->array[i].bitfields.field2 = texCount + 2;
                prog->array[i].bitfields.field3 = texCount + 3;

                texCount += (fmt1 >> shift) & 0x7;

In the loop above, the bitfield adds get generated as
(add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.

Since the input to the (or and, and) is an (add) rather than a (shl), the shift
doesn't get folded into the rlwimi instruction.  We should ideally see through
things like this, rather than forcing llvm to generate the equivalent

(shl (add bitfield, C2), C1) with some kind of mask.
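
The two DAG shapes compute the same 32-bit value, which is why seeing through
the add is safe.  A small sketch in C (function names are illustrative only):

```c
#include <stdint.h>

/* (add (shl bitfield, C1), (shl C2, C1)) */
uint32_t add_of_shifts(uint32_t bitfield, uint32_t c2, unsigned c1) {
    return (bitfield << c1) + (c2 << c1);
}

/* (shl (add bitfield, C2), C1) */
uint32_t shift_of_add(uint32_t bitfield, uint32_t c2, unsigned c1) {
    return (bitfield + c2) << c1;
}
```

Both forms agree modulo 2^32 for any shift amount below 32, since a left shift
distributes over addition in wrapping arithmetic.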

===-------------------------------------------------------------------------===

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15         ; <int> [#uses=1]
        %tmp.3 = and int %b, 240                ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1          ; <int> [#uses=1]

without a copy.  We make this currently:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction.  The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.

===-------------------------------------------------------------------------===

176.gcc contains a bunch of code like this (this occurs dozens of times):

int %test(uint %mode.0.i.0) {
        %tmp.79 = cast uint %mode.0.i.0 to sbyte        ; <sbyte> [#uses=1]
        %tmp.80 = cast sbyte %tmp.79 to int             ; <int> [#uses=1]
        %tmp.81 = shl int %tmp.80, ubyte 16             ; <int> [#uses=1]
        %tmp.82 = and int %tmp.81, 16711680

        rlwinm r3, r2, 16, 8, 15

The extsb is obviously dead.  This can be handled by a future thing like
MaskedValueIsZero that checks to see if bits are ever demanded (in this case,
the sign bits are never used, so we can fold the sext_inreg to nothing).

I'm seeing code like this:

        rlwimi r4, r3, 16, 8, 15

in which the extsb is preventing the srwi from being nuked.
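
To see why the sign extension is dead, here is a C rendering of the IR above
with and without it.  The helper sext_byte is a hypothetical portable stand-in
for extsb.

```c
#include <stdint.h>

/* Sign-extend the low byte of x to 32 bits (what extsb computes). */
static uint32_t sext_byte(uint32_t x) {
    return ((x & 0xFFu) ^ 0x80u) - 0x80u;
}

/* As written in the IR: sext, shl 16, and 16711680 (0x00FF0000). */
uint32_t with_extsb(uint32_t mode) {
    return (sext_byte(mode) << 16) & 0x00FF0000u;
}

/* Only bits 0..7 of the input survive the mask, so the sext can go. */
uint32_t without_extsb(uint32_t mode) {
    return (mode << 16) & 0x00FF0000u;
}
```

The mask keeps only the bits that came from the low byte of the input, so the
bits produced by the sign extension are never demanded.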

===-------------------------------------------------------------------------===

Another example that occurs is:

uint %test(int %specbits.6.1) {
        %tmp.2540 = shr int %specbits.6.1, ubyte 11     ; <int> [#uses=1]
        %tmp.2541 = cast int %tmp.2540 to uint          ; <uint> [#uses=1]
        %tmp.2542 = shl uint %tmp.2541, ubyte 13        ; <uint> [#uses=1]
        %tmp.2543 = and uint %tmp.2542, 8192            ; <uint> [#uses=1]

        rlwinm r3, r2, 13, 18, 18

The srawi can be nuked by turning the SAR into a logical SHR (the sext bits are
dead), which I think can then be folded into the rlwinm.
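
A C sketch of the equivalence, with the arithmetic shift spelled out portably.
The function names are hypothetical, and test_folded is only one plausible
shape for the folded result (a single shift-and-mask, roughly what one rlwinm
could express).

```c
#include <stdint.h>

/* Arithmetic shift right on the unsigned representation (0 <= n < 32). */
static uint32_t sra32(uint32_t x, unsigned n) {
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

/* As written in the IR: arithmetic shr 11, shl 13, and 8192. */
uint32_t test_sra(uint32_t specbits) {
    return (sra32(specbits, 11) << 13) & 8192u;
}

/* Same thing with a logical shift: the sign-fill bits never reach bit 13. */
uint32_t test_srl(uint32_t specbits) {
    return ((specbits >> 11) << 13) & 8192u;
}

/* Folded form: bit 11 of the input moved straight to bit 13. */
uint32_t test_folded(uint32_t specbits) {
    return (specbits << 2) & 8192u;
}
```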

===-------------------------------------------------------------------------===

Compile offsets from allocas:

        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch free code.  LLVM is turning it into < 1 because of the RHS.
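
One possible branch-free form, shown in C as a sketch rather than as the code
LLVM should necessarily emit: build an all-ones mask from the sign bit and AND
it with the input.  The name test3_branchfree is hypothetical, and two's
complement integers are assumed.

```c
#include <stdint.h>

/* The original, with a select. */
int32_t test3(int32_t a, int32_t b) {
    (void)b;                       /* unused, as in the original */
    return (a < 0) ? a : 0;
}

/* Branch-free: mask is -1 when a is negative, 0 otherwise. */
int32_t test3_branchfree(int32_t a, int32_t b) {
    (void)b;
    int32_t mask = -(int32_t)((uint32_t)a >> 31);
    return a & mask;
}
```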

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

We have to go through an indirect stub if bar is external or linkonce.  It
would be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

Oof.  On top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

That should be something like

        bne cr0, LBB_compare_4

        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31

        bne cr0, LBB_compare_4  ; loopexit

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
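
The register assignment rule described above can be modeled in a few lines of
C.  This is a simplified sketch of just that rule (4 bytes per int, 8 per
double, GPRs r3 through r10 covering the first 32 bytes), not a full ABI
implementation; every name in it is hypothetical.

```c
#include <stddef.h>

enum arg_kind { ARG_INT, ARG_DOUBLE };

/* Returns the GPR number (3..10) holding integer argument `index`, or 0 if
   that argument is a double or lies beyond the first 32 argument bytes. */
int gpr_for_arg(const enum arg_kind *kinds, size_t count, size_t index) {
    unsigned offset = 0;
    for (size_t i = 0; i < count; i++) {
        if (i == index)
            return (kinds[i] == ARG_INT && offset < 32)
                       ? 3 + (int)(offset / 4) : 0;
        offset += (kinds[i] == ARG_DOUBLE) ? 8 : 4;
    }
    return 0;
}

/* foo(int, double, int) and its reordered form foo(int, int, double). */
const enum arg_kind foo_original[]  = { ARG_INT, ARG_DOUBLE, ARG_INT };
const enum arg_kind foo_reordered[] = { ARG_INT, ARG_INT, ARG_DOUBLE };

/* A made-up case where reordering matters: four doubles eat all 32 bytes. */
const enum arg_kind bar_original[]  =
    { ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_INT };
const enum arg_kind bar_reordered[] =
    { ARG_INT, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE };
```

In this model the trailing int of bar_original gets no GPR at all, while in
bar_reordered it lands in r3.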

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BIT_CONVERT to LLVM, then do an interprocedural
transformation that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with

===-------------------------------------------------------------------------===

Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
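
For reference, this is the kind of source pattern that a byte-reversed load or
store can cover on a big-endian target.  It is a generic C sketch with
hypothetical function names, not a description of what the backend matches
today.

```c
#include <stdint.h>

/* Load a 32-bit little-endian value one byte at a time. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value in little-endian byte order. */
void store_le32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v & 0xFFu);
    p[1] = (unsigned char)((v >> 8) & 0xFFu);
    p[2] = (unsigned char)((v >> 16) & 0xFFu);
    p[3] = (unsigned char)((v >> 24) & 0xFFu);
}
```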

===-------------------------------------------------------------------------===

Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.

===-------------------------------------------------------------------------===

double %test(double %X) {
        %Y = cast double %X to long
        %Z = cast long %Y to double

without the lwz/stw's.

===-------------------------------------------------------------------------===

        return b * 3; // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31

        bgt cr0, LBB1_2 ; UnifiedReturnBlock

        rlwinm r2, r2, 0, 31, 31

LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists.  In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296

into "if x.high == 0", not:

Noticed in 2005-05-11-Popcount-ffs-fls.c.
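
The equivalence the legalizer should exploit is easy to state in C.  The
function names below are illustrative.

```c
#include <stdint.h>

/* The comparison as written: x < 2^32 on a 64-bit unsigned value. */
int test_full(uint64_t x) {
    return x < 4294967296ULL;
}

/* The cheap form: only the high 32-bit half needs to be examined. */
int test_high_only(uint64_t x) {
    return (uint32_t)(x >> 32) == 0;
}
```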

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it.  That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops.  A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31
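
To illustrate the folding, here is a small C model of rlwinm (rotate left,
then AND with a mask running from bit mb to bit me, where bit 0 is the most
significant).  It assumes the extracted bit is then shifted left by 4, as the
C source implies; the folded operands (rotate 1, mask 27..27) are worked out
for this sketch and are not taken from compiler output.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Mask with ones from bit mb through bit me, IBM bit numbering. */
static uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;         /* bits mb..31 */
    uint32_t to_me   = 0xFFFFFFFFu << (31 - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & to_me) : (from_mb | to_me);
}

uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* Extract the CR bit, then shift it: two operations. */
uint32_t extract_then_shift(uint32_t cr) {
    return rlwinm(cr, 29, 31, 31) << 4;
}

/* Same result from one rlwinm once the bit access is exposed. */
uint32_t folded(uint32_t cr) {
    return rlwinm(cr, 1, 27, 27);
}
```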