3 * implement do-loop -> bdnz transform
4 * implement powerpc-64 for darwin
6 ===-------------------------------------------------------------------------===
8 Use the stfiwx instruction for:
10 void foo(float a, int *b) { *b = a; }
12 ===-------------------------------------------------------------------------===
14 unsigned short foo(float a) { return a; }
26 rlwinm r3, r2, 0, 16, 31
29 ===-------------------------------------------------------------------------===
31 Support 'update' load/store instructions. These are cracked on the G5, but are
34 ===-------------------------------------------------------------------------===
36 Should hint to the branch select pass that it doesn't need to print the second
37 unconditional branch, so we don't end up with things like:
38 b .LBBl42__2E_expand_function_8_674 ; loopentry.24
39 b .LBBl42__2E_expand_function_8_42 ; NewDefault
40 b .LBBl42__2E_expand_function_8_42 ; NewDefault
42 ===-------------------------------------------------------------------------===
47 if (X == 0x12345678) bar();
63 ===-------------------------------------------------------------------------===
65 Lump the constant pool for each function into ONE pic object, and reference
66 pieces of it as offsets from the start. For functions like this (contrived
67 to have lots of constants obviously):
69 double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }
74 lis r2, ha16(.CPI_X_0)
75 lfd f0, lo16(.CPI_X_0)(r2)
76 lis r2, ha16(.CPI_X_1)
77 lfd f2, lo16(.CPI_X_1)(r2)
79 lis r2, ha16(.CPI_X_2)
80 lfd f1, lo16(.CPI_X_2)(r2)
81 lis r2, ha16(.CPI_X_3)
82 lfd f2, lo16(.CPI_X_3)(r2)
86 It would be better to materialize .CPI_X into a register, then use immediates
87 off of the register to avoid the lis's. This is even more important in PIC
90 Note that this (and the static variable version) is discussed here for GCC:
91 http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
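The idea of addressing one lumped pool from a single base can be sketched at the C level. This is only an illustration, not what the code generator emits: the array `X_constants` and the function names `X_pooled` and `X_plain` are hypothetical, and the real transformation would happen on the per-function constant pool in the backend.

```c
#include <assert.h>

/* Hypothetical lumped pool: all four constants of X live in one object. */
static const double X_constants[4] = { 1.23, 4.512, 2.34, 14.38 };

/* One base address is formed once; each constant is then a fixed offset. */
static double X_pooled(double Y) {
    const double *base = X_constants;
    return (Y * base[0] + base[1]) * base[2] + base[3];
}

/* The original form, where each literal gets its own pool entry. */
static double X_plain(double Y) {
    return (Y * 1.23 + 4.512) * 2.34 + 14.38;
}

/* Both forms compute the same value (up to floating-point rounding). */
static void check_pooled(void) {
    double diff = X_pooled(2.0) - X_plain(2.0);
    assert(diff < 1e-9 && diff > -1e-9);
}
```

With the pooled layout only one lis (or one PIC base computation) is needed, and each lfd uses an immediate displacement from that base.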
93 ===-------------------------------------------------------------------------===
95 Implement the Newton-Raphson method for refining the estimate instructions to the
96 correct accuracy, and implement divide as multiply by reciprocal when the
97 divisor has more than one use. Itanium will want this too.
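A minimal sketch of the refinement step, assuming a reciprocal estimate that is off by about 1%. `rough_recip` is a hypothetical stand-in for a hardware estimate instruction, and the step count is picked for this illustration only; the real number of steps depends on the accuracy of the estimate.

```c
#include <assert.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: about 1% off. */
static double rough_recip(double d) { return (1.0 / d) * 1.01; }

/* One Newton-Raphson step for f(x) = 1/x - d:  x' = x * (2 - d * x).
   Each step roughly squares the relative error. */
static double refine_recip(double d, double x) { return x * (2.0 - d * x); }

static double recip_refined(double d, int steps) {
    double x = rough_recip(d);
    for (int i = 0; i < steps; i++)
        x = refine_recip(d, x);
    return x;
}

/* Two divides by the same d share one refined reciprocal. */
static void divide_both(double a, double b, double d, double *qa, double *qb) {
    double r = recip_refined(d, 3);
    *qa = a * r;
    *qb = b * r;
}

static void check_recip(void) {
    double qa, qb;
    divide_both(6.0, 9.0, 3.0, &qa, &qb);
    assert(qa - 2.0 < 1e-9 && qa - 2.0 > -1e-9);
    assert(qb - 3.0 < 1e-9 && qb - 3.0 > -1e-9);
}
```

Starting from a relative error of 1e-2, three steps bring the error down to roughly 1e-4, 1e-8 and then the limit of double precision.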
99 ===-------------------------------------------------------------------------===
101 #define ARRAY_LENGTH 16
106 unsigned int field0 : 6;
107 unsigned int field1 : 6;
108 unsigned int field2 : 6;
109 unsigned int field3 : 6;
110 unsigned int field4 : 3;
111 unsigned int field5 : 4;
112 unsigned int field6 : 1;
114 unsigned int field6 : 1;
115 unsigned int field5 : 4;
116 unsigned int field4 : 3;
117 unsigned int field3 : 6;
118 unsigned int field2 : 6;
119 unsigned int field1 : 6;
120 unsigned int field0 : 6;
129 typedef struct program_t {
130 union bitfield array[ARRAY_LENGTH];
136 void AdjustBitfields(program* prog, unsigned int fmt1)
138 unsigned int shift = 0;
139 unsigned int texCount = 0;
142 for (i = 0; i < 8; i++)
144 prog->array[i].bitfields.field0 = texCount;
145 prog->array[i].bitfields.field1 = texCount + 1;
146 prog->array[i].bitfields.field2 = texCount + 2;
147 prog->array[i].bitfields.field3 = texCount + 3;
149 texCount += (fmt1 >> shift) & 0x7;
154 In the loop above, the bitfield adds get generated as
155 (add (shl bitfield, C1), (shl C2, C1)) where C2 is 1, 2 or 3.
157 Since the input to the (or and, and) is an (add) rather than a (shl), the shift
158 doesn't get folded into the rlwimi instruction. We should ideally see through
159 things like this, rather than forcing llvm to generate the equivalent
161 (shl (add bitfield, C2), C1) with some kind of mask.
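The two forms are interchangeable because a left shift distributes over addition in modular 32-bit arithmetic. A small C check of that identity follows; the function names are made up for this illustration, and unsigned arithmetic is used so that wraparound is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* The form the bitfield adds currently take: (add (shl x, C1), (shl C2, C1)). */
static uint32_t add_of_shifts(uint32_t x, uint32_t c2, unsigned c1) {
    return (x << c1) + (c2 << c1);
}

/* The equivalent form that exposes the shift: (shl (add x, C2), C1). */
static uint32_t shift_of_add(uint32_t x, uint32_t c2, unsigned c1) {
    return (x + c2) << c1;
}

/* Check the identity for C2 = 1, 2, 3 and every shift amount below 32. */
static void check_shift_distributes(void) {
    const uint32_t samples[] = { 0u, 1u, 0x3Fu, 0x12345678u, 0xFFFFFFFFu };
    for (unsigned s = 0; s < sizeof samples / sizeof samples[0]; s++)
        for (uint32_t c2 = 1; c2 <= 3; c2++)
            for (unsigned c1 = 0; c1 < 32; c1++)
                assert(add_of_shifts(samples[s], c2, c1) ==
                       shift_of_add(samples[s], c2, c1));
}
```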
163 ===-------------------------------------------------------------------------===
167 int %f1(int %a, int %b) {
168 %tmp.1 = and int %a, 15 ; <int> [#uses=1]
169 %tmp.3 = and int %b, 240 ; <int> [#uses=1]
170 %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]
174 without a copy. We make this currently:
177 rlwinm r2, r4, 0, 24, 27
178 rlwimi r2, r3, 0, 28, 31
182 The two-addr pass or RA needs to learn when it is profitable to commute an
183 instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
184 currently only commutes to avoid inserting a copy BEFORE the two addr instr.
186 ===-------------------------------------------------------------------------===
188 176.gcc contains a bunch of code like this (this occurs dozens of times):
190 int %test(uint %mode.0.i.0) {
191 %tmp.79 = cast uint %mode.0.i.0 to sbyte ; <sbyte> [#uses=1]
192 %tmp.80 = cast sbyte %tmp.79 to int ; <int> [#uses=1]
193 %tmp.81 = shl int %tmp.80, ubyte 16 ; <int> [#uses=1]
194 %tmp.82 = and int %tmp.81, 16711680
202 rlwinm r3, r2, 16, 8, 15
205 The extsb is obviously dead. This can be handled by a future thing like
206 MaskedValueIsZero that checks to see if bits are ever demanded (in this case,
207 the sign bits are never used, so we can fold the sext_inreg to nothing).
209 I'm seeing code like this:
213 rlwimi r4, r3, 16, 8, 15
215 in which the extsb is preventing the srwi from being nuked.
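Why the sign extension is dead in the first example can be checked in C: after the shift by 16 and the mask with 16711680 (0x00FF0000), only the low byte of the input survives, so the bits produced by the sign extension are never demanded. The helper names below are hypothetical and the sign extension is written out by hand with unsigned arithmetic to keep the behavior well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Truncate to 8 bits, sign-extend to 32, then shift and mask (as in %test). */
static uint32_t with_sext(uint32_t mode) {
    uint32_t low = mode & 0xFFu;
    uint32_t ext = (low & 0x80u) ? (low | 0xFFFFFF00u) : low;
    return (ext << 16) & 0x00FF0000u;
}

/* Same computation with the sign extension dropped. */
static uint32_t without_sext(uint32_t mode) {
    return (mode << 16) & 0x00FF0000u;
}

/* The two agree for every low byte, with or without junk in the high bits. */
static void check_sext_dead(void) {
    for (uint32_t m = 0; m < 256; m++) {
        assert(with_sext(m) == without_sext(m));
        assert(with_sext(m | 0xABCD0000u) == without_sext(m | 0xABCD0000u));
    }
}
```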
217 ===-------------------------------------------------------------------------===
219 Another example that occurs is:
221 uint %test(int %specbits.6.1) {
222 %tmp.2540 = shr int %specbits.6.1, ubyte 11 ; <int> [#uses=1]
223 %tmp.2541 = cast int %tmp.2540 to uint ; <uint> [#uses=1]
224 %tmp.2542 = shl uint %tmp.2541, ubyte 13 ; <uint> [#uses=1]
225 %tmp.2543 = and uint %tmp.2542, 8192 ; <uint> [#uses=1]
233 rlwinm r3, r2, 13, 18, 18
236 the srawi can be nuked by turning the SAR into a logical SHR (the sext bits are
237 dead), which I think can then be folded into the rlwinm.
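A C check of that claim, with the arithmetic shift emulated on unsigned values so the behavior is well defined. The function names are hypothetical. Only bit 11 of the input reaches the result, so the sign bits shifted in by the SAR are never observed, and the shift pair collapses to one left shift by 2 under the same mask.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right, emulated with unsigned operations. */
static uint32_t sra(uint32_t x, unsigned n) {
    uint32_t logical = x >> n;
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return logical | fill;
}

/* As written in %test: sar 11, shl 13, and 8192. */
static uint32_t with_sra(uint32_t x) { return (sra(x, 11) << 13) & 8192u; }

/* With the SAR turned into a logical SHR. */
static uint32_t with_srl(uint32_t x) { return ((x >> 11) << 13) & 8192u; }

/* With the two shifts folded into one shift-and-mask. */
static uint32_t folded(uint32_t x) { return (x << 2) & 8192u; }

static void check_sra_dead(void) {
    const uint32_t samples[] = { 0u, 0x800u, 0x80000000u, 0x80000800u,
                                 0x7FFFF7FFu, 0xFFFFFFFFu, 0x12345678u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(with_sra(samples[i]) == with_srl(samples[i]));
        assert(with_srl(samples[i]) == folded(samples[i]));
    }
}
```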
239 ===-------------------------------------------------------------------------===
241 Compile offsets from allocas:
244 %X = alloca { int, int }
245 %Y = getelementptr {int,int}* %X, int 0, uint 1
249 into a single add, not two:
256 --> important for C++.
258 ===-------------------------------------------------------------------------===
260 int test3(int a, int b) { return (a < 0) ? a : 0; }
262 should be branch free code. LLVM is turning it into < 1 because of the RHS.
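One branch-free form, sketched in C: build an all-ones mask from the sign bit and AND it with the input. The function names are made up for this illustration, and the final conversion back to a signed type assumes the usual two's complement behavior.

```c
#include <assert.h>
#include <stdint.h>

/* The source form: (a < 0) ? a : 0. */
static int32_t select_branchy(int32_t a) { return (a < 0) ? a : 0; }

/* Branch-free form: mask is all ones when a is negative, zero otherwise. */
static int32_t select_branch_free(int32_t a) {
    uint32_t sign = (uint32_t)a >> 31;   /* 1 if negative, else 0 */
    uint32_t mask = 0u - sign;           /* 0xFFFFFFFF or 0 */
    return (int32_t)((uint32_t)a & mask);
}

static void check_select(void) {
    const int32_t samples[] = { INT32_MIN, -1000, -1, 0, 1, 1000, INT32_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(select_branchy(samples[i]) == select_branch_free(samples[i]));
}
```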
264 ===-------------------------------------------------------------------------===
266 No loads or stores of the constants should be needed:
268 struct foo { double X, Y; };
269 void xxx(struct foo F);
270 void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }
272 ===-------------------------------------------------------------------------===
274 Darwin Stub LICM optimization:
280 Have to go through an indirect stub if bar is external or linkonce. It would
281 be better to compile it as:
286 which only computes the address of bar once (instead of each time through the
287 stub). This is Darwin specific and would have to be done in the code generator.
288 Probably not a win on x86.
290 ===-------------------------------------------------------------------------===
292 PowerPC i1/setcc stuff (depends on subreg stuff):
294 Check out the PPC code we get for 'compare' in this testcase:
295 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672
297 Oof. On top of not doing the logical crnand instead of (mfcr, mfcr,
298 invert, invert, or), we then have to compare the result against zero instead of
299 using the value already in a CR!
301 That should be something like
305 bne cr0, LBB_compare_4
313 rlwinm r7, r7, 30, 31, 31
314 rlwinm r8, r8, 30, 31, 31
320 bne cr0, LBB_compare_4 ; loopexit
322 ===-------------------------------------------------------------------------===
324 Simple IPO for argument passing, change:
325 void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)
327 the Darwin ABI specifies that any integer arguments in the first 32 bytes worth
328 of arguments get assigned to r3 through r10. That is, if you have a function
329 foo(int, double, int) you get r3, f1, r6, since the 64 bit double ate up the
330 argument bytes for r4 and r5. The trick then would be to shuffle the argument
331 order for functions we can internalize so that the maximum number of
332 integers/pointers get passed in regs before you see any of the fp arguments.
334 Instead of implementing this, it would actually probably be easier to just
335 implement a PPC fastcc, where we could do whatever we wanted to the CC,
336 including having this work sanely.
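The effect of reordering can be seen with a small model of the rule described above. This is a simplified sketch built only from that description (4 bytes per int, 8 per double, GPRs r3 through r10 shadowing the first 32 argument bytes); `assign_gprs` and `arg_kind` are hypothetical names, and the model ignores everything else in the real ABI.

```c
#include <assert.h>

enum arg_kind { ARG_INT, ARG_DOUBLE };

/* For each argument, record the GPR number an int lands in (0 means it is
   not in a GPR). Returns how many int arguments were passed in GPRs. */
static int assign_gprs(const enum arg_kind *args, int n, int *gpr_out) {
    int offset = 0, in_regs = 0;
    for (int i = 0; i < n; i++) {
        if (args[i] == ARG_INT) {
            gpr_out[i] = (offset < 32) ? 3 + offset / 4 : 0;
            if (gpr_out[i])
                in_regs++;
            offset += 4;
        } else {
            gpr_out[i] = 0;  /* goes to an FPR but still eats 8 argument bytes */
            offset += 8;
        }
    }
    return in_regs;
}

static void check_assign(void) {
    int regs[5];

    /* foo(int, double, int): r3, f1, r6 as described in the text. */
    enum arg_kind foo[3] = { ARG_INT, ARG_DOUBLE, ARG_INT };
    assert(assign_gprs(foo, 3, regs) == 2);
    assert(regs[0] == 3 && regs[2] == 6);

    /* Four doubles use up all 32 bytes, so a trailing int misses the GPRs. */
    enum arg_kind late[5] = { ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_INT };
    assert(assign_gprs(late, 5, regs) == 0);

    /* Moving the int to the front puts it in r3. */
    enum arg_kind early[5] = { ARG_INT, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE, ARG_DOUBLE };
    assert(assign_gprs(early, 5, regs) == 1 && regs[0] == 3);
}
```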
338 ===-------------------------------------------------------------------------===
340 Fix Darwin FP-In-Integer Registers ABI
342 Darwin passes doubles in structures in integer registers, which is very very
343 bad. Add something like a BIT_CONVERT to LLVM, then do an interprocedural
344 transformation that percolates these things out of functions.
346 Check out how horrible this is:
347 http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html
349 This is an extension of "interprocedural CC unmunging" that can't be done with
352 ===-------------------------------------------------------------------------===
354 Code Gen IPO optimization:
356 Squish small scalar globals together into a single global struct, allowing the
357 address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
358 of the GOT on targets with one).
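At the source level the transformation looks roughly like the following. The variable and function names are hypothetical, and a real implementation would do this as an IPO pass on internal globals rather than by rewriting C.

```c
#include <assert.h>

/* Before: three separate globals; under PIC each access needs its own
   address computation. */
static int hits, misses, evictions;

static int total_separate(void) { return hits + misses + evictions; }

/* After: one merged object; a single base address can be CSE'd and the
   fields are reached at fixed offsets from it. */
static struct { int hits, misses, evictions; } merged;

static int total_merged(void) {
    return merged.hits + merged.misses + merged.evictions;
}

static void check_merged(void) {
    hits = merged.hits = 3;
    misses = merged.misses = 4;
    evictions = merged.evictions = 5;
    assert(total_separate() == total_merged());
}
```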
360 ===-------------------------------------------------------------------------===
362 Generate lwbrx and other byteswapping load/store instructions when reasonable.
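A typical source pattern that could map onto a byte-reversed load is a little-endian read assembled from individual bytes. The sketch below shows that idiom; `load_le32` is a hypothetical name, and whether a given pattern is actually matched is up to the code generator.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit little-endian value byte by byte. On a big-endian PowerPC
   this is the same as a word load with the bytes reversed. */
static uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

static void check_load_le32(void) {
    const unsigned char bytes[4] = { 0x78, 0x56, 0x34, 0x12 };
    assert(load_le32(bytes) == 0x12345678u);
}
```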
364 ===-------------------------------------------------------------------------===
366 Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
367 TargetConstantVec's if it's one of the many forms that are algorithmically
368 computable using the spiffy altivec instructions.
370 ===-------------------------------------------------------------------------===
374 double %test(double %X) {
375 %Y = cast double %X to long
376 %Z = cast long %Y to double
393 without the lwz/stw's.
395 ===-------------------------------------------------------------------------===
402 return b * 3; // ignore the fact that this is always 3.
408 into something not this:
413 rlwinm r2, r2, 29, 31, 31
415 bgt cr0, LBB1_2 ; UnifiedReturnBlock
417 rlwinm r2, r2, 0, 31, 31
420 LBB1_2: ; UnifiedReturnBlock
424 In particular, the two compares (marked 1) could be shared by reversing one.
425 This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
426 same operands (but backwards) exists. In this case, this wouldn't save us
427 anything though, because the compares still wouldn't be shared.
429 ===-------------------------------------------------------------------------===
431 The legalizer should lower this:
433 bool %test(ulong %x) {
434 %tmp = setlt ulong %x, 4294967296
438 into "if x.high == 0", not:
454 noticed in 2005-05-11-Popcount-ffs-fls.c.
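The equivalence the legalizer should exploit is easy to state in C: an unsigned 64-bit value is below 4294967296 (2^32) exactly when its high word is zero. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The comparison as written in %test. */
static int lt_2_32(uint64_t x) { return x < 4294967296ULL; }

/* The cheaper form: test only the high 32 bits. */
static int high_is_zero(uint64_t x) { return (uint32_t)(x >> 32) == 0; }

static void check_high_word(void) {
    const uint64_t samples[] = { 0ULL, 1ULL, 0xFFFFFFFFULL, 0x100000000ULL,
                                 0x100000001ULL, 0xFFFFFFFFFFFFFFFFULL };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(lt_2_32(samples[i]) == high_is_zero(samples[i]));
}
```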
457 ===-------------------------------------------------------------------------===
459 We should custom expand setcc instead of pretending that we have it. That
460 would allow us to expose the access of the crbit after the mfcr, allowing
461 that access to be trivially folded into other ops. A simple example:
463 int foo(int a, int b) { return (a < b) << 4; }
470 rlwinm r2, r2, 29, 31, 31
474 ===-------------------------------------------------------------------------===
476 Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
477 precision don't matter (-ffast-math). Misc/mandel will like this. :)
479 ===-------------------------------------------------------------------------===
481 Fold add and sub with constant into non-extern, non-weak addresses so this:
484 void bar(int b) { a = b; }
485 void foo(unsigned char *c) {
502 lbz r2, lo16(_a+3)(r2)