* implement do-loop -> bdnz transform
* implement powerpc-64 for darwin

===-------------------------------------------------------------------------===

Support 'update' load/store instructions. These are cracked on the G5, but are

===-------------------------------------------------------------------------===

Should hint to the branch select pass that it doesn't need to print the second
unconditional branch, so we don't end up with things like:
        b .LBBl42__2E_expand_function_8_674 ; loopentry.24
        b .LBBl42__2E_expand_function_8_42 ; NewDefault
        b .LBBl42__2E_expand_function_8_42 ; NewDefault

The power of diet coke came up with a solution to this today:

We know the only two cases that can happen here are either:
a) we have a conditional branch followed by a fallthrough to the next BB
b) we have a conditional branch followed by an unconditional branch

We also invented the BRTWOWAY node to model (b).

Currently, these are modeled by the PPC_BRCOND node which is a 12-byte pseudo

However, for (a) we can bccinv directly to the fallthrough block, and for (b)
we will already have another unconditional branch after the conditional branch
(see SPASS case above), so we don't need BRTWOWAY at all, and can just codegen
PPC_BRCOND as

This will also allow us to selectively not run the ppc branch selector, by just
selecting the PPC_BRCOND pseudo directly to the correct conditional branch
instruction for small functions.
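
A minimal sketch of the choice a branch selector has to make, written in C for
illustration only. The names (BranchForm, fits_in_bc, select_brcond) are
hypothetical and are not LLVM's API. The sketch assumes the conditional branch
can reach a signed 16-bit byte displacement, so nearby targets take a single
conditional branch and far ones need the inverted branch plus an unconditional
branch.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not LLVM's actual API. */
typedef enum {
    BR_SHORT,    /* one conditional branch straight to the target */
    BR_INVERTED  /* inverted conditional branch over an unconditional b */
} BranchForm;

/* Assumption: bc encodes a 14-bit word displacement, which is a signed
   16-bit byte offset once the two low zero bits are appended. */
int fits_in_bc(int32_t byte_disp) {
    return byte_disp >= -32768 && byte_disp <= 32767;
}

/* Pick the cheapest form that can still reach the target. */
BranchForm select_brcond(int32_t byte_disp) {
    return fits_in_bc(byte_disp) ? BR_SHORT : BR_INVERTED;
}
```

In a small function every displacement fits, which is the case where the
selector pass could be skipped.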

===-------------------------------------------------------------------------===

        if (X == 0x12345678) bar();

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start. For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's. This is even more important in PIC

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html
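
A source-level sketch of the idea, in C. The globals and function names are
made up for illustration, and this only shows the before/after shape, not how
an IPO pass would perform the rewrite. After merging, one base address reaches
all three fields at fixed offsets.

```c
/* Before: three separate globals, each needing its own address computation. */
int g_counter;
int g_limit;
int g_flags;

/* After (hypothetical merged form): one struct, so one base address that
   can be computed once and reused for every field. */
struct merged_globals_t {
    int counter;
    int limit;
    int flags;
} g_merged;

int sum_separate(void) {
    return g_counter + g_limit + g_flags;
}

int sum_merged(void) {
    return g_merged.counter + g_merged.limit + g_merged.flags;
}
```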

===-------------------------------------------------------------------------===

Implement Newton-Raphson method for improving estimate instructions to the
correct accuracy, and implement divide as multiply by reciprocal when it has
more than one use. Itanium will want this too.
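
For reference, the refinement step can be sketched in C as below. This is only
an illustration under stated assumptions: the `estimate` argument stands in for
whatever a hardware estimate instruction (such as fres) would return, and the
function names are hypothetical. Each step computes x' = x * (2 - d * x), which
roughly doubles the number of correct bits.

```c
/* One Newton-Raphson step toward 1/d, starting from approximation x. */
double refine_recip(double d, double x) {
    return x * (2.0 - d * x);
}

/* Refine a rough estimate of 1/d with a fixed number of steps. */
double recip(double d, double estimate, int steps) {
    double x = estimate;
    for (int i = 0; i < steps; ++i)
        x = refine_recip(d, x);
    return x;
}

/* Divide as multiply-by-reciprocal; worthwhile when 1/d is reused. */
double div_by_recip(double n, double recip_d) {
    return n * recip_d;
}
```

The number of steps needed depends on how many bits the initial estimate
provides, so a real implementation would pick it per instruction and per type.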

===-------------------------------------------------------------------------===

#define ARRAY_LENGTH 16

        unsigned int field0 : 6;
        unsigned int field1 : 6;
        unsigned int field2 : 6;
        unsigned int field3 : 6;
        unsigned int field4 : 3;
        unsigned int field5 : 4;
        unsigned int field6 : 1;

        unsigned int field6 : 1;
        unsigned int field5 : 4;
        unsigned int field4 : 3;
        unsigned int field3 : 6;
        unsigned int field2 : 6;
        unsigned int field1 : 6;
        unsigned int field0 : 6;

typedef struct program_t {
        union bitfield array[ARRAY_LENGTH];

void AdjustBitfields(program* prog, unsigned int fmt1)
        prog->array[0].bitfields.field0 = fmt1;
        prog->array[0].bitfields.field1 = fmt1 + 1;

We currently generate:

        rlwinm r2, r2, 0, 0, 19
        rlwinm r5, r5, 6, 20, 25
        rlwimi r2, r4, 0, 26, 31

We should teach someone that or (rlwimi, rlwinm) with disjoint masks can be
turned into rlwimi (rlwimi)

The better codegen would be:

===-------------------------------------------------------------------------===

int %f1(int %a, int %b) {
        %tmp.1 = and int %a, 15 ; <int> [#uses=1]
        %tmp.3 = and int %b, 240 ; <int> [#uses=1]
        %tmp.4 = or int %tmp.3, %tmp.1 ; <int> [#uses=1]

without a copy. We make this currently:

        rlwinm r2, r4, 0, 24, 27
        rlwimi r2, r3, 0, 28, 31

The two-addr pass or RA needs to learn when it is profitable to commute an
instruction to avoid a copy AFTER the 2-addr instruction. The 2-addr pass
currently only commutes to avoid inserting a copy BEFORE the two addr instr.
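
To make the masks concrete, the sketch below emulates rlwinm and rlwimi in C
and checks that the two-instruction sequence shown above computes
(a & 15) | (b & 240), with a in r3 and b in r4. The helper names are
hypothetical; bit positions follow the PowerPC convention where bit 0 is the
most significant bit.

```c
#include <stdint.h>

/* Rotate left by n bits (n taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Mask with bits mb..me set, bit 0 being the MSB. Wraps when mb > me. */
uint32_t ppc_mask(unsigned mb, unsigned me) {
    uint32_t from_mb = 0xFFFFFFFFu >> mb;
    uint32_t to_me = 0xFFFFFFFFu << (31 - me);
    return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* rlwinm: rotate left, then AND with the mask. */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi: rotate left, then insert under the mask into the old ra. */
uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh, unsigned mb,
                unsigned me) {
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* The function from the IR above, written directly. */
uint32_t f1_reference(uint32_t a, uint32_t b) {
    return (a & 15u) | (b & 240u);
}

/* The same function as the two-instruction sequence. */
uint32_t f1_two_insns(uint32_t a, uint32_t b) {
    uint32_t r2 = rlwinm(b, 0, 24, 27); /* r2 = b & 240 */
    return rlwimi(r2, a, 0, 28, 31);    /* insert a & 15 into r2 */
}
```

Because rlwimi overwrites its first operand, which value ends up in that
operand decides whether a copy is needed, which is the commuting question
raised above.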

===-------------------------------------------------------------------------===

Compile offsets from allocas:

        %X = alloca { int, int }
        %Y = getelementptr {int,int}* %X, int 0, uint 1

into a single add, not two:

--> important for C++.

===-------------------------------------------------------------------------===

int test3(int a, int b) { return (a < 0) ? a : 0; }

should be branch-free code. LLVM is turning it into < 1 because of the RHS.
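
One branch-free formulation, sketched in C. This is an illustration rather
than the code the backend should necessarily emit; the function names are
hypothetical, and it assumes a 32-bit two's-complement int.

```c
/* Reference form from the text (the unused second argument is dropped). */
int test3_ref(int a) {
    return (a < 0) ? a : 0;
}

/* Branch-free: AND the value with a mask built from its sign bit. */
int test3_branchfree(int a) {
    unsigned sign = (unsigned)a >> 31; /* 1 if a < 0, else 0 */
    unsigned mask = 0u - sign;         /* all ones if a < 0, else 0 */
    return (int)((unsigned)a & mask);
}
```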

===-------------------------------------------------------------------------===

No loads or stores of the constants should be needed:

struct foo { double X, Y; };
void xxx(struct foo F);
void bar() { struct foo R = { 1.0, 2.0 }; xxx(R); }

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Have to go through an indirect stub if bar is external or linkonce. It would
be better to compile it as:

which only computes the address of bar once (instead of each time through the
stub). This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

PowerPC i1/setcc stuff (depends on subreg stuff):

Check out the PPC code we get for 'compare' in this testcase:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19672

oof. on top of not doing the logical crnand instead of (mfcr, mfcr,
invert, invert, or), we then have to compare it against zero instead of
using the value already in a CR!

that should be something like

        bne cr0, LBB_compare_4

        rlwinm r7, r7, 30, 31, 31
        rlwinm r8, r8, 30, 31, 31
        bne cr0, LBB_compare_4 ; loopexit

FreeBench/mason has a basic block that looks like this:

        %tmp.130 = seteq int %p.0__, 5 ; <bool> [#uses=1]
        %tmp.134 = seteq int %p.1__, 6 ; <bool> [#uses=1]
        %tmp.139 = seteq int %p.2__, 12 ; <bool> [#uses=1]
        %tmp.144 = seteq int %p.3__, 13 ; <bool> [#uses=1]
        %tmp.149 = seteq int %p.4__, 14 ; <bool> [#uses=1]
        %tmp.154 = seteq int %p.5__, 15 ; <bool> [#uses=1]
        %bothcond = and bool %tmp.134, %tmp.130 ; <bool> [#uses=1]
        %bothcond123 = and bool %bothcond, %tmp.139 ; <bool>
        %bothcond124 = and bool %bothcond123, %tmp.144 ; <bool>
        %bothcond125 = and bool %bothcond124, %tmp.149 ; <bool>
        %bothcond126 = and bool %bothcond125, %tmp.154 ; <bool>
        br bool %bothcond126, label %shortcirc_next.5, label %else.0

This is a particularly important case where handling CRs better will help.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10. That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5. The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.
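
The register assignment rule described above can be modeled with a short C
sketch. It is a simplification for illustration only: it handles just int and
double arguments, the names (ArgKind, assign_regs) are hypothetical, and it
assumes doubles are taken from f1 through f13 while still consuming two words
of the integer argument area.

```c
typedef enum { ARG_INT, ARG_DOUBLE } ArgKind;

/* Walks the argument list and returns how many integer arguments land in
   r3..r10. If regs_out is non-null, regs_out[i] receives the GPR number
   (for ints) or FPR number (for doubles), or -1 if the argument goes on
   the stack. */
int assign_regs(const ArgKind *args, int n, int *regs_out) {
    int word = 0;          /* next 4-byte slot in the argument area */
    int fpr = 1;           /* next floating point register (assumed f1..f13) */
    int ints_in_regs = 0;
    for (int i = 0; i < n; ++i) {
        int reg = -1;
        if (args[i] == ARG_INT) {
            if (word < 8) {        /* first 32 bytes map to r3..r10 */
                reg = 3 + word;
                ++ints_in_regs;
            }
            word += 1;
        } else {
            if (fpr <= 13)
                reg = fpr++;
            word += 2;             /* a double eats two GPR slots */
        }
        if (regs_out)
            regs_out[i] = reg;
    }
    return ints_in_regs;
}
```

With this model, foo(int, double, int) yields r3, f1, r6 as in the text, while
the reordered foo(int, int, double) yields r3, r4, f1. The benefit shows up
when leading doubles would otherwise push integers past r10.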

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad. Add something like a BIT_CONVERT to LLVM, then do an i-p transformation
that percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with

===-------------------------------------------------------------------------===

Generate lwbrx and other byteswapping load/store instructions when reasonable.
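
The kind of source pattern this refers to can be sketched in C as follows (the
function names are illustrative, not taken from any testcase). On a big-endian
PowerPC, a little-endian 32-bit load assembled from bytes is what a single
lwbrx computes, and a byte swap of a freshly loaded word is the same thing.

```c
#include <stdint.h>

/* Load a 32-bit little-endian value one byte at a time. */
uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Reverse the byte order of a 32-bit word. */
uint32_t bswap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```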

===-------------------------------------------------------------------------===

Implement TargetConstantVec, and set up PPC to custom lower ConstantVec into
TargetConstantVec's if it's one of the many forms that are algorithmically
computable using the spiffy altivec instructions.

===-------------------------------------------------------------------------===

        return b * 3; // ignore the fact that this is always 3.

into something not this:

        rlwinm r2, r2, 29, 31, 31
        bgt cr0, LBB1_2 ; UnifiedReturnBlock
        rlwinm r2, r2, 0, 31, 31
LBB1_2: ; UnifiedReturnBlock

In particular, the two compares (marked 1) could be shared by reversing one.
This could be done in the dag combiner, by swapping a BR_CC when a SETCC of the
same operands (but backwards) exists. In this case, this wouldn't save us
anything though, because the compares still wouldn't be shared.

===-------------------------------------------------------------------------===

The legalizer should lower this:

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296

into "if x.high == 0", not:

noticed in 2005-05-11-Popcount-ffs-fls.c.
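
The equivalence the legalizer should exploit can be checked with a small C
sketch (function names are hypothetical): an unsigned 64-bit value is below
4294967296, which is 2^32, exactly when its high 32-bit word is zero.

```c
#include <stdint.h>

/* The comparison as written in the IR above. */
int test_compare(uint64_t x) {
    return x < 4294967296ULL;
}

/* The desired lowering: look only at the high word. */
int test_high_word(uint64_t x) {
    uint32_t high = (uint32_t)(x >> 32);
    return high == 0;
}
```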

===-------------------------------------------------------------------------===

We should custom expand setcc instead of pretending that we have it. That
would allow us to expose the access of the crbit after the mfcr, allowing
that access to be trivially folded into other ops. A simple example:

int foo(int a, int b) { return (a < b) << 4; }

        rlwinm r2, r2, 29, 31, 31

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so this:

void bar(int b) { a = b; }
void foo(unsigned char *c) {

        lbz r2, lo16(_a+3)(r2)

===-------------------------------------------------------------------------===

We generate really bad code for this:

int f(signed char *a, _Bool b, _Bool c) {

===-------------------------------------------------------------------------===

int test(unsigned *P) { return *P >> 24; }

===-------------------------------------------------------------------------===

On the G5, logical CR operations are more expensive in their three
address form: ops that read/write the same register are half as expensive as
those that read from two registers that are different from their destination.

We should model this with two separate instructions. The isel should generate
the "two address" form of the instructions. When the register allocator
detects that it needs to insert a copy due to the two-addressness of the CR
logical op, it will invoke PPCInstrInfo::convertToThreeAddress. At this point
we can convert to the "three address" instruction, to save code space.

This only matters when we start generating cr logical ops.

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

void f(int a, int b, int *P) {
        *P = (a-b)>=0?(a-b):(b-a);
void g(int a, int b, int *P) {

Further, they should compile to something better than:

        bgt cr0, LBB2_2 ; entry

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).

===-------------------------------------------------------------------------===

Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector
registers, to generate better spill code.

===-------------------------------------------------------------------------===
int foo(int N, int ***W, int **TK, int X) {
        for (t = 0; t < N; ++t)
                for (i = 0; i < 4; ++i)
                        W[t / X][i][t % X] = TK[i][t];

We generate relatively atrocious code for this loop compared to gcc.