//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//
Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (a combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this into
generic legalization.

//===---------------------------------------------------------------------===//
This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X / Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//
Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
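
As an illustrative example (mine, not from the patch above), small constant
multiplies like these can be lowered to lea/shift/add sequences instead of an
imul:

int mul_by_9(int x)  { return x * 9;  }   /* one lea: x + x*8 */
int mul_by_40(int x) { return x * 40; }   /* lea for x*5, then a shift left by 3 */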
//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):

long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:
        xorl    %eax, %eax
        xorl    %edx, %edx
        testb   $32, %cl
        sete    %al
        setne   %dl
        sall    %cl, %eax
        sall    %cl, %edx

But that requires good 8-bit subreg support.

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.
//===---------------------------------------------------------------------===//

_Bool f(_Bool a) { return a!=1; }
//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html
//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
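
For illustration (my example, not from the note above), 16-bit arithmetic like
this writes only %ax; a later read of the full %eax can then stall on some
processors:

short add16(short a, short b) { return a + b; }   /* if emitted as addw, only %ax is written */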
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.
//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
        bsr %eax, DWORD PTR [%esp+4]
        bsf %eax, DWORD PTR [%esp+4]

however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.
//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 because they only update some of the
processor flags.
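
For example (illustrative), the increment below can be emitted either as incl
or as addl $1; on the P4 the add form avoids the partial flag update:

int bump(int x) { return x + 1; }   /* incl %eax is smaller; addl $1, %eax updates all flags */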
//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
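
An illustrative pair (my example): both functions compare against memory, but
only the operand order that matches the written pattern folds the load today;
the other could fold too if the combiner swapped the operands and inverted the
condition:

int f1(int x, int *p) { return x < *p; }   /* load is the RHS of the compare */
int f2(int x, int *p) { return *p < x; }   /* same compare with the load on the LHS */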
//===---------------------------------------------------------------------===//

How about intrinsics? An example is:

*res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

        pmuludq (%eax), %xmm0

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.
//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

on some processors (which ones?), it is more efficient to do this:

        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.
//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}
//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)

when we can spare a register. It reduces code size.
//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
generate:

GCC knows several different ways to codegen it, one of which is this:

which is probably slower, but it's interesting at least :)
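
For reference, the source-level case (my illustration); the subtlety is that
signed division rounds toward zero, so a plain arithmetic shift is wrong for
negative X and an adjustment is needed:

int div8(int x) { return x / 8; }   /* not just sarl $3: -1/8 must be 0, but sarl keeps -1 at -1 */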
//===---------------------------------------------------------------------===//

The first BB of this code:

        %V = call bool %foo()
        br bool %V, label %T, label %F

It would be better to emit "cmp %al, 1" than an xor and test.
//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().
//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.)
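
For example (illustrative), a copy like this would be better left as a libcall
than expanded inline to rep/movsl:

#include <string.h>

void bigcopy(char *dst, const char *src) {
  memcpy(dst, src, 4 * 1024 * 1024);   /* 4MB copy: let libc's tuned memcpy handle it */
}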
//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
        x * copysign(1.0, y) * copysign(1.0, z)
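
One plausible shape for the result (a sketch under my own assumptions, written
for doubles): the two copysign factors only contribute a sign, so the whole
expression is x with its sign bit xor'ed with the signs of y and z:

#include <stdint.h>
#include <string.h>

double f(double x, double y, double z) {
  uint64_t xi, yi, zi;
  memcpy(&xi, &x, 8); memcpy(&yi, &y, 8); memcpy(&zi, &z, 8);
  xi ^= (yi ^ zi) & 0x8000000000000000ULL;   /* flip x's sign iff y and z have different signs */
  memcpy(&x, &xi, 8);
  return x;
}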
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
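
A sketch of the idea (mine; assumes little-endian x86 and IEEE doubles):
instead of loading *y as a double just to read its sign, load its high 32 bits
as an integer:

#include <stdint.h>
#include <string.h>

double cs(double x, const double *y) {
  uint32_t hi;
  memcpy(&hi, (const char *)y + 4, 4);    /* high word of *y holds the sign bit */
  uint64_t xi;
  memcpy(&xi, &x, 8);
  xi = (xi & ~(1ULL << 63)) | ((uint64_t)(hi & 0x80000000u) << 32);
  memcpy(&x, &xi, 8);
  return x;
}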
//===---------------------------------------------------------------------===//

%X = weak global int 0

entry:
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

This compiles to:

        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        movl L_X$non_lazy_ptr, %edx
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a
target-dependent LICM pass or 2) making SelectionDAG represent the whole
function.
//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesort.
//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
//===---------------------------------------------------------------------===//

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.
//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and the select. 2) The intrinsic is expected to produce an i32
value, so an any_extend (which becomes a zero_extend) is added.

We probably need some kind of target DAG combine hook to fix this.
//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case where we do worse on PPC.
//===---------------------------------------------------------------------===//

If shorter, we should use things like:

        movzwl %ax, %eax

instead of:

        andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
//===---------------------------------------------------------------------===//

char foo(int x) { return x; }

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
//===---------------------------------------------------------------------===//

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
  *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

we should be able to generate:

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
more nicely, while still handling the hard cases.
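
For the {short,short} case mentioned above, an illustrative declaration (mine):

typedef struct ss { short a, b; } ss;
void g(ss P);   /* P fits in one 32-bit stack slot, so it is passed as a single integer chunk */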
//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )

        movl L_foo$non_lazy_ptr, %eax

The current isel scheme will not allow the load to be folded into the call
since the load's chain result is read by the callseq_start.
//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.
//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate, which would allow them to be
eliminated.
//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine instruction
operand? i.e., print as a 32-bit super-class register / 16-bit sub-class register.
Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.
//===---------------------------------------------------------------------===//

        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate? Is imull three or four
cycles? Note: ICC generates this:

        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.
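
For reference, the source being compiled here is presumably a small constant
multiply along these lines (my reconstruction, not the original test case):

int test(int x) { return x * 3; }   /* imull $3, 4(%esp), %eax  vs.  movl + leal (%eax,%eax,2), %eax */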
//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.
//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).
//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
        %tmp2 = getelementptr int* %a, int %x.0.0
        %tmp3 = load int* %tmp2                 ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0     ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3            ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1               ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39              ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret int %tmp7
}

is pessimized by -loop-reduce and -indvars
//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  return fh * 65536.0f + fl;
}

which compiles to:

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.
//===---------------------------------------------------------------------===//

int f(int a, int b) {
  if (a == 4 || a == 6)
//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

void f(int X, int Y) {
//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider:

        call fastcc void %test1( )
        call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )

declare fastcc void %test1()
declare fastcc void %test2(sbyte*)

This currently compiles to:

The add/sub pair is really unneeded here.
//===---------------------------------------------------------------------===//

We generate really bad code in some cases due to lowering SETCC/SELECT at
legalize time, which prevents the post-legalize dag combine pass from
understanding the code. As a silly example, this prevents us from folding

bool %test(ulong %x) {
        %tmp = setlt ulong %x, 4294967296
//===---------------------------------------------------------------------===//

We currently compile sign_extend_inreg into two shifts:

long test(long X) {
        return (long)(signed char)X;
}
//===---------------------------------------------------------------------===//

Consider the expansion of:

uint %test3(uint %X) {
        %tmp1 = rem uint %X, 255
        ret uint %tmp1
}

Currently it compiles to:

        movl $2155905153, %ecx

This could be "reassociated" into:

        movl $2155905153, %eax

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses of the mul?