//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (a combination of MUL and MULH[SU] in one node). Add these to
X86, and make the dag combiner produce them when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this to

//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have a significant impact on other important
cases for X86 as well.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

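Presumably the body of test is just the 64/32 division; a sketch of the
function together with the overflow concern (an illustrative example, not
necessarily the original test case):

unsigned test(unsigned long long X, unsigned Y) {
  /* x86 divl raises #DE when the quotient does not fit in 32 bits
     (e.g. X = 1ULL << 40, Y = 1), so the single-instruction lowering is
     only safe when the quotient is known or assumed to fit. */
  return X / Y;
}
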
//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):

long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:

But that requires good 8-bit subreg support.

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

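A sketch of the branch-based expansion for a variable 64-bit left shift,
operating on 32-bit halves (a minimal illustration, assuming 0 <= n < 64):

void shl64(unsigned *lo, unsigned *hi, unsigned n) {
  if (n >= 32) {        /* the low word moves entirely into the high word */
    *hi = *lo << (n - 32);
    *lo = 0;
  } else if (n != 0) {
    *hi = (*hi << n) | (*lo >> (32 - n));
    *lo = *lo << n;
  }
}
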
//===---------------------------------------------------------------------===//

_Bool f(_Bool a) { return a!=1; }

//===---------------------------------------------------------------------===//

1. Dynamic programming based approach when compile time is not an issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?

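A minimal illustration (hypothetical example) of where the choice matters:
16-bit arithmetic that, if selected as i16 operations, repeatedly writes %ax
and can stall later full-register reads on some processors.

short sum16(short *p, int n) {
  short s = 0;
  int i;
  for (i = 0; i < n; i++)
    s += p[i];        /* i16 adds; candidates for promotion to i32 */
  return s;
}
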
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until after register allocation.

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel

        bsr %eax, DWORD PTR [%esp+4]

        bsf %eax, DWORD PTR [%esp+4]

however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.

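A sketch of the definedness issue: GCC documents __builtin_clz/__builtin_ctz
as undefined for a zero input (bsf/bsr leave the destination undefined there),
so a guard is needed if a value defined at 0 is required:

int clz_defined(unsigned X) { return X ? __builtin_clz(X) : 32; }
int ctz_defined(unsigned X) { return X ? __builtin_ctz(X) : 32; }
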
Another example (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some of the
processor flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.

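A small example of the pattern (illustrative only):

int f(int x, int *p) { return x < *p; }  /* load naturally on the RHS: cmp x, [p]; setl */
int g(int *p, int x) { return *p < x; }  /* load on the LHS; same as x > *p, so it can
                                            still fold as cmp x, [p] with the condition swapped */
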
//===---------------------------------------------------------------------===//

How about intrinsics? An example is:

  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

        pmuludq (%eax), %xmm0

The transformation probably requires an X86-specific pass or a target-specific
DAG combiner hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

on some processors (which ones?), it is more efficient to do this:

Doing this correctly is tricky though, as the xor clobbers the flags.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)

It might be better to generate

when we can spare a register. It reduces code size.

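For reference, the call being discussed is presumably something like the
following (the wrapper is hypothetical); 10 bytes of 0x01 explain the
0x01010101 (= 16843009) word stores:

#include <string.h>

void set_ones(char *p) { memset(p, 1, 10); }
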
//===---------------------------------------------------------------------===//

Evaluate the best way to codegen sdiv X, (2^C). For X/8, we currently
generate:

GCC knows several different ways to codegen it, one of which is this:

which is probably slower, but it's interesting at least :)

//===---------------------------------------------------------------------===//

The first BB of this code:

        %V = call bool %foo()
        br bool %V, label %T, label %F

It would be better to emit "cmp %al, 1" than an xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc.).

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
  x * copysign(1.0, y) * copysign(1.0, z)

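The identity behind it: copysign(1.0, y) * copysign(1.0, z) is always +/-1.0,
so only the xor of the sign bits of y and z matters. A minimal sketch of the
source pattern (assuming the usual libm copysign):

#include <math.h>

double f(double x, double y, double z) {
  /* could become: flip x's sign bit iff signbit(y) != signbit(z) */
  return x * copysign(1.0, y) * copysign(1.0, z);
}
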
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//

%X = weak global int 0

        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader

        movl L_X$non_lazy_ptr, %edx

        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//

We are generating far worse code than gcc:

        for (i = 0; i < N; i++) { X = i; Y = i*4; }

LBB1_1: #bb.preheader

        movl L_X$non_lazy_ptr, %esi

        movl L_Y$non_lazy_ptr, %edi

        movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
        movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx

        leal 0(,%edx,4), %eax

1. Lack of post-regalloc LICM.
2. Poor sub-regclass support. That leads to an inability to promote the 16-bit
   arithmetic op to 32-bit and make use of leal.
3. LSR is unable to reuse the IV for a different type (i16 vs. i32) even though
   the cast would be free.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /

//===---------------------------------------------------------------------===//

Obviously it would have been better for the first mov (or any op) to store
directly to 0(%esp) if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
are a number of issues. 1) We are introducing a setcc between the result of the
intrinsic call and the select. 2) The intrinsic is expected to produce an i32
value, so an any_extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:

        movzwl %ax, %eax

instead of:

        andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

char foo(int x) { return x; }

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of

//===---------------------------------------------------------------------===//

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {

We currently generate this code with llvmgcc4:

we should be able to generate:

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
more cleanly, while still handling the hard cases.

While true in general, in this specific case we could do better by promoting
load int + bitcast to float -> load float. This basically needs alignment info;
the code is already implemented (but disabled) in the dag combiner.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )

        movl L_foo$non_lazy_ptr, %eax

The current isel scheme will not allow the load to be folded into the call since
the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate, which would allow them to be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine instruction
operand? i.e. print as a 32-bit super-class register / 16-bit sub-class register.
Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//

        imull $3, 4(%esp), %eax

Perhaps this is what we should really generate. Is imull three or four
cycles? Note: ICC generates this:

        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimation to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
        %tmp2 = getelementptr int* %a, int %x.0.0
        %tmp3 = load int* %tmp2                 ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0     ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3            ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1               ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39              ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true

is pessimized by -loop-reduce and -indvars

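Roughly the C source this loop corresponds to (an approximate reconstruction,
for reference only):

int foo(int *a, int t) {
  int x;
  for (x = 0; x < 40; x++)
    t += x + a[x];
  return t;
}
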
//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align the stack slot of a double argument on an
8-byte boundary to improve performance.

//===---------------------------------------------------------------------===//

int f(int a, int b) {
  if (a == 4 || a == 6)

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

void f(int X, int Y) {

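As a self-contained illustration of the kind of rewrite meant (a hypothetical
example, separate from f above): when both arms are constants, the select can
often be folded into the setcc result with a shift and add instead of a branch
or cmov.

int sel(int x, int y) {
  return x < y ? 5 : 1;   /* equivalent to ((x < y) << 2) + 1 */
}
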
//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider
the code:

        call fastcc void %test1( )
        call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )

declare fastcc void %test1()

declare fastcc void %test2(sbyte*)

This currently compiles to:

The add/sub pair is really unneeded here.

//===---------------------------------------------------------------------===//

We currently compile sign_extend_inreg into two shifts:

        return (long)(signed char)X;

//===---------------------------------------------------------------------===//

Consider the expansion of:

uint %test3(uint %X) {
        %tmp1 = rem uint %X, 255

Currently it compiles to:

        movl $2155905153, %ecx

This could be "reassociated" into:

        movl $2155905153, %eax

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses of the mul?

//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
937  937 0x3d0a incl     %esi
3    3   0x3d0b cmpb     %bl, %dl
27   27  0x3d0d jnz      0x000062db <main+11707>

//===---------------------------------------------------------------------===//

In C99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

  if ((p[0] == 1) & (p[1] == 2)) return 1;

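A sketch of the combined form, assuming p points at bytes and an unaligned
16-bit load is acceptable: on little-endian x86 the two byte compares become
one 16-bit compare against 0x0201.

#include <string.h>

int both_match(const unsigned char *p) {
  unsigned short v;
  memcpy(&v, p, sizeof v);   /* the single 16-bit load */
  return v == 0x0201;
}
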
//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.

//===---------------------------------------------------------------------===//

Start using the flags more. For example, compile:

int add_zf(int *x, int y, int a, int b) {

int add_zf(int *x, int y, int a, int b) {

//===---------------------------------------------------------------------===//

int foo(double X) { return isnan(X); }

the pxor is not needed; we could compare the value against itself.

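The self-compare form of the same check (a NaN is the only value that is not
equal to itself), which needs no constant and no pxor:

int foo2(double X) { return X != X; }
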
//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

        jne LBB1_2      #UnifiedReturnBlock

LBB1_2: #UnifiedReturnBlock

        leal 1(%ecx,%eax), %eax

both of which are inferior to GCC's:

//===---------------------------------------------------------------------===//

is currently compiled to:

It would be better to produce:

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

Both are useful in different situations. Finally, it could be shrink-wrapped
and tail called, like this:

        pop %eax   # realign stack.

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//

We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead. For example, on X86-64, compile:

int foo(int A, int B) {

//===---------------------------------------------------------------------===//

We use push/pop of stack space around calls in situations where we don't have
to. The call to f below produces:

        subl $16, %esp      <<<<<

        addl $16, %esp      <<<<<

The stack push/pop can be moved into the prolog/epilog. It does this because
it's building the frame pointer, but this should not be sufficient; only the
use of alloca should cause it to do this.
(There are other issues shown by this code, but this is one.)

typedef struct _range_t {

    unsigned char lut[];

    const range_t*const*range;

typedef struct _decode_t decode_t;

extern int f(const decode_t* decode);

int decode_byte (const decode_t* decode) {
  if (decode->swap != 0)