//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for DIV,
which likewise produces two results (quotient and remainder). Another case is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this to
generic code.
//===---------------------------------------------------------------------===//

CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move. The X86
backend knows how to three-addressify this shift, but it appears the register
allocator isn't even asking it to do so in this case. We should investigate
why this isn't happening; it could have significant impact on other important
cases for X86 as well.
//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
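
As an illustration of the kind of decomposition such an algorithm performs,
a multiply by a small constant can often be rewritten as shifts and adds
(a sketch; the function names are purely illustrative):

int mul9(int x)  { return x * 9;  }   /* can lower to (x << 3) + x        */
int mul10(int x) { return x * 10; }   /* can lower to ((x << 2) + x) << 1 */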
//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):

long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:

        xorl %eax, %eax
        xorl %edx, %edx
        testb $32, %cl
        sete %al
        setne %dl
        shll %cl, %eax
        shll %cl, %edx

But that requires good 8-bit subreg support.

64-bit shifts (in general) expand to really bad code. Instead of using
cmovs, we should expand to a conditional branch like GCC produces.
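
For reference, a minimal C sketch of the branch-based two-word expansion
(assuming x is in [0, 63]; the function name is just for illustration):

unsigned long long shl1_64(unsigned x) {
  unsigned lo, hi;
  if (x < 32) {               /* one conditional branch, as GCC emits */
    lo = 1u << x;
    hi = 0;
  } else {
    lo = 0;
    hi = 1u << (x - 32);
  }
  return ((unsigned long long)hi << 32) | lo;
}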
//===---------------------------------------------------------------------===//

_Bool f(_Bool a) { return a!=1; }
//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html
//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
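
A minimal example of the hazard in question (a sketch; whether the stall
actually occurs depends on the microarchitecture):

short inc16(short x) {
  /* If codegen emits "addw $1, %ax", only the low 16 bits of %eax are
     written, creating a partial register update. Promoting the add to
     32 bits ("addl $1, %eax") and using only the low 16 bits of the
     result avoids the stall and is semantically equivalent here. */
  return x + 1;
}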
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post-register allocation.
//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret

however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
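
A defined-at-zero wrapper for the GCC builtin looks like this (a sketch; 32
is the assumed word width):

int clz_safe(unsigned x) {
  /* __builtin_clz(0) is undefined, so handle zero explicitly. */
  return x ? __builtin_clz(x) : 32;
}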
Another example (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}
//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.
//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.
//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
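
A minimal example of the missed case (a sketch; the function name is just
for illustration):

int lhs_load(int *p, int x) {
  /* Here the load naturally lands on the LHS of the compare:
     (setlt (load p), x). Swapping the operands to (setgt x, (load p))
     with the mirrored condition keeps the semantics and would let isel
     fold the load into the cmpl. */
  return *p < x;
}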
//===---------------------------------------------------------------------===//

How about intrinsics? An example is:

  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

which compiles to:

        pmuludq (%eax), %xmm0
        movl 8(%esp), %eax
        movdqa (%eax), %xmm1
        pmulhuw %xmm0, %xmm1
        movdqa %xmm1, (%ecx)

The transformation probably requires an X86 specific pass or a DAG combiner
target specific hook.
//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

on some processors (which ones?), it is more efficient to do this:

_test:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.
//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
   *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
   *target &= ~(1 << bit);
}
//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.
//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
generate the usual sequence that biases negative inputs before the arithmetic
shift (sarl $31 / shrl $29 / addl / sarl $3).

GCC knows several different ways to codegen it, one of which is this:

        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax

which is probably slower, but it's interesting at least :)
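
Both forms compute the same bias trick; in C it looks like this (a sketch,
assuming arithmetic right shift of signed values, as on x86):

int div8(int x) {
  /* For negative x, add 2^3 - 1 = 7 before shifting so the arithmetic
     shift rounds toward zero instead of toward negative infinity. */
  int bias = (int)((unsigned)(x >> 31) >> 29);   /* 7 if x < 0, else 0 */
  return (x + bias) >> 3;
}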
//===---------------------------------------------------------------------===//

The first BB of this code:

        %V = call bool %foo()
        br bool %V, label %T, label %F

It would be better to emit "cmp %al, 1" than an xor and test.
//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().
//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand-tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
  x * copysign(1.0, y) * copysign(1.0, z)
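
The two copysign factors only affect the sign bit of x, so one reasonable
target is a couple of integer bit operations. A minimal sketch, assuming
IEEE-754 64-bit doubles and finite values:

#include <string.h>

double f(double x, double y, double z) {
  unsigned long long xb, yb, zb;
  memcpy(&xb, &x, 8);
  memcpy(&yb, &y, 8);
  memcpy(&zb, &z, 8);
  /* Multiplying by copysign(1.0, y) and copysign(1.0, z) flips the sign
     of x exactly when the signs of y and z differ. */
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;
  memcpy(&x, &xb, 8);
  return x;
}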
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.
//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
entry:
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit
...
        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
        movl L_X$non_lazy_ptr, %edx
...
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit
...

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.
//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
//===---------------------------------------------------------------------===//

We are generating far worse code than gcc for this loop (X and Y are global
16-bit variables):

  for (i = 0; i < N; i++) { X = i; Y = i*4; }

We generate:

LBB1_1: #bb.preheader
        ...
        movl L_X$non_lazy_ptr, %esi
        ...
        movl L_Y$non_lazy_ptr, %edi
        ...

gcc generates:

        movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
        movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
        ...
        leal 0(,%edx,4), %eax
        ...

The problems:
1. Lack of post-regalloc LICM.
2. Poor sub-regclass support. That leads to inability to promote the 16-bit
   arithmetic op to 32-bit and make use of leal.
3. LSR is unable to reuse the IV for a different type (i16 vs. i32) even
   though the cast would be free.
//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes,
e.g. FR32 and FR64.
//===---------------------------------------------------------------------===//

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.
//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    ...
}

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues: 1) we are introducing a setcc between the result
of the intrinsic call and the select; 2) the intrinsic is expected to produce
an i32 value, so an any_extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.
//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case where we do worse on PPC.
//===---------------------------------------------------------------------===//

If shorter, we should use things like:

        movzwl %ax, %eax

instead of:

        andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).
//===---------------------------------------------------------------------===//

char foo(int x) { return x; }

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
the single-instruction sign-extending moves (movsbl etc.) instead of a
shift pair.
//===---------------------------------------------------------------------===//

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate poor code for this with llvmgcc4; we should be able to
generate much better code.

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks. It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot. We should handle the safe cases above much
nicer, while still handling the hard cases.

While true in general, in this specific case we could do better by promoting
load int + bitcast to float -> load float. This basically needs alignment info;
the code is already implemented (but disabled) in dag combine.
//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        subl $12, %esp
        movl L_foo$non_lazy_ptr, %eax
        movl (%eax), %eax
        call *%eax
        addl $12, %esp
        ret

The current isel scheme will not allow the load to be folded in the call since
the load's chain result is read by the callseq_start.
//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.
//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate, which would allow them to be
eliminated.
//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of machine instruction
operands? i.e. print as a 32-bit super-class register / 16-bit sub-class
register. Do this for the cases where a truncate / anyext is guaranteed to be
eliminated. For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.
//===---------------------------------------------------------------------===//

For this:

int test(int a) {
  return a * 3;
}

We currently emit:

        imull $3, 4(%esp), %eax

Perhaps this is really what we should generate? Is imull three or four
cycles? Note: ICC generates this:

        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load, so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimation to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.
//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.
//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not currently use.
//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {
entry:
        br label %cond_true

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
        %tmp2 = getelementptr int* %a, int %x.0.0
        %tmp3 = load int* %tmp2                 ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0     ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3            ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1               ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39              ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret int %tmp7
}

is pessimized by -loop-reduce and -indvars
//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
        float fl = (int) (u & 0xffff);
        float fh = (int) (u >> 16);
        float retval = fl + fh * 65536.0f;
        return retval;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
//===---------------------------------------------------------------------===//

When using the fastcc abi, align stack slots of arguments of type double on
8-byte boundaries to improve performance.
//===---------------------------------------------------------------------===//

int f(int a, int b) {
  if (a == 4 || a == 6)
    ...
}
//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b". For example, instead of:

void f(int X, int Y) {
  ...
}
//===---------------------------------------------------------------------===//

Currently we don't have elimination of redundant stack manipulations. Consider
this:

int %main() {
entry:
        call fastcc void %test1( )
        call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
        ret int 0
}

declare fastcc void %test1()

declare fastcc void %test2(sbyte*)

This currently compiles with an addl/subl pair between the two calls to
readjust the stack; the add/sub pair is really unneeded here.
//===---------------------------------------------------------------------===//

We currently compile sign_extend_inreg into two shifts:

long foo(long X) {
        return (long)(signed char)X;
}
//===---------------------------------------------------------------------===//

Consider the expansion of:

uint %test3(uint %X) {
        %tmp1 = rem uint %X, 255
        ret uint %tmp1
}

Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx

to avoid the copy. In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction. I guess this has
to be done at isel time based on the #uses of the mul?
//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable. Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary. In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb 1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movb %dh, 1809(%esp, %esi)
937  937 0x3d0a incl %esi
3    3   0x3d0b cmpb %bl, %dl
27   27  0x3d0d jnz 0x000062db <main+11707>
//===---------------------------------------------------------------------===//

In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
//===---------------------------------------------------------------------===//

This could be a single 16-bit load.

int f(char *p) {
    if ((p[0] == 1) & (p[1] == 2)) return 1;
    return 0;
}
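
The merged form would look like this (a sketch; assumes a little-endian
target where the 16-bit access is legal, as on x86):

#include <string.h>

int f_merged(char *p) {
    unsigned short v;
    memcpy(&v, p, sizeof v);   /* becomes a single 16-bit load */
    return v == 0x0201;        /* 1 in byte 0, 2 in byte 1 (little-endian) */
}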
//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.
//===---------------------------------------------------------------------===//

Start using the flags more. For example, compile:

int add_zf(int *x, int y, int a, int b) {
     if ((*x += y) == 0)
          return a;
     else
          return b;
}

so that the compare against zero reuses the flags the add already set, rather
than testing the result again. The same applies to the variant of add_zf that
tests the sign of the sum instead of zero.
//===---------------------------------------------------------------------===//

int foo(double X) { return isnan(X); }

The pxor in the code we currently generate for this is not needed; we could
compare the value against itself.
//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile f to a compare-and-branch sequence (jne into a
UnifiedReturnBlock) and f2 to a branchless sequence built around
"leal 1(%ecx,%eax), %eax", both of which are inferior to GCC's output.
//===---------------------------------------------------------------------===//

This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to a sequence that sets up a frame, tests X, and then
branches to a separate block that calls abort. It would be better to produce
code where the error path branches straight to the call and the hot path
stays fall-through.

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
the frame setup only on the path that makes the call. Both are useful in
different situations. Finally, it could be shrink-wrapped and tail called,
like this:

        pop %eax        # realign stack.
        jmp _abort      # tail call the no-return function

Though this probably isn't worth it.
//===---------------------------------------------------------------------===//

We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead. For example, on X86-64, compile:

int foo(int A, int B) {
  return A + 1;
}

to:

_foo:
        leal 1(%edi), %eax
        ret

instead of:

_foo:
        incl %edi
        movl %edi, %eax
        ret
//===---------------------------------------------------------------------===//

We use push/pop of stack space around calls in situations where we don't have
to. The call to f below produces:

        subl $16, %esp      <<<<<
        ...
        addl $16, %esp      <<<<<

The stack push/pop can be moved into the prolog/epilog. It does this because
it's building the frame pointer, but that should not be sufficient; only the
use of alloca should cause it to do this. (There are other issues shown by
this code, but this is one.)

typedef struct _range_t {
    ...
    unsigned char lut[];
} range_t;

struct _decode_t {
    ...
    const range_t*const*range;
    ...
};

typedef struct _decode_t decode_t;

extern int f(const decode_t* decode);

int decode_byte (const decode_t* decode) {
    if (decode->swap != 0)
        return f(decode);
    return 0;
}
//===---------------------------------------------------------------------===//

This:

#include <xmmintrin.h>
unsigned test(float f) {
    return _mm_cvtsi128_si32( (__m128i) _mm_set_ss( f ));
}

compiles to:

        movss 4(%esp), %xmm0
        movd %xmm0, %eax
        ret

It should compile to a move from the stack slot directly into eax. DAGCombine
has this xform, but it is currently disabled until the alignment fields of
the load/store nodes are trustworthy.