//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//
Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node).  Add this to
X86, & make the dag combiner produce it when needed.  This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }
by using the EAX result from the mul.  We should add a similar node for
DIVREM.
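An illustrative sketch (not part of the original note): a DIVREM node would
let code like this use a single idiv, since the instruction already produces
the quotient in EAX and the remainder in EDX:

void divrem(int X, int Y, int *D, int *R) {
  *D = X / Y;   /* quotient ends up in EAX */
  *R = X % Y;   /* remainder ends up in EDX */
}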
Another example is:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this to
generic code.
//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer.  What about overflow
though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224

//===---------------------------------------------------------------------===//
Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html

//===---------------------------------------------------------------------===//
Improve code like this (occurs fairly frequently, e.g. in LLVM):

long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.
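For illustration (examples added here, not from the original note), these are
the usual variable-mask idioms, valid for 0 <= X < 64:

unsigned long long low_mask(unsigned X)  { return ~0ULL >> X; }  /* top X bits cleared */
unsigned long long high_mask(unsigned X) { return ~0ULL << X; }  /* low X bits cleared */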
One better solution for 1LL << x is:
        xorl    %eax, %eax
        xorl    %edx, %edx
        testb   $32, %cl
        sete    %al
        setne   %dl
        sall    %cl, %eax
        sall    %cl, %edx

But that requires good 8-bit subreg support.
64-bit shifts (in general) expand to really bad code.  Instead of using
cmovs, we should expand to a conditional branch like GCC produces.

//===---------------------------------------------------------------------===//
_Bool f(_Bool a) { return a!=1; }
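A reading of this (truncated) entry: since a _Bool is always 0 or 1, the
compare should fold to a single xor, as in this sketch:

_Bool f(_Bool a) { return a ^ 1; }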
//===---------------------------------------------------------------------===//

Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure.  E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//
Should we promote i16 to i32 to avoid partial register update stalls?
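For illustration (example added here): a 16-bit op writes only the low half of
the register, so a later full-width read may stall on the partial write:

short f(short a, short b) { return a + b; }   /* addw writes %ax; a later read of %eax can stall */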
//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator.  Delay codegen until post register allocation.

//===---------------------------------------------------------------------===//
Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }
$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret
however, check that these are defined for 0 and 32.  Our intrinsics are, GCC's
aren't.
Another example (use predsimplify to eliminate a select):

int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}

//===---------------------------------------------------------------------===//
Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing.  Need to investigate.

//===---------------------------------------------------------------------===//
Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor.  They are slower on the P4 due to only updating some processor
bits.

//===---------------------------------------------------------------------===//
The instruction selector sometimes misses folding a load into a compare.  The
pattern is written as (cmp reg, (load p)).  Because the compare isn't
commutative, it is not matched with the load on both sides.  The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
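A sketch of the idea (example added here, not from the original note):

int f(int x, int *p) { return *p < x; }

gives (setlt (load p), x), with the load on the LHS where the written pattern
cannot match it; rewriting it as (setgt x, (load p)) moves the load to the RHS
at the cost of swapping the condition code.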
//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

The loads should fold into the SSE instructions, as in:

	pmuludq (%eax), %xmm0

The transformation probably requires an X86 specific pass or a DAG combiner
target specific hook.
//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

_test:
	movl 8(%esp), %eax
	cmpl %eax, 4(%esp)
	setl %al
	movzbl %al, %eax
	ret

on some processors (which ones?), it is more efficient to do this:

_test:
	movl 8(%esp), %ebx
	xor %eax, %eax
	cmpl %ebx, 4(%esp)
	setl %al
	ret

Doing this correctly is tricky though, as the xor clobbers the flags.
We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important.  e.g., for:

void setbit(int *target, int bit) {
   *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
   *target &= ~(1 << bit);
}
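A sketch (added here, not from the original note) of the desired selection for
setbit, assuming the flag result of bts is dead:

	movl 4(%esp), %eax
	movl 8(%esp), %ecx
	btsl %ecx, (%eax)	# set bit %ecx of *target in one instruction
	ret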
//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

	movl $16843009, 4(%edx)
	movl $16843009, (%edx)
	movw $257, 8(%edx)

It might be better to generate

	movl $16843009, %eax
	movl %eax, 4(%edx)
	movl %eax, (%edx)
	movw %ax, 8(%edx)

when we can spare a register.  It reduces code size.

//===---------------------------------------------------------------------===//
Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
generate:

	movl 4(%esp), %eax
	movl %eax, %ecx
	sarl $31, %ecx
	shrl $29, %ecx
	addl %ecx, %eax
	sarl $3, %eax
	ret

GCC knows several different ways to codegen it, one of which is this:

	movl 4(%esp), %eax
	cmpl $-1, %eax
	leal 7(%eax), %ecx
	cmovle %ecx, %eax
	sarl $3, %eax
	ret

which is probably slower, but it's interesting at least :)
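The trick in both sequences (explanation added here): an arithmetic right
shift rounds toward minus infinity, so a bias of 2^C - 1 must be added first
when X is negative to get C's round-toward-zero semantics.  In C (assuming >>
is an arithmetic shift, as on x86):

int sdiv8(int X) {
  int bias = (X >> 31) & 7;   /* 7 if X is negative, 0 otherwise */
  return (X + bias) >> 3;     /* the sar now rounds toward zero */
}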
//===---------------------------------------------------------------------===//

The first BB of this code:

	%V = call bool %foo()
	br bool %V, label %T, label %F

It would be better to emit "cmp %al, 1" than an xor and test.

//===---------------------------------------------------------------------===//
Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl.
We should leave these as libcalls for everything over a much lower threshold,
since libc is hand tuned for medium and large mem ops (avoiding RFO for large
stores, TLB preheating, etc.)

//===---------------------------------------------------------------------===//
Optimize this into something reasonable:
 x * copysign(1.0, y) * copysign(1.0, z)
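One reasonable target form, sketched in C (added here; it assumes IEEE
doubles, where only the sign bits of y and z matter):

#include <stdint.h>
#include <string.h>

double f(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, 8);
  memcpy(&yb, &y, 8);
  memcpy(&zb, &z, 8);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* flip x's sign iff y and z disagree */
  memcpy(&x, &xb, 8);
  return x;
}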
//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//
%X = weak global int 0

void %foo(int %N) {
entry:
	%N = cast int %N to uint
	%tmp.24 = setgt int %N, 0
	br bool %tmp.24, label %no_exit, label %return

no_exit:		; preds = %entry, %no_exit
	%indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
	%i.0.0 = cast uint %indvar to int
	volatile store int %i.0.0, int* %X
	%indvar.next = add uint %indvar, 1
	%exitcond = seteq uint %indvar.next, %N
	br bool %exitcond, label %return, label %no_exit
	jl LBB_foo_4	# return
LBB_foo_1:	# no_exit.preheader
	movl L_X$non_lazy_ptr, %edx

	jne LBB_foo_2	# no_exit
LBB_foo_3:	# return.loopexit
We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented.  This can be accomplished with 1) a target
dependent LICM pass or 2) making the SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//
The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.

//===---------------------------------------------------------------------===//
We are generating far worse code than gcc:

volatile short X, Y;

void foo(int N) {
  int i;
  for (i = 0; i < N; i++) { X = i; Y = i*4; }
}

Our loop keeps reloading the globals:

LBB1_1:	#bb.preheader
	movl L_X$non_lazy_ptr, %esi
	movl L_Y$non_lazy_ptr, %edi

gcc hoists them and uses leal for the multiply:

	movl L_X$non_lazy_ptr-"L00000000001$pb"(%ebx), %esi
	movl L_Y$non_lazy_ptr-"L00000000001$pb"(%ebx), %ecx
	leal 0(,%edx,4), %eax
There are several issues:

1. Lack of post regalloc LICM.
2. Poor sub-regclass support.  That leads to inability to promote the 16-bit
   arithmetic op to 32-bit and to make use of leal.
3. LSR is unable to reuse the IV for a different type (i16 vs. i32) even though
   the cast would be free.

//===---------------------------------------------------------------------===//
Teach the coalescer to coalesce vregs of different register classes, e.g. FR32 /
FR64.

//===---------------------------------------------------------------------===//
Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.

//===---------------------------------------------------------------------===//
Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.  There
are a number of issues.  1) We are introducing a setcc between the result of the
intrinsic call and select.  2) The intrinsic is expected to produce an i32 value
so an any_extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//
We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//
If shorter, we should use things like:

movzwl (%esp), %eax

instead of:

movl (%esp), %eax
andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//
char foo(int x) { return x; }

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
the fact that the truncate is free on x86 (just refer to the low subregister).

//===---------------------------------------------------------------------===//
typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate this code with llvmgcc4:

we should be able to generate:

The issue is that llvmgcc4 is forcing the struct to memory, then passing it as
integer chunks.  It does this so that structs like {short,short} are passed in
a single 32-bit integer stack slot.  We should handle the safe cases above much
nicer, while still handling the hard cases.

While true in general, in this specific case we could do better by promoting
load int + bitcast to float -> load float.  This basically needs alignment info;
the code is already implemented (but disabled) in the dag combiner.

//===---------------------------------------------------------------------===//
Another instruction selector deficiency:

void %bar() {
	%tmp = load int (int)** %foo
	%tmp = tail call int %tmp( int 3 )
	ret void
}

The generated code loads the pointer with

	movl L_foo$non_lazy_ptr, %eax

rather than folding the load into the call.  The current isel scheme will not
allow the load to be folded in the call since the load's chain result is read
by the callseq_start.

//===---------------------------------------------------------------------===//
Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate, which would allow them to be
eliminated.

//===---------------------------------------------------------------------===//
How about implementing truncate / anyext as a property of machine instruction
operand? i.e. Print as 32-bit super-class register / 16-bit sub-class register.
Do this for the cases where a truncate / anyext is guaranteed to be eliminated.
For IA32 that is truncate from 32 to 16 and anyext from 16 to 32.

//===---------------------------------------------------------------------===//
We generate:

	imull $3, 4(%esp), %eax

for "return a * 3;".  Perhaps this is what we really should generate?  Is
imull three or four cycles?  Note: ICC generates this:

	movl 4(%esp), %eax
	leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity.  The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority?  We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better.  It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//
Implement CTTZ, CTLZ with bsf and bsr.
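A sketch of the mapping (added here; note that bsf/bsr leave the destination
undefined when the source is zero, so that case needs separate handling):

	bsfl %ecx, %eax		# cttz: index of the lowest set bit of %ecx
	bsrl %ecx, %eax		# ctlz: index of the highest set bit...
	xorl $31, %eax		# ...then 31 - index, which is xor 31 here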
//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//
int %foo(int* %a, int %t) {
entry:
	br label %cond_true

cond_true:		; preds = %cond_true, %entry
	%x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]
	%t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ]
	%tmp2 = getelementptr int* %a, int %x.0.0
	%tmp3 = load int* %tmp2			; <int> [#uses=1]
	%tmp5 = add int %t_addr.0.0, %x.0.0	; <int> [#uses=1]
	%tmp7 = add int %tmp5, %tmp3		; <int> [#uses=2]
	%tmp9 = add int %x.0.0, 1		; <int> [#uses=2]
	%tmp = setgt int %tmp9, 39		; <bool> [#uses=1]
	br bool %tmp, label %bb12, label %cond_true

bb12:		; preds = %cond_true
	ret int %tmp7
}

is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//
u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  return fh * 65536.0f + fl;
}

00000000	subl	$0x04,%esp
00000003	movl	0x08(%esp,1),%eax
00000007	movl	%eax,%ecx
00000009	shrl	$0x10,%ecx
0000000c	cvtsi2ss	%ecx,%xmm0
00000010	andl	$0x0000ffff,%eax
00000015	cvtsi2ss	%eax,%xmm1
00000019	mulss	0x00000078,%xmm0
00000021	addss	%xmm1,%xmm0
00000025	movss	%xmm0,(%esp,1)
0000002a	flds	(%esp,1)
0000002d	addl	$0x04,%esp
00000030	ret
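Why the source splits the value (explanation added here): cvtsi2ss is a signed
conversion, so a full u32 can't be converted directly.  Each half fits in 31
bits, so both cvtsi2ss uses are exact; fh*65536.0f needs at most 16
significant bits so it is exact too, and the final add rounds only once,
giving the correctly rounded float(u).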
//===---------------------------------------------------------------------===//

When using the fastcc ABI, align stack slots of arguments of type double on an
8 byte boundary to improve performance.

//===---------------------------------------------------------------------===//
Improve this:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b++;
  return b;
}
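A branch-free check is possible because 4 and 6 differ only in bit 1 (sketch
added here, not from the original note):

int f(int a, int b) {
  if ((a & ~2) == 4)   /* true exactly for a == 4 (100b) and a == 6 (110b) */
    b++;
  return b;
}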
//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b".  For example, instead of:

void f(int X, int Y) {

//===---------------------------------------------------------------------===//
Currently we don't have elimination of redundant stack manipulations.  Consider:

int %main() {
entry:
	call fastcc void %test1( )
	call fastcc void %test2( sbyte* cast (void ()* %test1 to sbyte*) )
	ret int 0
}

declare fastcc void %test1()
declare fastcc void %test2(sbyte*)

This currently compiles to:

The add/sub pair is really unneeded here.

//===---------------------------------------------------------------------===//
We currently compile sign_extend_inreg into two shifts:

long foo(long X) {
  return (long)(signed char)X;
}
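What the two forms compute (sketch of the equivalence, added here; assumes >>
is an arithmetic shift on 32-bit long):

/* current expansion: shll $24, %eax ; sarl $24, %eax */
long foo_shifts(long X) { return (X << 24) >> 24; }

/* desired: a single movsbl, i.e. direct sign extension of the low byte */
long foo_movsbl(long X) { return (signed char)X; }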
//===---------------------------------------------------------------------===//

Consider the expansion of:

uint %test3(uint %X) {
	%tmp1 = rem uint %X, 255
	ret uint %tmp1
}

Currently it compiles to:

	movl $2155905153, %ecx

This could be "reassociated" into:

	movl $2155905153, %eax

to avoid the copy.  In fact, the existing two-address stuff would do this
except that mul isn't a commutative 2-addr instruction.  I guess this has
to be done at isel time based on the #uses of the mul?
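For reference (arithmetic added here, not from the note): 2155905153 is
0x80808081 = ceil(2^39 / 255), the magic constant for unsigned division by 255:

unsigned div255(unsigned X) {
  /* high multiply, then shift: (X * 0x80808081) >> 39 == X / 255 */
  return (unsigned)(((unsigned long long)X * 0x80808081ULL) >> 39);
}
unsigned rem255(unsigned X) { return X - div255(X) * 255; }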
//===---------------------------------------------------------------------===//

Make sure the instruction which starts a loop does not cross a cacheline
boundary.  This requires knowing the exact length of each machine instruction.
That is somewhat complicated, but doable.  Example 256.bzip2:

In the new trace, the hot loop has an instruction which crosses a cacheline
boundary.  In addition to potential cache misses, this can't help decoding as I
imagine there has to be some kind of complicated decoder reset and realignment
to grab the bytes from the next cacheline.

532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
937  937 0x3d0a incl     %esi
3    3   0x3d0b cmpb     %bl, %dl
27   27  0x3d0d jnz      0x000062db <main+11707>

//===---------------------------------------------------------------------===//
In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.

//===---------------------------------------------------------------------===//
This could be a single 16-bit load.

int f(char *p) {
    if ((p[0] == 1) & (p[1] == 2)) return 1;
    return 0;
}
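i.e. (sketch added here; assumes a little-endian target where the unaligned
two-byte access is legal) the pair of byte compares folds to one 16-bit
compare:

#include <string.h>

int f(char *p) {
  unsigned short v;
  memcpy(&v, p, 2);     /* one 16-bit load */
  return v == 0x0201;   /* p[0]==1 in the low byte, p[1]==2 in the high byte */
}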
//===---------------------------------------------------------------------===//

We should inline lrintf and probably other libc functions.

//===---------------------------------------------------------------------===//
Start using the flags more.  For example, compile:

int add_zf(int *x, int y, int a, int b) {
     if ((*x += y) == 0)
          return a;
     else
          return b;
}
and this, which should use the sign flag from the same addl:

int add_zf(int *x, int y, int a, int b) {
    if ((*x += y) < 0)
        return a;
    else
        return b;
}

//===---------------------------------------------------------------------===//
int foo(double X) { return isnan(X); }

The pxor is not needed; we could compare the value against itself.
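i.e. (sketch added here, relying on a NaN being the only value that compares
unequal to itself):

int foo(double X) { return X != X; }   /* one ucomisd of X against itself */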
//===---------------------------------------------------------------------===//

These two functions have identical effects:

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}

We currently compile them to:

	jne LBB1_2	#UnifiedReturnBlock
LBB1_2:	#UnifiedReturnBlock

	leal 1(%ecx,%eax), %eax

both of which are inferior to GCC's output.

//===---------------------------------------------------------------------===//
This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:
It would be better to produce:

This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

Both are useful in different situations.  Finally, it could be shrink-wrapped
and tail called, like this:

	pop %eax	# realign stack.

Though this probably isn't worth it.

//===---------------------------------------------------------------------===//
We need to teach the codegen to convert two-address INC instructions to LEA
when the flags are dead.  For example, on X86-64, compile:

int foo(int A, int B) {
  return A+1;
}
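The point (sketch added here): incl reads and writes its single operand and
clobbers EFLAGS, while leal can write the sum into a fresh register without
touching the flags:

	movl %edi, %eax		# current: copy, then two-address inc
	incl %eax

	leal 1(%edi), %eax	# better: one three-address lea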
//===---------------------------------------------------------------------===//

We use push/pop of stack space around calls in situations where we don't have to.
The call to f below produces:

The stack push/pop could be moved into the prolog/epilog.  We currently emit it
around the call because the function is building a frame pointer, but that by
itself should not be sufficient; only the use of alloca should cause it.
(There are other issues shown by this code, but this is one.)
typedef struct _range_t {
    unsigned char lut[];

    const range_t*const*range;

typedef struct _decode_t decode_t;

extern int f(const decode_t* decode);

int decode_byte (const decode_t* decode) {
  if (decode->swap != 0)