//===---------------------------------------------------------------------===//
// Random ideas for the X86 backend.
//===---------------------------------------------------------------------===//

Add MUL2U and MUL2S nodes to represent a multiply that returns both the
Hi and Lo parts (combination of MUL and MULH[SU] into one node). Add this to
X86, and make the dag combiner produce it when needed. This will eliminate one
imul from the code generated for:

long long test(long long X, long long Y) { return X*Y; }

by using the EAX result from the mul. We should add a similar node for
DIV.

Similarly, for:

long long test(int X, int Y) { return (long long)X*Y; }

... which should only be one imul instruction.

This can be done with a custom expander, but it would be nice to move this to
generic code.

//===---------------------------------------------------------------------===//

This should be one DIV/IDIV instruction, not a libcall:

unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}

This can be done trivially with a custom legalizer. What about overflow
though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
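
A small standalone illustration of the overflow concern (the function name
and test value here are invented for illustration):

#include <stdio.h>

/* The C expression computes the quotient in 64 bits and truncates to 32
   bits, but a single x86 DIV with a 64-bit dividend faults (#DE) whenever
   the quotient does not fit in 32 bits. */
unsigned div64_32(unsigned long long X, unsigned Y) { return X / Y; }

int main(void) {
  /* quotient is 2^40: well-defined in C, but one bare DIV would trap */
  printf("%u\n", div64_32(1ULL << 40, 1));
  return 0;
}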

//===---------------------------------------------------------------------===//

Improvements to the multiply -> shift/add algorithm:
http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
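
For reference, the kind of decomposition the algorithm produces, written out
in C (function names are made up for illustration):

int mul9(int x)          { return x * 9; }        /* candidate multiply */
int mul9_expanded(int x) { return (x << 3) + x; } /* one shift + one add */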

//===---------------------------------------------------------------------===//

Improve code like this (occurs fairly frequently, e.g. in LLVM):

long long foo(int x) { return 1LL << x; }

http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html

Another useful one would be ~0ULL >> X and ~0ULL << X.

One better solution for 1LL << x is:
        xorl    %eax, %eax
        xorl    %edx, %edx
        testb   $32, %cl
        sete    %al
        setne   %dl
        sall    %cl, %eax
        sall    %cl, %edx

But that requires good 8-bit subreg support.

//===---------------------------------------------------------------------===//

_Bool f(_Bool a) { return a!=1; }

//===---------------------------------------------------------------------===//

Some ideas for instruction selection:

1. Dynamic programming based approach when compile time is not an
   issue.
2. Code duplication (addressing mode) during isel.
3. Other ideas from "Register-Sensitive Selection, Duplication, and
   Sequencing of Instructions".
4. Scheduling for reduced register pressure. E.g. "Minimum Register
   Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
   and other related papers.
   http://citeseer.ist.psu.edu/govindarajan01minimum.html

//===---------------------------------------------------------------------===//

Should we promote i16 to i32 to avoid partial register update stalls?
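
For reference, the kind of source that produces 16-bit partial-register
writes (the function name is made up):

short add16(short a, short b) {
  /* a 16-bit add here writes only %ax, leaving the upper bits of %eax
     live and creating a partial-register dependence on some processors */
  return a + b;
}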

//===---------------------------------------------------------------------===//

Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post-register allocation.

//===---------------------------------------------------------------------===//

Count leading zeros and count trailing zeros:

int clz(int X) { return __builtin_clz(X); }
int ctz(int X) { return __builtin_ctz(X); }

$ gcc t.c -S -o - -O3 -fomit-frame-pointer -masm=intel
clz:
        bsr %eax, DWORD PTR [%esp+4]
        xor %eax, 31
        ret
ctz:
        bsf %eax, DWORD PTR [%esp+4]
        ret

however, check that these are defined for 0 and 32. Our intrinsics are, GCC's
aren't.
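
An illustration of the definedness difference (the wrapper name is invented):
GCC's builtins are undefined at zero because bsr/bsf leave the destination
undefined there, so a fully-defined version needs a guard:

int clz_defined(unsigned X) { return X ? __builtin_clz(X) : 32; }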

//===---------------------------------------------------------------------===//

Use push/pop instructions in prolog/epilog sequences instead of stores off
ESP (certain code size win, perf win on some [which?] processors).
Also, it appears icc uses push for parameter passing. Need to investigate.

//===---------------------------------------------------------------------===//

Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some processor
flags.

//===---------------------------------------------------------------------===//

The instruction selector sometimes misses folding a load into a compare. The
pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The dag combiner
should be made smart enough to canonicalize the load into the RHS of a compare
when it can invert the result of the compare for free.
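
A source-level example of the miss (the function name is made up): here the
load naturally lands on the LHS of the compare, so moving it to the RHS means
the predicate must flip from 'lt' to 'gt':

int cmp_load(int x, int *p) { return *p < x; }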

//===---------------------------------------------------------------------===//

How about intrinsics? An example is:
  *res = _mm_mulhi_epu16(*A, _mm_mul_epu32(*B, *C));

compiles to (in part):
        pmuludq (%eax), %xmm0

The transformation probably requires an X86-specific pass or a DAG combiner
target-specific hook.

//===---------------------------------------------------------------------===//

In many cases, LLVM generates code like this:

        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax

on some processors (which ones?), it is more efficient to do this:

        xorl %eax, %eax
        cmpl %ecx, 4(%esp)
        setl %al

Doing this correctly is tricky though, as the xor clobbers the flags, so it
has to be scheduled before the compare.

//===---------------------------------------------------------------------===//

We should generate bts/btr/etc instructions on targets where they are cheap or
when codesize is important. e.g., for:

void setbit(int *target, int bit) {
  *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
  *target &= ~(1 << bit);
}

//===---------------------------------------------------------------------===//

Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.

//===---------------------------------------------------------------------===//

Evaluate what the best way to codegen sdiv X, (2^C) is. For X/8, we currently
generate the usual shift-with-sign-fixup sequence. GCC knows several different
ways to codegen it; one of them is probably slower, but it's interesting at
least :)
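
For reference, a C sketch of the standard fixup (assuming arithmetic right
shift on signed int, which the generated code relies on anyway): a negative
dividend needs 7 added before the shift so the result rounds toward zero:

int sdiv8(int x) {
  return (x + ((x >> 31) & 7)) >> 3;  /* == x/8 for all 32-bit x */
}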

//===---------------------------------------------------------------------===//

Should generate min/max for stuff like:

void minf(float a, float b, float *X) {
  *X = a <= b ? a : b;
}

Make use of floating point min / max instructions. Perhaps introduce ISD::FMIN
and ISD::FMAX node types?

//===---------------------------------------------------------------------===//

The first BB of this code:

        %V = call bool %foo()
        br bool %V, label %T, label %F

currently branches on the call result with a xor and a test. It would be
better to emit "cmp %al, 1" than a xor and test.

//===---------------------------------------------------------------------===//

Enable X86InstrInfo::convertToThreeAddress().

//===---------------------------------------------------------------------===//

We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl. We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
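
A minimal sketch of the proposed policy, assuming a made-up cutoff (the real
threshold would have to be measured per processor):

enum MemOpLowering { INLINE_REP_MOVS, LIBCALL };

enum MemOpLowering chooseMemcpyLowering(unsigned long size) {
  if (size <= 128)            /* hypothetical cutoff, not a measured one */
    return INLINE_REP_MOVS;   /* small copies: inlining avoids call overhead */
  return LIBCALL;             /* libc is hand tuned for medium/large ops */
}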

//===---------------------------------------------------------------------===//

Optimize this into something reasonable:
  x * copysign(1.0, y) * copysign(1.0, z)
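
"Something reasonable" here could be a pair of integer sign-bit xors, since
multiplying by copysign(1.0, y) only flips the sign of x when y is negative.
A C sketch of the target code for the double case (the helper name is
invented; apart from signaling-NaN quieting this matches the multiplies):

#include <stdint.h>
#include <string.h>

double xmul(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, 8);
  memcpy(&yb, &y, 8);
  memcpy(&zb, &z, 8);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* xor in the two sign bits */
  memcpy(&x, &xb, 8);
  return x;
}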

//===---------------------------------------------------------------------===//

Optimize copysign(x, *y) to use an integer load from y.

//===---------------------------------------------------------------------===//

%X = weak global int 0

void %foo(int %N) {
entry:
        %N = cast int %N to uint
        %tmp.24 = setgt int %N, 0
        br bool %tmp.24, label %no_exit, label %return

no_exit:
        %indvar = phi uint [ 0, %entry ], [ %indvar.next, %no_exit ]
        %i.0.0 = cast uint %indvar to int
        volatile store int %i.0.0, int* %X
        %indvar.next = add uint %indvar, 1
        %exitcond = seteq uint %indvar.next, %N
        br bool %exitcond, label %return, label %no_exit

return:
        ret void
}

compiles into a loop like this:

        jl LBB_foo_4    # return
LBB_foo_1:      # no_exit.preheader
LBB_foo_2:      # no_exit
        movl L_X$non_lazy_ptr, %edx
        ...
        jne LBB_foo_2   # no_exit
LBB_foo_3:      # return.loopexit

We should hoist "movl L_X$non_lazy_ptr, %edx" out of the loop after
rematerialization is implemented. This can be accomplished with 1) a target
dependent LICM pass or 2) making SelectionDAG represent the whole function.

//===---------------------------------------------------------------------===//

The following tests perform worse with LSR:

lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and
Treesort.

//===---------------------------------------------------------------------===//

Teach the coalescer to coalesce vregs of different register classes, e.g.
FR32 / FR64 to VR128.

//===---------------------------------------------------------------------===//

Obviously it would have been better for the first mov (or any op) to store
directly to %esp[0] if there are no other uses.

//===---------------------------------------------------------------------===//

Adding to the list of cmp / test poor codegen issues:

int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}

Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae.
There are a number of issues. 1) We are introducing a setcc between the result
of the intrinsic call and select. 2) The intrinsic is expected to produce an
i32 value so an any_extend (which becomes a zero extend) is added.

We probably need some kind of target DAG combine hook to fix this.

//===---------------------------------------------------------------------===//

We generate significantly worse code for this than GCC:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701

There is also one case we do worse on PPC.

//===---------------------------------------------------------------------===//

If shorter, we should use things like:

        movzwl %ax, %eax

instead of:

        andl $65535, %eax

The former can also be used when the two-addressy nature of the 'and' would
require a copy to be inserted (in X86InstrInfo::convertToThreeAddress).

//===---------------------------------------------------------------------===//

char foo(int x) { return x; }

SIGN_EXTEND_INREG can be implemented as (sext (trunc)) to take advantage of
sub-registers.
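
In C terms, the equivalence being proposed (cast chain spelled out; 'signed
char' used to avoid plain-char signedness issues):

int sext_inreg_i8(int x) {
  /* sign_extend_inreg(x, i8): (x << 24) >> 24 == sext(trunc x to i8) */
  return (int)(signed char)x;
}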

//===---------------------------------------------------------------------===//

typedef struct pair { float A, B; } pair;
void pairtest(pair P, float *FP) {
        *FP = P.A+P.B;
}

We currently generate poor code for this with llvmgcc4; we should be able to
generate much better code. The issue is that llvmgcc4 is forcing the struct to
memory, then passing it as integer chunks. It does this so that structs like
{short,short} are passed in a single 32-bit integer stack slot. We should
handle the safe cases above much nicer, while still handling the hard cases.

//===---------------------------------------------------------------------===//

Another instruction selector deficiency:

void %bar() {
        %tmp = load int (int)** %foo
        %tmp = tail call int %tmp( int 3 )
        ret void
}

_bar:
        ...
        movl L_foo$non_lazy_ptr, %eax
        ...

The current isel scheme will not allow the load to be folded into the call
since the load's chain result is read by the callseq_start.

//===---------------------------------------------------------------------===//

Don't forget to find a way to squash noop truncates in the JIT environment.

//===---------------------------------------------------------------------===//

Implement anyext in the same manner as truncate. That would allow them to be
eliminated.

//===---------------------------------------------------------------------===//

How about implementing truncate / anyext as a property of a machine
instruction operand? i.e. Print as 32-bit super-class register / 16-bit
sub-class register. Do this for the cases where a truncate / anyext is
guaranteed to be eliminated. For IA32 that is truncate from 32 to 16 and
anyext from 16 to 32.

//===---------------------------------------------------------------------===//

For this:

int test(int a) { return a * 3; }

we currently generate:

        imull $3, 4(%esp), %eax

Perhaps this is what we really should generate. Is imull three or four
cycles? Note: ICC generates this:

        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax

The current instruction priority is based on pattern complexity. The former is
more "complex" because it folds a load so the latter will not be emitted.

Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
should always try to match LEA first since the LEA matching code does some
estimate to determine whether the match is profitable.

However, if we care more about code size, then imull is better. It's two bytes
shorter than movl + leal.

//===---------------------------------------------------------------------===//

Implement CTTZ, CTLZ with bsf and bsr.

//===---------------------------------------------------------------------===//

It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.

//===---------------------------------------------------------------------===//

We should handle __attribute__ ((__visibility__ ("hidden"))).

//===---------------------------------------------------------------------===//

int %foo(int* %a, int %t) {
entry:
        br label %cond_true

cond_true:              ; preds = %cond_true, %entry
        %x.0.0 = phi int [ 0, %entry ], [ %tmp9, %cond_true ]       ; <int> [#uses=3]
        %t_addr.0.0 = phi int [ %t, %entry ], [ %tmp7, %cond_true ] ; <int> [#uses=1]
        %tmp2 = getelementptr int* %a, int %x.0.0                   ; <int*> [#uses=1]
        %tmp3 = load int* %tmp2                                     ; <int> [#uses=1]
        %tmp5 = add int %t_addr.0.0, %x.0.0                         ; <int> [#uses=1]
        %tmp7 = add int %tmp5, %tmp3                                ; <int> [#uses=2]
        %tmp9 = add int %x.0.0, 1                                   ; <int> [#uses=2]
        %tmp = setgt int %tmp9, 39                                  ; <bool> [#uses=1]
        br bool %tmp, label %bb12, label %cond_true

bb12:           ; preds = %cond_true
        ret int %tmp7
}

is pessimized by -loop-reduce and -indvars

//===---------------------------------------------------------------------===//

Use cpuid to auto-detect CPU features such as SSE, SSE2, and SSE3.

//===---------------------------------------------------------------------===//

u32 to float conversion improvement:

float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  return fh*65536.0f + fl;
}

00000000        subl    $0x04,%esp
00000003        movl    0x08(%esp,1),%eax
00000007        movl    %eax,%ecx
00000009        shrl    $0x10,%ecx
0000000c        cvtsi2ss        %ecx,%xmm0
00000010        andl    $0x0000ffff,%eax
00000015        cvtsi2ss        %eax,%xmm1
00000019        mulss   0x00000078,%xmm0
00000021        addss   %xmm1,%xmm0
00000025        movss   %xmm0,(%esp,1)
0000002a        flds    (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret

//===---------------------------------------------------------------------===//

When using the fastcc ABI, align stack slots of arguments of type double on an
8 byte boundary to improve performance.

//===---------------------------------------------------------------------===//

Compile:

int f(int a, int b) {
  if (a == 4 || a == 6)
    b = b * b;
  return b;
}

If we aren't going to optimize this directly, we should lower the switch
better. We currently compile it to a binary sequence of compares, including:

        jmp LBB1_2      #UnifiedReturnBlock
        ...
        jne LBB1_2      #UnifiedReturnBlock
        ...
LBB1_3:
        ...
LBB1_2:         #UnifiedReturnBlock

In the code above, the 'if' is turned into a 'switch' at the mid-level. It
looks like the 'lower to branches' mode could be improved a little here. In
particular, the fall-through to LBB1_3 doesn't need a branch. It would also be
nice to eliminate the redundant "cmp 6", maybe by lowering to a linear
sequence of compares (instead of a binary sequence) when there are fewer than
a certain number of cases.
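
Incidentally, for this particular pair of case values there is a source-level
trick the lowering could aim for (a sketch only; it relies on 4 and 6
differing in exactly one bit):

int f2(int a, int b) {
  if ((a | 2) == 6)   /* true exactly when a == 4 or a == 6 */
    b = b * b;
  return b;
}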

//===---------------------------------------------------------------------===//

int %test(ulong *%tmp) {
        %tmp = load ulong* %tmp                         ; <ulong> [#uses=1]
        %tmp.mask = shr ulong %tmp, ubyte 50            ; <ulong> [#uses=1]
        %tmp.mask = cast ulong %tmp.mask to ubyte       ; <ubyte> [#uses=1]
        %tmp2 = and ubyte %tmp.mask, 3                  ; <ubyte> [#uses=1]
        %tmp2 = cast ubyte %tmp2 to int                 ; <int> [#uses=1]
        ret int %tmp2
}

currently compiles with a redundant 8-bit truncate in the output:

        # TRUNCATE movb %al, %al

Propagating the zext through the and saves a movzbl, and saves a truncate if
it doesn't get coalesced right. This is a simple DAGCombine.
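
A quick sanity check of the identity behind the combine (zero-extending a
masked truncate equals masking the wide value, whenever the mask fits in the
narrow type):

#include <assert.h>
#include <stdint.h>

int main(void) {
  uint64_t x = 0x123456789abcdef0ULL;
  uint8_t narrow = (uint8_t)(x >> 50) & 3;  /* trunc, and, then zext */
  uint64_t wide = (x >> 50) & 3;            /* and on the wide value */
  assert((uint64_t)narrow == wide);
  return 0;
}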

//===---------------------------------------------------------------------===//

GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
simplifications for integer "x cmp y ? a : b".
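
One classic example of the kind of simplification involved (a sketch, not
necessarily the exact case the original note showed): a select of -1 vs 0 on
a sign test needs no branch or cmov at all, just an arithmetic shift:

int sign_select(int x) {
  return x < 0 ? -1 : 0;   /* equivalent to x >> 31 with arithmetic shift */
}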

//===---------------------------------------------------------------------===//