1 //===---------------------------------------------------------------------===//
2 // Random ideas for the X86 backend.
3 //===---------------------------------------------------------------------===//
5 This should be one DIV/IDIV instruction, not a libcall:
unsigned test(unsigned long long X, unsigned Y) {
        return X/Y;
}
11 This can be done trivially with a custom legalizer. What about overflow
12 though? http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
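As a sketch (not part of the original note; hypothetical helper name, GNU C
inline asm): the legalized form boils down to a single divl, and the #DE trap
taken when the quotient does not fit in 32 bits is exactly the overflow
question above.

unsigned divq32(unsigned long long X, unsigned Y) {
  unsigned quot = (unsigned)X;          /* low half, goes in EAX */
  unsigned rem  = (unsigned)(X >> 32);  /* high half, goes in EDX */
  /* divl divides EDX:EAX by the operand: quotient -> EAX, remainder -> EDX.
     It traps if the quotient does not fit in 32 bits. */
  __asm__("divl %4" : "=a"(quot), "=d"(rem) : "0"(quot), "1"(rem), "rm"(Y));
  return quot;
}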
14 //===---------------------------------------------------------------------===//
16 Improvements to the multiply -> shift/add algorithm:
17 http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
19 //===---------------------------------------------------------------------===//
21 Improve code like this (occurs fairly frequently, e.g. in LLVM):
22 long long foo(int x) { return 1LL << x; }
24 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
25 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
26 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
28 Another useful one would be ~0ULL >> X and ~0ULL << X.
One better solution for 1LL << x is:

        xorl    %eax, %eax
        xorl    %edx, %edx
        testb   $32, %cl
        sete    %al
        setne   %dl
        sall    %cl, %eax
        sall    %cl, %edx
39 But that requires good 8-bit subreg support.
Also, the variant from http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html
(minus the unnecessary and) might be better: it's an extra shift, but it's one
instruction shorter and doesn't stress 8-bit subreg support.
52 64-bit shifts (in general) expand to really bad code. Instead of using
53 cmovs, we should expand to a conditional branch like GCC produces.
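For reference, here is what all of these sequences compute, as a C sketch
(hypothetical helper; relies on x86 shifts masking their count to 5 bits):

unsigned long long shl1_64(int x) {
  /* For x in [0,63]: pick which half receives the bit, then perform the
     same 5-bit shift in both halves, as the setcc sequences above do. */
  unsigned lo = (x & 32) ? 0 : 1u << (x & 31);
  unsigned hi = (x & 32) ? 1u << (x & 31) : 0;
  return ((unsigned long long)hi << 32) | lo;
}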
55 //===---------------------------------------------------------------------===//
Some isel ideas:

1. Dynamic programming based approach when compile time is not an
   issue.
61 2. Code duplication (addressing mode) during isel.
62 3. Other ideas from "Register-Sensitive Selection, Duplication, and
63 Sequencing of Instructions".
64 4. Scheduling for reduced register pressure. E.g. "Minimum Register
65 Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
66 and other related papers.
67 http://citeseer.ist.psu.edu/govindarajan01minimum.html
69 //===---------------------------------------------------------------------===//
71 Should we promote i16 to i32 to avoid partial register update stalls?
73 //===---------------------------------------------------------------------===//
Leave any_extend as a pseudo instruction and hint to the register
allocator. Delay codegen until post register allocation.
Note: any_extend is now turned into an INSERT_SUBREG. We still need to teach
the coalescer how to deal with it, though.
80 //===---------------------------------------------------------------------===//
It appears icc uses push for parameter passing. This needs to be investigated.
84 //===---------------------------------------------------------------------===//
This:

void foo(void);
void bar(int x, int *P) {
  x >>= 2;
  if (x)
    foo();
}
104 Instead of doing an explicit test, we can use the flags off the sar. This
105 occurs in a bigger testcase like this, which is pretty common:
#include <vector>
int test1(std::vector<int> &X) {
  int Sum = 0;
  for (long i = 0, e = X.size(); i != e; ++i)
    X[i] = 0;
  return Sum;
}
115 //===---------------------------------------------------------------------===//
117 Only use inc/neg/not instructions on processors where they are faster than
add/sub/xor. They are slower on the P4 due to only updating some of the
processor flags.
121 //===---------------------------------------------------------------------===//
123 The instruction selector sometimes misses folding a load into a compare. The
124 pattern is written as (cmp reg, (load p)). Because the compare isn't
commutative, it is not matched with the load on both sides. The DAG combiner
126 should be made smart enough to canonicalize the load into the RHS of a compare
127 when it can invert the result of the compare for free.
129 //===---------------------------------------------------------------------===//
In many cases, LLVM generates code like this:

_test:
        movl 8(%esp), %eax
        cmpl %eax, 4(%esp)
        setl %al
        movzbl %al, %eax
        ret

on some processors (which ones?), it is more efficient to do this:

_test:
        xorl %eax, %eax
        movl 8(%esp), %ecx
        cmpl %ecx, 4(%esp)
        setl %al
        ret

Doing this correctly is tricky though, as the xor clobbers the flags.
151 //===---------------------------------------------------------------------===//
153 We should generate bts/btr/etc instructions on targets where they are cheap or
154 when codesize is important. e.g., for:
void setbit(int *target, int bit) {
   *target |= (1 << bit);
}
void clearbit(int *target, int bit) {
   *target &= ~(1 << bit);
}
163 //===---------------------------------------------------------------------===//
Instead of the following for memset char*, 1, 10:

        movl $16843009, 4(%edx)
        movl $16843009, (%edx)
        movw $257, 8(%edx)

It might be better to generate

        movl $16843009, %eax
        movl %eax, 4(%edx)
        movl %eax, (%edx)
        movw %ax, 8(%edx)

when we can spare a register. It reduces code size.
180 //===---------------------------------------------------------------------===//
Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
get this:

define i32 @test1(i32 %X) {
        %Y = sdiv i32 %X, 8
        ret i32 %Y
}

_test1:
        movl 4(%esp), %eax
        movl %eax, %ecx
        sarl $31, %ecx
        shrl $29, %ecx
        addl %ecx, %eax
        sarl $3, %eax
        ret
GCC knows several different ways to codegen it, one of which is this:

_test1:
        movl 4(%esp), %eax
        cmpl $-1, %eax
        leal 7(%eax), %ecx
        cmovle %ecx, %eax
        sarl $3, %eax
        ret
209 which is probably slower, but it's interesting at least :)
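For reference, the adjustment both sequences compute, as a C sketch
(hypothetical name):

int sdiv8(int X) {
  /* Add 7 to negative dividends so the arithmetic shift rounds toward
     zero: (X >> 31) is -1 for negative X, and shifting that right
     logically by 29 yields the bias 7. */
  unsigned bias = (unsigned)(X >> 31) >> 29;
  return (int)((unsigned)X + bias) >> 3;
}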
211 //===---------------------------------------------------------------------===//
We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and
rep/movsl.  We should leave these as libcalls for everything over a much lower
threshold, since libc is hand tuned for medium and large mem ops (avoiding RFO
for large stores, TLB preheating, etc.).
218 //===---------------------------------------------------------------------===//
220 Optimize this into something reasonable:
221 x * copysign(1.0, y) * copysign(1.0, z)
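A sketch of the "reasonable" form for the first one (hypothetical name;
assumes IEEE doubles and ignores signaling-NaN subtleties): the two copysign
multiplications reduce to xoring the sign bits of y and z into x.

#include <stdint.h>
#include <string.h>

double fold_copysigns(double x, double y, double z) {
  uint64_t xb, yb, zb;
  memcpy(&xb, &x, sizeof xb);
  memcpy(&yb, &y, sizeof yb);
  memcpy(&zb, &z, sizeof zb);
  xb ^= (yb ^ zb) & 0x8000000000000000ULL;  /* combine the two sign flips */
  memcpy(&x, &xb, sizeof x);
  return x;
}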
223 //===---------------------------------------------------------------------===//
225 Optimize copysign(x, *y) to use an integer load from y.
227 //===---------------------------------------------------------------------===//
229 The following tests perform worse with LSR:
231 lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
233 //===---------------------------------------------------------------------===//
235 Adding to the list of cmp / test poor codegen issues:
int test(__m128 *A, __m128 *B) {
  if (_mm_comige_ss(*A, *B))
    return 3;
  else
    return 4;
}
In the generated code, the setae, movzbl, cmpl, cmove can be replaced with a
single cmovae. There are a number of issues. 1) We are introducing a setcc
between the result of the intrinsic call and the select. 2) The intrinsic is
expected to produce an i32 value, so an any_extend (which becomes a
zero_extend) is added.
263 We probably need some kind of target DAG combine hook to fix this.
265 //===---------------------------------------------------------------------===//
267 We generate significantly worse code for this than GCC:
268 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
269 http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
There is also one case where we do worse on PPC.
273 //===---------------------------------------------------------------------===//
This:

int test(int a) { return a*3; }

compiles to:

        imull $3, 4(%esp), %eax
Perhaps this is what we really should generate.  Is imull three or four
cycles?  Note: ICC generates this:
        movl 4(%esp), %eax
        leal (%eax,%eax,2), %eax
The current instruction priority is based on pattern complexity.  The former is
more "complex" because it folds a load, so the latter will not be emitted.
293 Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
294 should always try to match LEA first since the LEA matching code does some
295 estimate to determine whether the match is profitable.
297 However, if we care more about code size, then imull is better. It's two bytes
298 shorter than movl + leal.
300 On a Pentium M, both variants have the same characteristics with regard
301 to throughput; however, the multiplication has a latency of four cycles, as
302 opposed to two cycles for the movl+lea variant.
304 //===---------------------------------------------------------------------===//
306 __builtin_ffs codegen is messy.
308 int ffs_(unsigned X) { return __builtin_ffs(X); }
331 Another example of __builtin_ffs (use predsimplify to eliminate a select):
int foo (unsigned long j) {
  if (j)
    return __builtin_ffs (j) - 1;
  else
    return 0;
}
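For reference, the semantics being lowered, as a C sketch (hypothetical name;
the natural x86 sequence is one bsf plus a cmov for the zero case):

int ffs_ref(unsigned X) {
  /* 1-based index of the least significant set bit; 0 when no bit is set. */
  return X ? __builtin_ctz(X) + 1 : 0;
}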
340 //===---------------------------------------------------------------------===//
It appears gcc places string data with linkonce linkage in
.section __TEXT,__const_coal,coalesced instead of
.section __DATA,__const_coal,coalesced.
Take a look at darwin.h; there are other Darwin assembler directives that we
do not make use of.
348 //===---------------------------------------------------------------------===//
define i32 @foo(i32* %a, i32 %t) {
entry:
        br label %cond_true

354 cond_true: ; preds = %cond_true, %entry
355 %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ] ; <i32> [#uses=3]
356 %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ] ; <i32> [#uses=1]
357 %tmp2 = getelementptr i32* %a, i32 %x.0.0 ; <i32*> [#uses=1]
358 %tmp3 = load i32* %tmp2 ; <i32> [#uses=1]
359 %tmp5 = add i32 %t_addr.0.0, %x.0.0 ; <i32> [#uses=1]
360 %tmp7 = add i32 %tmp5, %tmp3 ; <i32> [#uses=2]
361 %tmp9 = add i32 %x.0.0, 1 ; <i32> [#uses=2]
362 %tmp = icmp sgt i32 %tmp9, 39 ; <i1> [#uses=1]
363 br i1 %tmp, label %bb12, label %cond_true
bb12:           ; preds = %cond_true
        ret i32 %tmp7
}
368 is pessimized by -loop-reduce and -indvars
370 //===---------------------------------------------------------------------===//
372 u32 to float conversion improvement:
float uint32_2_float( unsigned u ) {
  float fl = (int) (u & 0xffff);
  float fh = (int) (u >> 16);
  return fh*65536.0f + fl;
}

becomes:
381 00000000 subl $0x04,%esp
382 00000003 movl 0x08(%esp,1),%eax
383 00000007 movl %eax,%ecx
384 00000009 shrl $0x10,%ecx
385 0000000c cvtsi2ss %ecx,%xmm0
386 00000010 andl $0x0000ffff,%eax
387 00000015 cvtsi2ss %eax,%xmm1
388 00000019 mulss 0x00000078,%xmm0
389 00000021 addss %xmm1,%xmm0
390 00000025 movss %xmm0,(%esp,1)
391 0000002a flds (%esp,1)
0000002d        addl    $0x04,%esp
00000030        ret
395 //===---------------------------------------------------------------------===//
When using the fastcc ABI, align stack slots of arguments of type double on an
8-byte boundary to improve performance.
400 //===---------------------------------------------------------------------===//
402 GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
403 simplifications for integer "x cmp y ? a : b".
405 //===---------------------------------------------------------------------===//
407 Consider the expansion of:
define i32 @test3(i32 %X) {
        %tmp1 = urem i32 %X, 255
        ret i32 %tmp1
}
Currently it compiles to:

...
        movl $2155905153, %ecx
        movl 8(%esp), %esi
        movl %esi, %eax
        mull %ecx
...

This could be "reassociated" into:

        movl $2155905153, %eax
        movl 8(%esp), %ecx
        mull %ecx
429 to avoid the copy. In fact, the existing two-address stuff would do this
430 except that mul isn't a commutative 2-addr instruction. I guess this has
431 to be done at isel time based on the #uses to mul?
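For reference, the magic-number expansion involved, as a C sketch
(hypothetical name): 2155905153 is 0x80808081, the multiply-high reciprocal
for division by 255.

unsigned urem255(unsigned X) {
  /* quotient = floor(X / 255) via multiply-high, then a multiply and a
     subtract recover the remainder. */
  unsigned q = (unsigned)(((unsigned long long)X * 0x80808081ULL) >> 39);
  return X - q * 255;
}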
433 //===---------------------------------------------------------------------===//
435 Make sure the instruction which starts a loop does not cross a cacheline
boundary. This requires knowing the exact length of each machine instruction.
437 That is somewhat complicated, but doable. Example 256.bzip2:
439 In the new trace, the hot loop has an instruction which crosses a cacheline
440 boundary. In addition to potential cache misses, this can't help decoding as I
441 imagine there has to be some kind of complicated decoder reset and realignment
442 to grab the bytes from the next cacheline.
444 532 532 0x3cfc movb (1809(%esp, %esi), %bl <<<--- spans 2 64 byte lines
445 942 942 0x3d03 movl %dh, (1809(%esp, %esi)
446 937 937 0x3d0a incl %esi
447 3 3 0x3d0b cmpb %bl, %dl
448 27 27 0x3d0d jnz 0x000062db <main+11707>
450 //===---------------------------------------------------------------------===//
In C99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
454 //===---------------------------------------------------------------------===//
This could be a single 16-bit load.

int f(char *p) {
    if ((p[0] == 1) & (p[1] == 2)) return 1;
    return 0;
}
463 //===---------------------------------------------------------------------===//
465 We should inline lrintf and probably other libc functions.
467 //===---------------------------------------------------------------------===//
469 Use the FLAGS values from arithmetic instructions more. For example, compile:
int add_zf(int *x, int y, int a, int b) {
     if ((*x += y) == 0)
          return a;
     else
          return b;
}
493 As another example, compile function f2 in test/CodeGen/X86/cmp-test.ll
494 without a test instruction.
496 //===---------------------------------------------------------------------===//
498 These two functions have identical effects:
500 unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
501 unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
We currently compile f to branchy code (with a jne to a LBB1_2
#UnifiedReturnBlock block) and f2 to an leal 1(%ecx,%eax), %eax sequence,
both of which are inferior to what GCC produces.
546 //===---------------------------------------------------------------------===//
This code:

void test(int X) {
  if (X) abort();
}

is currently compiled to:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne LBB1_1
        addl $12, %esp
        ret
LBB1_1:
        call L_abort$stub
It would be better to produce:

_test:
        subl $12, %esp
        cmpl $0, 16(%esp)
        jne L_abort$stub
        addl $12, %esp
        ret
574 This can be applied to any no-return function call that takes no arguments etc.
Alternatively, the stack save/restore logic could be shrink-wrapped, producing
something like this:

_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        subl $12, %esp
        call L_abort$stub
586 Both are useful in different situations. Finally, it could be shrink-wrapped
587 and tail called, like this:
_test:
        cmpl $0, 4(%esp)
        jne LBB1_1
        ret
LBB1_1:
        pop %eax   # realign stack.
        jmp L_abort$stub
597 Though this probably isn't worth it.
599 //===---------------------------------------------------------------------===//
601 Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
602 a neg instead of a sub instruction. Consider:
604 int test(char X) { return 7-X; }
we currently produce:

_test:
        movsbl 4(%esp), %ecx
        movl $7, %eax
        subl %ecx, %eax
        ret

We would use one fewer register if codegen'd as:

        movsbl 4(%esp), %eax
        negl %eax
        addl $7, %eax
        ret
620 Note that this isn't beneficial if the load can be folded into the sub. In
621 this case, we want a sub:
623 int test(int X) { return 7-X; }
629 //===---------------------------------------------------------------------===//
Leaf functions that require one 4-byte spill slot have a prolog like this:

_foo:
        subl $4, %esp

and an epilog like this:

        addl $4, %esp
        ret
642 It would be smaller, and potentially faster, to push eax on entry and to
643 pop into a dummy register instead of using addl/subl of esp. Just don't pop
644 into any return registers :)
646 //===---------------------------------------------------------------------===//
648 The X86 backend should fold (branch (or (setcc, setcc))) into multiple
649 branches. We generate really poor code for:
double testf(double a) {
       return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}
For example, the entry BB is:

_testf:
        movsd 24(%esp), %xmm1
        xorpd %xmm0, %xmm0
        ucomisd %xmm0, %xmm1
        setnp %al
        sete %cl
        testb %cl, %al
        jne LBB1_5  # UnifiedReturnBlock
LBB1_1: # cond_true

it would be better to replace the last four instructions with:

        jp LBB1_1
        je LBB1_5
LBB1_1:
We also codegen the inner ?: into a diamond:

        cvtss2sd LCPI1_0(%rip), %xmm2
        cvtss2sd LCPI1_1(%rip), %xmm3
        ucomisd %xmm0, %xmm1
        ja LBB1_3 # cond_true
LBB1_2: # cond_true
        movapd %xmm3, %xmm2
LBB1_3: # cond_true
        movapd %xmm2, %xmm0
        ret
687 We should sink the load into xmm3 into the LBB1_2 block. This should
688 be pretty easy, and will nuke all the copies.
690 //===---------------------------------------------------------------------===//
694 inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
695 { return std::make_pair(a + b, a + b < a); }
696 bool no_overflow(unsigned a, unsigned b)
697 { return !full_add(a, b).second; }
This should compile to:

        addl    %esi, %edi
        setae   %al
        movzbl  %al, %eax
        ret

on x86-64, instead of the rather stupid-looking:

        addl    %esi, %edi
        cmpl    %esi, %edi
        setae   %al
        movzbl  %al, %eax
        ret
713 //===---------------------------------------------------------------------===//
This code:

bb114.preheader:                ; preds = %cond_next94
718 %tmp231232 = sext i16 %tmp62 to i32 ; <i32> [#uses=1]
719 %tmp233 = sub i32 32, %tmp231232 ; <i32> [#uses=1]
720 %tmp245246 = sext i16 %tmp65 to i32 ; <i32> [#uses=1]
721 %tmp252253 = sext i16 %tmp68 to i32 ; <i32> [#uses=1]
722 %tmp254 = sub i32 32, %tmp252253 ; <i32> [#uses=1]
723 %tmp553554 = bitcast i16* %tmp37 to i8* ; <i8*> [#uses=2]
724 %tmp583584 = sext i16 %tmp98 to i32 ; <i32> [#uses=1]
725 %tmp585 = sub i32 32, %tmp583584 ; <i32> [#uses=1]
726 %tmp614615 = sext i16 %tmp101 to i32 ; <i32> [#uses=1]
727 %tmp621622 = sext i16 %tmp104 to i32 ; <i32> [#uses=1]
728 %tmp623 = sub i32 32, %tmp621622 ; <i32> [#uses=1]
733 LBB3_5: # bb114.preheader
734 movswl -68(%ebp), %eax
738 movswl -52(%ebp), %eax
741 movswl -70(%ebp), %eax
744 movswl -50(%ebp), %eax
747 movswl -42(%ebp), %eax
749 movswl -66(%ebp), %eax
This appears to be bad because the RA is not folding the store to the stack
slot into the movl; the instructions above could use the folded forms instead.
This seems like a cross between remat and spill folding.
This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
change, so we could simply subtract %eax from %ecx first and then use %ecx (or
vice-versa).
765 //===---------------------------------------------------------------------===//
For this code:

%tmp659 = icmp slt i16 %tmp654, 0               ; <i1> [#uses=1]
br i1 %tmp659, label %cond_true662, label %cond_next715

we produce:

        testw   %cx, %cx
        jns     LBB4_109        # cond_next715
778 Shark tells us that using %cx in the testw instruction is sub-optimal. It
779 suggests using the 32-bit register (which is what ICC uses).
781 //===---------------------------------------------------------------------===//
Take the following code:

void compare (long long foo) {
  if (foo < 4294967297LL)
    abort();
}

We currently generate:

compare:
        subl    $4, %esp
        cmpl    $0, 8(%esp)
        setne   %al
        movzbw  %al, %ax
        cmpl    $1, 12(%esp)
        setg    %cl
        movzbw  %cl, %cx
        cmove   %ax, %cx
        testb   $1, %cl
        jne     .LBB1_2 # UnifiedReturnBlock
.LBB1_1:        # ifthen
        call    abort
.LBB1_2:        # UnifiedReturnBlock
        addl    $4, %esp
        ret
(also really horrible code on ppc).  This is due to the expand code for 64-bit
compares.  GCC produces multiple branches, which is much nicer.
831 //===---------------------------------------------------------------------===//
Tail call optimization improvements: Tail call optimization currently
pushes all arguments on the top of the stack (their normal place for
non-tail call optimized calls) that source from the caller's arguments
or that source from a virtual register (also possibly sourcing from
caller's arguments).
This is done to prevent overwriting of parameters (see example
below) that might be used later.

example:
843 int callee(int32, int64);
844 int caller(int32 arg1, int32 arg2) {
845 int64 local = arg2 * 2;
  return callee(arg2, (int64)local);
}

[arg1]          [!arg2 no longer valid since we moved local onto it]
[arg2]      ->  [(int64)
[RETADDR]       local  ]

Moving arg1 onto the stack slot of the callee function would overwrite
arg2 of the caller.
856 Possible optimizations:
859 - Analyse the actual parameters of the callee to see which would
860 overwrite a caller parameter which is used by the callee and only
861 push them onto the top of the stack.
863 int callee (int32 arg1, int32 arg2);
864 int caller (int32 arg1, int32 arg2) {
  return callee(arg1,arg2);
}
868 Here we don't need to write any variables to the top of the stack
869 since they don't overwrite each other.
871 int callee (int32 arg1, int32 arg2);
872 int caller (int32 arg1, int32 arg2) {
  return callee(arg2,arg1);
}

Here we need to push the arguments because they overwrite each
other.
879 //===---------------------------------------------------------------------===//
884 unsigned long int z = 0;
895 gcc compiles this to:
921 jge LBB1_4 # cond_true
924 addl $4294950912, %ecx
1. LSR should rewrite the first cmp with induction variable %ecx.
2. DAG combiner should fold the add of the constant into the compare.
941 //===---------------------------------------------------------------------===//
define i64 @test(double %X) {
        %Y = fptosi double %X to i64
        ret i64 %Y
}

compiles to:

_test:
        subl    $20, %esp
        movsd   24(%esp), %xmm0
        movsd   %xmm0, 8(%esp)
        fldl    8(%esp)
        fisttpll        (%esp)
        movl    (%esp), %eax
        movl    4(%esp), %edx
        addl    $20, %esp
        ret
962 This should just fldl directly from the input stack slot.
964 //===---------------------------------------------------------------------===//
This:

int foo (int x) { return (x & 65535) | 255; }

should compile into:

        movzwl  4(%esp), %eax
        orl     $255, %eax
        ret
983 //===---------------------------------------------------------------------===//
We're codegen'ing multiply of long longs inefficiently:

unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
  return arg1 * arg2;
}
We compile to (-fomit-frame-pointer):
1001 imull 20(%esp), %ecx
1007 This looks like a scheduling deficiency and lack of remat of the load from
1008 the argument area. ICC apparently produces:
1011 imull 12(%esp), %ecx
Note that it remat'd loads from 4(%esp) and 12(%esp). See this GCC PR:
1021 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
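For reference, the decomposition being scheduled, as a C sketch (hypothetical
name): a 64x64->64 multiply on x86-32 is one widening mull plus two imulls
for the cross terms.

unsigned long long llm_ref(unsigned long long a, unsigned long long b) {
  unsigned alo = (unsigned)a, ahi = (unsigned)(a >> 32);
  unsigned blo = (unsigned)b, bhi = (unsigned)(b >> 32);
  unsigned long long lo = (unsigned long long)alo * blo;  /* one mull   */
  unsigned hi = alo * bhi + ahi * blo;                    /* two imulls */
  return lo + ((unsigned long long)hi << 32);
}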
1023 //===---------------------------------------------------------------------===//
We can fold a store into "zeroing a reg".  Instead of:

        xorl %eax, %eax
        movl %eax, 124(%esp)

we should get:

        movl $0, 124(%esp)

if the flags of the xor are dead.
1036 Likewise, we isel "x<<1" into "add reg,reg". If reg is spilled, this should
1037 be folded into: shl [mem], 1
1039 //===---------------------------------------------------------------------===//
1041 In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1042 or and instruction, for example:
1044 xorpd LCPI1_0, %xmm2
However, if xmm2 gets spilled, we end up with really ugly code like this:

        movsd (%esp), %xmm0
        xorpd LCPI1_0, %xmm0
        movsd %xmm0, (%esp)
1052 Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1053 the neg/abs instruction, turning it into an *integer* operation, like this:
1055 xorl 2147483648, [mem+4] ## 2147483648 = (1 << 31)
1057 you could also use xorb, but xorl is less likely to lead to a partial register
stall. Here is a contrived testcase:

double a[] = {1,2,3,4};
void test(double *P) {
  double X = a[2];
  double Y = -X;
  X = X + Y;
  P[0] = X;
  P[1] = Y;
}
1071 //===---------------------------------------------------------------------===//
The generated code on x86 for checking for signed overflow on a multiply done
the obvious way is much longer than it needs to be.

int x(int a, int b) {
  long long prod = (long long)a*b;
  return prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
}
1081 See PR2053 for more details.
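A sketch of what the check could become (using __builtin_smul_overflow, which
GCC and Clang provide; imull sets OF exactly when the full product does not
fit in 32 bits):

int mul_overflows(int a, int b) {
  int prod;
  /* One imull plus a flag read, instead of a widening multiply and two
     64-bit comparisons. */
  return __builtin_smul_overflow(a, b, &prod);
}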
1083 //===---------------------------------------------------------------------===//
We should investigate using cdq/cltd (effect: edx = sar eax, 31)
more aggressively; it should cost the same as a move+shift on any modern
processor, but it's a lot shorter.  The downside is that it puts more
pressure on register allocation because it has fixed operands.
1091 int abs(int x) {return x < 0 ? -x : x;}
gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:

abs:
        movl    4(%esp), %eax
        cltd
        xorl    %edx, %eax
        subl    %edx, %eax
        ret
1101 //===---------------------------------------------------------------------===//
Take the following code (from
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):

extern unsigned char first_one[65536];
int FirstOnet(unsigned long long arg1)
{
  if (arg1 >> 48)
    return (first_one[arg1 >> 48]);
  return 0;
}
The following code is currently generated:

FirstOnet:
        movl    8(%esp), %eax
        cmpl    $65536, %eax
        jb      .LBB1_2 # UnifiedReturnBlock
.LBB1_1: # ifthen
        shrl    $16, %eax
        movzbl  first_one(%eax), %eax
        ret
.LBB1_2: # UnifiedReturnBlock
        xorl    %eax, %eax
        ret
1129 We could change the "movl 8(%esp), %eax" into "movzwl 10(%esp), %eax"; this
1130 lets us change the cmpl into a testl, which is shorter, and eliminate the shift.
1132 //===---------------------------------------------------------------------===//
1134 We compile this function:
define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext %d) nounwind {
entry:
1138 %tmp2 = icmp eq i8 %d, 0 ; <i1> [#uses=1]
1139 br i1 %tmp2, label %bb7, label %bb
1141 bb: ; preds = %entry
        %tmp6 = add i32 %b, %a          ; <i32> [#uses=1]
        ret i32 %tmp6
1145 bb7: ; preds = %entry
        %tmp10 = sub i32 %a, %c         ; <i32> [#uses=1]
        ret i32 %tmp10
}
The generated code has an obviously unnecessary movl in .LBB0_2, and we could
eliminate a couple more movls by putting 4(%esp) into %eax instead of %ecx.
1170 //===---------------------------------------------------------------------===//
In this code:

        cvtss2sd LCPI1_0, %xmm1
1179 movsd 176(%esp), %xmm2
1184 mulsd LCPI1_23, %xmm4
1185 addsd LCPI1_24, %xmm4
1187 addsd LCPI1_25, %xmm4
1189 addsd LCPI1_26, %xmm4
1191 addsd LCPI1_27, %xmm4
1193 addsd LCPI1_28, %xmm4
1197 movsd 152(%esp), %xmm1
1199 movsd %xmm1, 152(%esp)
1203 LBB1_16: # bb358.loopexit
1204 movsd 152(%esp), %xmm0
1206 addsd LCPI1_22, %xmm0
1207 movsd %xmm0, 152(%esp)
Rather than spilling the result of the last addsd in the loop, we should have
inserted a copy to split the interval (one for the duration of the loop, one
1211 extending to the fall through). The register pressure in the loop isn't high
1212 enough to warrant the spill.
1214 Also check why xmm7 is not used at all in the function.
1216 //===---------------------------------------------------------------------===//
1220 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128-S128"
1221 target triple = "i386-apple-darwin8"
1222 @in_exit.4870.b = internal global i1 false ; <i1*> [#uses=2]
define fastcc void @abort_gzip() noreturn nounwind {
entry:
1225 %tmp.b.i = load i1* @in_exit.4870.b ; <i1> [#uses=1]
1226 br i1 %tmp.b.i, label %bb.i, label %bb4.i
1227 bb.i: ; preds = %entry
        tail call void @exit( i32 1 ) noreturn nounwind
        unreachable
1230 bb4.i: ; preds = %entry
1231 store i1 true, i1* @in_exit.4870.b
        tail call void @exit( i32 1 ) noreturn nounwind
        unreachable
}
declare void @exit(i32) noreturn nounwind

This compiles into:

_abort_gzip:                            ## @abort_gzip
## BB#0:                                ## %entry
        movb    _in_exit.4870.b, %al
        cmpb    $1, %al
        jne     LBB0_2
1245 We somehow miss folding the movb into the cmpb.
1247 //===---------------------------------------------------------------------===//
This:

int test(int x, int y) {
  return x-y-1;
}
1263 it would be better to codegen as: x+~y (notl+addl)
1265 //===---------------------------------------------------------------------===//
int foo(const char *str,...)
{
 __builtin_va_list a; int x;
 __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
 return x;
}
1276 gets compiled into this on x86-64:
1278 movaps %xmm7, 160(%rsp)
1279 movaps %xmm6, 144(%rsp)
1280 movaps %xmm5, 128(%rsp)
1281 movaps %xmm4, 112(%rsp)
1282 movaps %xmm3, 96(%rsp)
1283 movaps %xmm2, 80(%rsp)
1284 movaps %xmm1, 64(%rsp)
1285 movaps %xmm0, 48(%rsp)
1292 movq %rax, 192(%rsp)
1293 leaq 208(%rsp), %rax
1294 movq %rax, 184(%rsp)
1297 movl 176(%rsp), %eax
1301 movq 184(%rsp), %rcx
1303 movq %rax, 184(%rsp)
1311 addq 192(%rsp), %rcx
1312 movl %eax, 176(%rsp)
1318 leaq 104(%rsp), %rax
1319 movq %rsi, -80(%rsp)
1321 movq %rax, -112(%rsp)
1322 leaq -88(%rsp), %rax
1323 movq %rax, -104(%rsp)
1327 movq -112(%rsp), %rdx
1335 addq -104(%rsp), %rdx
1337 movl %eax, -120(%rsp)
and it gets compiled into similarly roundabout code on x86.
1362 //===---------------------------------------------------------------------===//
1364 Teach tblgen not to check bitconvert source type in some cases. This allows us
1365 to consolidate the following patterns in X86InstrMMX.td:
def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                  (iPTR 0))))),
          (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
                                                 (iPTR 0))))),
          (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1377 There are other cases in various td files.
1379 //===---------------------------------------------------------------------===//
1381 Take something like the following on x86-32:
1382 unsigned a(unsigned long long x, unsigned y) {return x % y;}
We currently generate a libcall, but we really shouldn't: the expansion is
shorter and likely faster than the libcall.  The expected code is something
like this:

        movl    8(%esp), %eax
        movl    12(%esp), %ecx
        xorl    %edx, %edx
        divl    %ecx
        movl    4(%esp), %eax
        divl    %ecx
        movl    %edx, %eax
        ret

A similar code sequence works for division.
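In C terms, the expansion is the schoolbook two-step (a sketch with a
hypothetical name): each step maps to one divl, and the first remainder being
less than y guarantees the second divl cannot fault.

unsigned umod64_32(unsigned long long x, unsigned y) {
  unsigned hi = (unsigned)(x >> 32), lo = (unsigned)x;
  unsigned r1 = hi % y;                                         /* first divl  */
  return (unsigned)((((unsigned long long)r1 << 32) | lo) % y); /* second divl */
}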
1399 //===---------------------------------------------------------------------===//
These should compile to the same code, but the latter codegens to useless
instructions on X86. This may be a trivial DAG combine (GCC PR7061):
struct s1 { unsigned char a, b; };
unsigned long f1(struct s1 x) {
    return x.a + x.b;
}
struct s2 { unsigned a: 8, b: 8; };
unsigned long f2(struct s2 x) {
    return x.a + x.b;
}
1413 //===---------------------------------------------------------------------===//
1415 We currently compile this:
define i32 @func1(i32 %v1, i32 %v2) nounwind {
entry:
  %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
  %sum = extractvalue {i32, i1} %t, 0
  %obit = extractvalue {i32, i1} %t, 1
  br i1 %obit, label %overflow, label %normal
normal:
  ret i32 %sum
overflow:
  call void @llvm.trap()
  unreachable
}
1429 declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1430 declare void @llvm.trap()
into:

_func1:
        addl    %esi, %edi
        jo      LBB1_2  ## overflow
        movl    %edi, %eax
        ret
LBB1_2:
        ud2
1443 it would be nice to produce "into" someday.
1445 //===---------------------------------------------------------------------===//
void vec_mpys1(int y[], const int x[], int scaler) {
  int i;
  for (i = 0; i < 150; i++)
    y[i] += (((long long)scaler * (long long)x[i]) >> 31);
}
Compiles to this loop with GCC 3.x:

.L5:
        movl    (%ebx,%ecx,4), %eax
        imull   %edi
        shrdl   $31, %edx, %eax
        addl    %eax, (%esi,%ecx,4)
        incl    %ecx
        cmpl    $149, %ecx
        jle     .L5
1466 llvm-gcc compiles it to the much uglier:
1470 movl (%eax,%edi,4), %ebx
1479 shldl $1, %eax, %ebx
1481 addl %ebx, (%eax,%edi,4)
The issue is that we hoist the cast of "scaler" to long long outside of the
loop, so the value comes into the loop as two values, and
RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
BUILD_PAIR it constructs to represent the cast value.
1491 This can be handled by making CodeGenPrepare sink the cast.
1493 //===---------------------------------------------------------------------===//
1495 Test instructions can be eliminated by using EFLAGS values from arithmetic
1496 instructions. This is currently not done for mul, and, or, xor, neg, shl,
1497 sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
for read-modify-write instructions.  It is also currently not done if the
1499 OF or CF flags are needed.
1501 The shift operators have the complication that when the shift count is
1502 zero, EFLAGS is not set, so they can only subsume a test instruction if
1503 the shift count is known to be non-zero. Also, using the EFLAGS value
1504 from a shift is apparently very slow on some x86 implementations.
1506 In read-modify-write instructions, the root node in the isel match is
1507 the store, and isel has no way for the use of the EFLAGS result of the
1508 arithmetic to be remapped to the new node.
Add and subtract instructions set OF on signed overflow and CF on unsigned
1511 overflow, while test instructions always clear OF and CF. In order to
1512 replace a test with an add or subtract in a situation where OF or CF is
1513 needed, codegen must be able to prove that the operation cannot see
1514 signed or unsigned overflow, respectively.
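As an illustration of the shift complication (a sketch; not from the original
note, names hypothetical):

int shift_and_branch(unsigned x, int n) {
  /* The shll here cannot provide the ZF for the comparison when n might
     be zero, because a zero-count shift leaves EFLAGS untouched; codegen
     must emit a separate testl.  Only a provably non-zero n would let
     the shift subsume the test. */
  return (x << n) != 0;
}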
1516 //===---------------------------------------------------------------------===//
1518 memcpy/memmove do not lower to SSE copies when possible. A silly example is:
1519 define <16 x float> @foo(<16 x float> %A) nounwind {
1520 %tmp = alloca <16 x float>, align 16
1521 %tmp2 = alloca <16 x float>, align 16
1522 store <16 x float> %A, <16 x float>* %tmp
1523 %s = bitcast <16 x float>* %tmp to i8*
1524 %s2 = bitcast <16 x float>* %tmp2 to i8*
1525 call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
  %R = load <16 x float>* %tmp2
  ret <16 x float> %R
}
1530 declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
which compiles to:

        movaps %xmm3, 112(%esp)
1537 movaps %xmm2, 96(%esp)
1538 movaps %xmm1, 80(%esp)
1539 movaps %xmm0, 64(%esp)
1541 movl %eax, 124(%esp)
1543 movl %eax, 120(%esp)
1545 <many many more 32-bit copies>
1546 movaps (%esp), %xmm0
1547 movaps 16(%esp), %xmm1
1548 movaps 32(%esp), %xmm2
1549 movaps 48(%esp), %xmm3
1553 On Nehalem, it may even be cheaper to just use movups when unaligned than to
1554 fall back to lower-granularity chunks.
1556 //===---------------------------------------------------------------------===//
1558 Implement processor-specific optimizations for parity with GCC on these
1559 processors. GCC does two optimizations:
1. ix86_pad_returns inserts a noop before ret instructions if they are
   immediately preceded by a conditional branch or are the target of a jump.
1563 2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1564 code contains more than 3 branches.
The first one is done for all AMDs, Core2, and "Generic".
The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
Core 2, and "Generic".
1570 //===---------------------------------------------------------------------===//
int x(int a) { return (a&0xf0)>>4; }

compiles to:

        movl    4(%esp), %eax
        andl    $240, %eax
        shrl    $4, %eax
        ret

It would be better to produce:

        movzbl  4(%esp), %eax
        shrl    $4, %eax
        ret
1585 //===---------------------------------------------------------------------===//
Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
properly.
When the return value is not used (i.e. we only care about the value in
memory), x86 does not have to use xadd or a cmpxchg loop to implement these.
Instead, it can use add, sub, inc, and dec instructions with the "lock" prefix.
This is currently implemented using a bit of an instruction selection trick. The
issue is that the target independent pattern produces one output and a chain,
and we want to map it into one that just outputs a chain. The current trick is
to select it into a MERGE_VALUES with the first definition being an
implicit_def. The proper solution is to add new ISD opcodes for the no-output
variant. The DAG combiner can then transform the node before it gets to target
node selection.
1601 Problem #2 is we are adding a whole bunch of x86 atomic instructions when in
1602 fact these instructions are identical to the non-lock versions. We need a way to
1603 add target specific information to target nodes and have this information
carried over to machine instructions. The asm printer (or JIT) can use this
information to add the "lock" prefix.
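As an illustration of the first problem (a sketch; function names are
hypothetical): only the variant whose result is used needs the old value back.

void dec_no_result(int *p) {
  __sync_sub_and_fetch(p, 1);         /* can become: lock subl $1, (mem) */
}

int dec_with_result(int *p) {
  return __sync_sub_and_fetch(p, 1);  /* needs xadd or a cmpxchg loop */
}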
1607 //===---------------------------------------------------------------------===//
struct B {
  unsigned char y0 : 1;
};

int bar(struct B* a) { return a->y0; }

define i32 @bar(%struct.B* nocapture %a) nounwind readonly optsize {
  %1 = getelementptr inbounds %struct.B* %a, i64 0, i32 0
  %2 = load i8* %1, align 1
  %3 = and i8 %2, 1
  %4 = zext i8 %3 to i32
  ret i32 %4
}
1630 Missed optimization: should be movl+andl.
1632 //===---------------------------------------------------------------------===//
The x86-64 ABI says:
1636 Booleans, when stored in a memory object, are stored as single byte objects the
1637 value of which is always 0 (false) or 1 (true).
1639 We are not using this fact:
1641 int bar(_Bool *a) { return *a; }
define i32 @bar(i8* nocapture %a) nounwind readonly optsize {
  %1 = load i8* %a, align 1, !tbaa !0
  %tmp = and i8 %1, 1
  %2 = zext i8 %tmp to i32
  ret i32 %2
}
1662 //===---------------------------------------------------------------------===//
1664 Consider the following two functions compiled with clang:
1665 _Bool foo(int *x) { return !(*x & 4); }
1666 unsigned bar(int *x) { return !(*x & 4); }
The second function generates more code even though the two functions are
functionally identical.
1686 //===---------------------------------------------------------------------===//
1688 Take the following C code:
1689 int f(int a, int b) { return (unsigned char)a == (unsigned char)b; }
1691 We generate the following IR with clang:
define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = xor i32 %b, %a                 ; <i32> [#uses=1]
  %tmp6 = and i32 %tmp, 255             ; <i32> [#uses=1]
  %cmp = icmp eq i32 %tmp6, 0           ; <i1> [#uses=1]
  %conv5 = zext i1 %cmp to i32          ; <i32> [#uses=1]
  ret i32 %conv5
}
In the resulting x86 code, a cmpb instead of the xorl+testb would be one
instruction shorter.
1710 //===---------------------------------------------------------------------===//
1712 Given the following C code:
1713 int f(int a, int b) { return (signed char)a == (signed char)b; }
1715 We generate the following IR with clang:
define i32 @f(i32 %a, i32 %b) nounwind readnone {
entry:
  %sext = shl i32 %a, 24                ; <i32> [#uses=1]
  %conv1 = ashr i32 %sext, 24           ; <i32> [#uses=1]
  %sext6 = shl i32 %b, 24               ; <i32> [#uses=1]
  %conv4 = ashr i32 %sext6, 24          ; <i32> [#uses=1]
  %cmp = icmp eq i32 %conv1, %conv4     ; <i1> [#uses=1]
  %conv5 = zext i1 %cmp to i32          ; <i32> [#uses=1]
  ret i32 %conv5
}
In the resulting x86 code, it should be possible to eliminate the sign
extensions.
1738 //===---------------------------------------------------------------------===//
1740 LLVM misses a load+store narrowing opportunity in this code:
1742 %struct.bf = type { i64, i16, i16, i32 }
1744 @bfi = external global %struct.bf* ; <%struct.bf**> [#uses=2]
define void @t1() nounwind ssp {
entry:
  %0 = load %struct.bf** @bfi, align 8          ; <%struct.bf*> [#uses=1]
1749 %1 = getelementptr %struct.bf* %0, i64 0, i32 1 ; <i16*> [#uses=1]
1750 %2 = bitcast i16* %1 to i32* ; <i32*> [#uses=2]
1751 %3 = load i32* %2, align 1 ; <i32> [#uses=1]
1752 %4 = and i32 %3, -65537 ; <i32> [#uses=1]
1753 store i32 %4, i32* %2, align 1
1754 %5 = load %struct.bf** @bfi, align 8 ; <%struct.bf*> [#uses=1]
1755 %6 = getelementptr %struct.bf* %5, i64 0, i32 1 ; <i16*> [#uses=1]
1756 %7 = bitcast i16* %6 to i32* ; <i32*> [#uses=2]
1757 %8 = load i32* %7, align 1 ; <i32> [#uses=1]
1758 %9 = and i32 %8, -131073 ; <i32> [#uses=1]
  store i32 %9, i32* %7, align 1
  ret void
}
1763 LLVM currently emits this:
1765 movq bfi(%rip), %rax
1766 andl $-65537, 8(%rax)
1767 movq bfi(%rip), %rax
        andl    $-131073, 8(%rax)
        ret
1771 It could narrow the loads and stores to emit this:
        movq    bfi(%rip), %rax
        andw    $-2, 10(%rax)
        movq    bfi(%rip), %rax
        andw    $-3, 10(%rax)
        ret
1779 The trouble is that there is a TokenFactor between the store and the
1780 load, making it non-trivial to determine if there's anything between
1781 the load and the store which would prohibit narrowing.
1783 //===---------------------------------------------------------------------===//
void foo(unsigned x) {
  if (x == 0) bar();
  else if (x == 1) qux();
}

currently compiles into:

_foo:
        movl    4(%esp), %eax
        cmpl    $1, %eax
        je      LBB0_3
        testl   %eax, %eax
        jne     LBB0_4

the testl could be removed:

_foo:
        movl    4(%esp), %eax
        cmpl    $1, %eax
        je      LBB0_3
        jb      LBB0_1

0 is the only unsigned number < 1.
1808 //===---------------------------------------------------------------------===//
We compile this:

%0 = type { i32, i1 }
define i32 @add32carry(i32 %sum, i32 %x) nounwind readnone ssp {
entry:
  %uadd = tail call %0 @llvm.uadd.with.overflow.i32(i32 %sum, i32 %x)
1817 %cmp = extractvalue %0 %uadd, 1
1818 %inc = zext i1 %cmp to i32
1819 %add = add i32 %x, %sum
  %z.0 = add i32 %add, %inc
  ret i32 %z.0
}

declare %0 @llvm.uadd.with.overflow.i32(i32, i32) nounwind readnone
into:

_add32carry:                            ## @add32carry
## BB#0:                                ## %entry
        addl    %esi, %edi
        sbbl    %ecx, %ecx
        movl    %edi, %eax
        subl    %ecx, %eax
        ret

instead of:

_add32carry:
        leal    (%rsi,%rdi), %eax
        cmpl    %esi, %eax
        adcl    $0, %eax
        ret
1843 //===---------------------------------------------------------------------===//
1845 The hot loop of 256.bzip2 contains code that looks a bit like this:
int foo(char *P, char *Q, int x, int y) {
  if (P[0] != Q[0])
     return P[0] < Q[0];
  if (P[1] != Q[1])
     return P[1] < Q[1];
  if (P[2] != Q[2])
     return P[2] < Q[2];
  return P[3] < Q[3];
}

In the real code, we get a lot more wrong than this.  However, even in this
simplified example we generate suboptimal code:
1874 ## BB#3: ## %if.end38
1879 ## BB#4: ## %if.end60
1882 LBB0_5: ## %if.end60
1887 Note that we generate jumps to LBB0_1 which does a redundant compare. The
1888 redundant compare also forces the register values to be live, which prevents
1889 folding one of the loads into the compare. In contrast, GCC 4.2 produces:
1896 movzbl 1(%rsi), %eax
1899 movzbl 2(%rsi), %eax
1902 movzbl 3(%rdi), %eax
1911 //===---------------------------------------------------------------------===//
1913 For the branch in the following code:
int b(int x, int y) {
  if (x & (1 << (y&7)))
    return 1;
  return 2;
}
We currently generate a movb+andb+movzbl sequence to compute the bit index;
movl+andl would be shorter.
1930 //===---------------------------------------------------------------------===//
struct u1 { float x; float y; };

float foo(struct u1 u) {
  return u.x + u.y;
}
1940 We currently generate:
        pshufd  $1, %xmm0, %xmm1        # xmm1 = xmm0[1,0,0,0]
        addss   %xmm0, %xmm1
        movaps  %xmm1, %xmm0
        ret
1946 We could save an instruction here by commuting the addss.
1948 //===---------------------------------------------------------------------===//
We compile this:

float clamp_float(float a) {
        if (a > 1.0f)
                return 1.0f;
        else if (a < 0.0f)
                return 0.0f;
        else
                return a;
}

to:

clamp_float:                            # @clamp_float
        movss   .LCPI0_0(%rip), %xmm1
        minss   %xmm1, %xmm0
        pxor    %xmm1, %xmm1
        maxss   %xmm1, %xmm0
        ret
1972 //===---------------------------------------------------------------------===//
In this function (from PR9803), the move of 0 could be scheduled above the
test to turn it into an xor reg,reg.
2000 //===---------------------------------------------------------------------===//
GCC PR48986.  We currently compile this:

void foo(int *p) {
  if (__sync_fetch_and_add(p, -1) == 1)
    bar();
}

into:

        movl    $-1, %eax
        lock
        xaddl   %eax, (%rdi)
        cmpl    $1, %eax
        je      LBB0_2

Instead we could generate:

        lock
        decl    (%rdi)
        je      LBB0_2

The trick is to match "fetch_and_add(X, -C) == C".
2025 //===---------------------------------------------------------------------===//
unsigned t(unsigned a, unsigned b) {
  return a <= b ? 5 : -5;
}

We currently generate:

        movl    $5, %ecx
        cmpl    %esi, %edi
        movl    $-5, %eax
        cmovbel %ecx, %eax

GCC:

        cmpl    %edi, %esi
        sbbl    %eax, %eax
        andl    $-10, %eax
        addl    $5, %eax
2043 //===---------------------------------------------------------------------===//