lib/Target/X86/README.txt

   1 //===---------------------------------------------------------------------===//
   2 // Random ideas for the X86 backend.
   3 //===---------------------------------------------------------------------===//
   4
   5 We should add support for the "movbe" instruction, which does a byte-swapping
   6 copy (3-addr bswap + memory support?)  This is available on Atom processors.
   7
   8 //===---------------------------------------------------------------------===//
   9
  10 CodeGen/X86/lea-3.ll:test3 should be a single LEA, not a shift/move.  The X86
  11 backend knows how to three-addressify this shift, but it appears the register
  12 allocator isn't even asking it to do so in this case.  We should investigate
  13 why this isn't happening, it could have significant impact on other important
  14 cases for X86 as well.
  15
  16 //===---------------------------------------------------------------------===//
  17
  18 This should be one DIV/IDIV instruction, not a libcall:
  19
  20 unsigned test(unsigned long long X, unsigned Y) {
  21         return X/Y;
  22 }
  23
  24 This can be done trivially with a custom legalizer.  What about overflow
  25 though?  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14224
  26
  27 //===---------------------------------------------------------------------===//
  28
  29 Improvements to the multiply -> shift/add algorithm:
  30 http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01590.html
  31
  32 //===---------------------------------------------------------------------===//
  33
  34 Improve code like this (occurs fairly frequently, e.g. in LLVM):
  35 long long foo(int x) { return 1LL << x; }
  36
  37 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01109.html
  38 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01128.html
  39 http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01136.html
  40
  41 Another useful one would be  ~0ULL >> X and ~0ULL << X.
  42
  43 One better solution for 1LL << x is:
  44         xorl    %eax, %eax
  45         xorl    %edx, %edx
  46         testb   $32, %cl
  47         sete    %al
  48         setne   %dl
  49         sall    %cl, %eax
  50         sall    %cl, %edx
  51
  52 But that requires good 8-bit subreg support.
  53
  54 Also, this might be better.  It's an extra shift, but it's one instruction
  55 shorter, and doesn't stress 8-bit subreg support.
  56 (From http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01148.html,
  57 but without the unnecessary and.)
  58         movl %ecx, %eax
  59         shrl $5, %eax
  60         movl %eax, %edx
  61         xorl $1, %edx
  62         sall %cl, %eax
  63         sall %cl. %edx
  64
  65 64-bit shifts (in general) expand to really bad code.  Instead of using
  66 cmovs, we should expand to a conditional branch like GCC produces.
  67
  68 //===---------------------------------------------------------------------===//
  69
  70 Compile this:
  71 _Bool f(_Bool a) { return a!=1; }
  72
  73 into:
  74         movzbl  %dil, %eax
  75         xorl    $1, %eax
  76         ret
  77
  78 (Although note that this isn't a legal way to express the code that llvm-gcc
  79 currently generates for that function.)
  80
  81 //===---------------------------------------------------------------------===//
  82
  83 Some isel ideas:
  84
  85 1. Dynamic programming based approach when compile time if not an
  86    issue.
  87 2. Code duplication (addressing mode) during isel.
  88 3. Other ideas from "Register-Sensitive Selection, Duplication, and
  89    Sequencing of Instructions".
  90 4. Scheduling for reduced register pressure.  E.g. "Minimum Register
  91    Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs"
  92    and other related papers.
  93    http://citeseer.ist.psu.edu/govindarajan01minimum.html
  94
  95 //===---------------------------------------------------------------------===//
  96
  97 Should we promote i16 to i32 to avoid partial register update stalls?
  98
  99 //===---------------------------------------------------------------------===//
 100
 101 Leave any_extend as pseudo instruction and hint to register
 102 allocator. Delay codegen until post register allocation.
 103 Note. any_extend is now turned into an INSERT_SUBREG. We still need to teach
 104 the coalescer how to deal with it though.
 105
 106 //===---------------------------------------------------------------------===//
 107
 108 It appears icc use push for parameter passing. Need to investigate.
 109
 110 //===---------------------------------------------------------------------===//
 111
 112 Only use inc/neg/not instructions on processors where they are faster than
 113 add/sub/xor.  They are slower on the P4 due to only updating some processor
 114 flags.
 115
 116 //===---------------------------------------------------------------------===//
 117
 118 The instruction selector sometimes misses folding a load into a compare.  The
 119 pattern is written as (cmp reg, (load p)).  Because the compare isn't
 120 commutative, it is not matched with the load on both sides.  The dag combiner
 121 should be made smart enough to cannonicalize the load into the RHS of a compare
 122 when it can invert the result of the compare for free.
 123
 124 //===---------------------------------------------------------------------===//
 125
 126 In many cases, LLVM generates code like this:
 127
 128 _test:
 129         movl 8(%esp), %eax
 130         cmpl %eax, 4(%esp)
 131         setl %al
 132         movzbl %al, %eax
 133         ret
 134
 135 on some processors (which ones?), it is more efficient to do this:
 136
 137 _test:
 138         movl 8(%esp), %ebx
 139         xor  %eax, %eax
 140         cmpl %ebx, 4(%esp)
 141         setl %al
 142         ret
 143
 144 Doing this correctly is tricky though, as the xor clobbers the flags.
 145
 146 //===---------------------------------------------------------------------===//
 147
 148 We should generate bts/btr/etc instructions on targets where they are cheap or
 149 when codesize is important.  e.g., for:
 150
 151 void setbit(int *target, int bit) {
 152     *target |= (1 << bit);
 153 }
 154 void clearbit(int *target, int bit) {
 155     *target &= ~(1 << bit);
 156 }
 157
 158 //===---------------------------------------------------------------------===//
 159
 160 Instead of the following for memset char*, 1, 10:
 161
 162         movl $16843009, 4(%edx)
 163         movl $16843009, (%edx)
 164         movw $257, 8(%edx)
 165
 166 It might be better to generate
 167
 168         movl $16843009, %eax
 169         movl %eax, 4(%edx)
 170         movl %eax, (%edx)
 171         movw al, 8(%edx)
 172
 173 when we can spare a register. It reduces code size.
 174
 175 //===---------------------------------------------------------------------===//
 176
 177 Evaluate what the best way to codegen sdiv X, (2^C) is.  For X/8, we currently
 178 get this:
 179
 180 define i32 @test1(i32 %X) {
 181     %Y = sdiv i32 %X, 8
 182     ret i32 %Y
 183 }
 184
 185 _test1:
 186         movl 4(%esp), %eax
 187         movl %eax, %ecx
 188         sarl $31, %ecx
 189         shrl $29, %ecx
 190         addl %ecx, %eax
 191         sarl $3, %eax
 192         ret
 193
 194 GCC knows several different ways to codegen it, one of which is this:
 195
 196 _test1:
 197         movl    4(%esp), %eax
 198         cmpl    $-1, %eax
 199         leal    7(%eax), %ecx
 200         cmovle  %ecx, %eax
 201         sarl    $3, %eax
 202         ret
 203
 204 which is probably slower, but it's interesting at least :)
 205
 206 //===---------------------------------------------------------------------===//
 207
 208 We are currently lowering large (1MB+) memmove/memcpy to rep/stosl and rep/movsl
 209 We should leave these as libcalls for everything over a much lower threshold,
 210 since libc is hand tuned for medium and large mem ops (avoiding RFO for large
 211 stores, TLB preheating, etc)
 212
 213 //===---------------------------------------------------------------------===//
 214
 215 Optimize this into something reasonable:
 216  x * copysign(1.0, y) * copysign(1.0, z)
 217
 218 //===---------------------------------------------------------------------===//
 219
 220 Optimize copysign(x, *y) to use an integer load from y.
 221
 222 //===---------------------------------------------------------------------===//
 223
 224 The following tests perform worse with LSR:
 225
 226 lambda, siod, optimizer-eval, ackermann, hash2, nestedloop, strcat, and Treesor.
 227
 228 //===---------------------------------------------------------------------===//
 229
 230 Teach the coalescer to coalesce vregs of different register classes. e.g. FR32 /
 231 FR64 to VR128.
 232
 233 //===---------------------------------------------------------------------===//
 234
 235 Adding to the list of cmp / test poor codegen issues:
 236
 237 int test(__m128 *A, __m128 *B) {
 238   if (_mm_comige_ss(*A, *B))
 239     return 3;
 240   else
 241     return 4;
 242 }
 243
 244 _test:
 245         movl 8(%esp), %eax
 246         movaps (%eax), %xmm0
 247         movl 4(%esp), %eax
 248         movaps (%eax), %xmm1
 249         comiss %xmm0, %xmm1
 250         setae %al
 251         movzbl %al, %ecx
 252         movl $3, %eax
 253         movl $4, %edx
 254         cmpl $0, %ecx
 255         cmove %edx, %eax
 256         ret
 257
 258 Note the setae, movzbl, cmpl, cmove can be replaced with a single cmovae. There
 259 are a number of issues. 1) We are introducing a setcc between the result of the
 260 intrisic call and select. 2) The intrinsic is expected to produce a i32 value
 261 so a any extend (which becomes a zero extend) is added.
 262
 263 We probably need some kind of target DAG combine hook to fix this.
 264
 265 //===---------------------------------------------------------------------===//
 266
 267 We generate significantly worse code for this than GCC:
 268 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
 269 http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701
 270
 271 There is also one case we do worse on PPC.
 272
 273 //===---------------------------------------------------------------------===//
 274
 275 For this:
 276
 277 int test(int a)
 278 {
 279   return a * 3;
 280 }
 281
 282 We currently emits
 283         imull $3, 4(%esp), %eax
 284
 285 Perhaps this is what we really should generate is? Is imull three or four
 286 cycles? Note: ICC generates this:
 287         movl    4(%esp), %eax
 288         leal    (%eax,%eax,2), %eax
 289
 290 The current instruction priority is based on pattern complexity. The former is
 291 more "complex" because it folds a load so the latter will not be emitted.
 292
 293 Perhaps we should use AddedComplexity to give LEA32r a higher priority? We
 294 should always try to match LEA first since the LEA matching code does some
 295 estimate to determine whether the match is profitable.
 296
 297 However, if we care more about code size, then imull is better. It's two bytes
 298 shorter than movl + leal.
 299
 300 On a Pentium M, both variants have the same characteristics with regard
 301 to throughput; however, the multiplication has a latency of four cycles, as
 302 opposed to two cycles for the movl+lea variant.
 303
 304 //===---------------------------------------------------------------------===//
 305
 306 __builtin_ffs codegen is messy.
 307
 308 int ffs_(unsigned X) { return __builtin_ffs(X); }
 309
 310 llvm produces:
 311 ffs_:
 312         movl    4(%esp), %ecx
 313         bsfl    %ecx, %eax
 314         movl    $32, %edx
 315         cmove   %edx, %eax
 316         incl    %eax
 317         xorl    %edx, %edx
 318         testl   %ecx, %ecx
 319         cmove   %edx, %eax
 320         ret
 321
 322 vs gcc:
 323
 324 _ffs_:
 325         movl    $-1, %edx
 326         bsfl    4(%esp), %eax
 327         cmove   %edx, %eax
 328         addl    $1, %eax
 329         ret
 330
 331 Another example of __builtin_ffs (use predsimplify to eliminate a select):
 332
 333 int foo (unsigned long j) {
 334   if (j)
 335     return __builtin_ffs (j) - 1;
 336   else
 337     return 0;
 338 }
 339
 340 //===---------------------------------------------------------------------===//
 341
 342 It appears gcc place string data with linkonce linkage in
 343 .section __TEXT,__const_coal,coalesced instead of
 344 .section __DATA,__const_coal,coalesced.
 345 Take a look at darwin.h, there are other Darwin assembler directives that we
 346 do not make use of.
 347
 348 //===---------------------------------------------------------------------===//
 349
 350 define i32 @foo(i32* %a, i32 %t) {
 351 entry:
 352         br label %cond_true
 353
 354 cond_true:              ; preds = %cond_true, %entry
 355         %x.0.0 = phi i32 [ 0, %entry ], [ %tmp9, %cond_true ]           ; <i32> [#uses=3]
 356         %t_addr.0.0 = phi i32 [ %t, %entry ], [ %tmp7, %cond_true ]             ; <i32> [#uses=1]
 357         %tmp2 = getelementptr i32* %a, i32 %x.0.0               ; <i32*> [#uses=1]
 358         %tmp3 = load i32* %tmp2         ; <i32> [#uses=1]
 359         %tmp5 = add i32 %t_addr.0.0, %x.0.0             ; <i32> [#uses=1]
 360         %tmp7 = add i32 %tmp5, %tmp3            ; <i32> [#uses=2]
 361         %tmp9 = add i32 %x.0.0, 1               ; <i32> [#uses=2]
 362         %tmp = icmp sgt i32 %tmp9, 39           ; <i1> [#uses=1]
 363         br i1 %tmp, label %bb12, label %cond_true
 364
 365 bb12:           ; preds = %cond_true
 366         ret i32 %tmp7
 367 }
 368 is pessimized by -loop-reduce and -indvars
 369
 370 //===---------------------------------------------------------------------===//
 371
 372 u32 to float conversion improvement:
 373
 374 float uint32_2_float( unsigned u ) {
 375   float fl = (int) (u & 0xffff);
 376   float fh = (int) (u >> 16);
 377   fh *= 0x1.0p16f;
 378   return fh + fl;
 379 }
 380
 381 00000000        subl    $0x04,%esp
 382 00000003        movl    0x08(%esp,1),%eax
 383 00000007        movl    %eax,%ecx
 384 00000009        shrl    $0x10,%ecx
 385 0000000c        cvtsi2ss        %ecx,%xmm0
 386 00000010        andl    $0x0000ffff,%eax
 387 00000015        cvtsi2ss        %eax,%xmm1
 388 00000019        mulss   0x00000078,%xmm0
 389 00000021        addss   %xmm1,%xmm0
 390 00000025        movss   %xmm0,(%esp,1)
 391 0000002a        flds    (%esp,1)
 392 0000002d        addl    $0x04,%esp
 393 00000030        ret
 394
 395 //===---------------------------------------------------------------------===//
 396
 397 When using fastcc abi, align stack slot of argument of type double on 8 byte
 398 boundary to improve performance.
 399
 400 //===---------------------------------------------------------------------===//
 401
 402 Codegen:
 403
 404 int f(int a, int b) {
 405   if (a == 4 || a == 6)
 406     b++;
 407   return b;
 408 }
 409
 410
 411 as:
 412
 413 or eax, 2
 414 cmp eax, 6
 415 jz label
 416
 417 //===---------------------------------------------------------------------===//
 418
 419 GCC's ix86_expand_int_movcc function (in i386.c) has a ton of interesting
 420 simplifications for integer "x cmp y ? a : b".  For example, instead of:
 421
 422 int G;
 423 void f(int X, int Y) {
 424   G = X < 0 ? 14 : 13;
 425 }
 426
 427 compiling to:
 428
 429 _f:
 430         movl $14, %eax
 431         movl $13, %ecx
 432         movl 4(%esp), %edx
 433         testl %edx, %edx
 434         cmovl %eax, %ecx
 435         movl %ecx, _G
 436         ret
 437
 438 it could be:
 439 _f:
 440         movl    4(%esp), %eax
 441         sarl    $31, %eax
 442         notl    %eax
 443         addl    $14, %eax
 444         movl    %eax, _G
 445         ret
 446
 447 etc.
 448
 449 Another is:
 450 int usesbb(unsigned int a, unsigned int b) {
 451        return (a < b ? -1 : 0);
 452 }
 453 to:
 454 _usesbb:
 455         movl    8(%esp), %eax
 456         cmpl    %eax, 4(%esp)
 457         sbbl    %eax, %eax
 458         ret
 459
 460 instead of:
 461 _usesbb:
 462         xorl    %eax, %eax
 463         movl    8(%esp), %ecx
 464         cmpl    %ecx, 4(%esp)
 465         movl    $4294967295, %ecx
 466         cmovb   %ecx, %eax
 467         ret
 468
 469 //===---------------------------------------------------------------------===//
 470
 471 Consider the expansion of:
 472
 473 define i32 @test3(i32 %X) {
 474         %tmp1 = urem i32 %X, 255
 475         ret i32 %tmp1
 476 }
 477
 478 Currently it compiles to:
 479
 480 ...
 481         movl $2155905153, %ecx
 482         movl 8(%esp), %esi
 483         movl %esi, %eax
 484         mull %ecx
 485 ...
 486
 487 This could be "reassociated" into:
 488
 489         movl $2155905153, %eax
 490         movl 8(%esp), %ecx
 491         mull %ecx
 492
 493 to avoid the copy.  In fact, the existing two-address stuff would do this
 494 except that mul isn't a commutative 2-addr instruction.  I guess this has
 495 to be done at isel time based on the #uses to mul?
 496
 497 //===---------------------------------------------------------------------===//
 498
 499 Make sure the instruction which starts a loop does not cross a cacheline
 500 boundary. This requires knowning the exact length of each machine instruction.
 501 That is somewhat complicated, but doable. Example 256.bzip2:
 502
 503 In the new trace, the hot loop has an instruction which crosses a cacheline
 504 boundary.  In addition to potential cache misses, this can't help decoding as I
 505 imagine there has to be some kind of complicated decoder reset and realignment
 506 to grab the bytes from the next cacheline.
 507
 508 532  532 0x3cfc movb     (1809(%esp, %esi), %bl   <<<--- spans 2 64 byte lines
 509 942  942 0x3d03 movl     %dh, (1809(%esp, %esi)
 510 937  937 0x3d0a incl     %esi
 511 3    3   0x3d0b cmpb     %bl, %dl
 512 27   27  0x3d0d jnz      0x000062db <main+11707>
 513
 514 //===---------------------------------------------------------------------===//
 515
 516 In c99 mode, the preprocessor doesn't like assembly comments like #TRUNCATE.
 517
 518 //===---------------------------------------------------------------------===//
 519
 520 This could be a single 16-bit load.
 521
 522 int f(char *p) {
 523     if ((p[0] == 1) & (p[1] == 2)) return 1;
 524     return 0;
 525 }
 526
 527 //===---------------------------------------------------------------------===//
 528
 529 We should inline lrintf and probably other libc functions.
 530
 531 //===---------------------------------------------------------------------===//
 532
 533 Start using the flags more.  For example, compile:
 534
 535 int add_zf(int *x, int y, int a, int b) {
 536      if ((*x += y) == 0)
 537           return a;
 538      else
 539           return b;
 540 }
 541
 542 to:
 543        addl    %esi, (%rdi)
 544        movl    %edx, %eax
 545        cmovne  %ecx, %eax
 546        ret
 547 instead of:
 548
 549 _add_zf:
 550         addl (%rdi), %esi
 551         movl %esi, (%rdi)
 552         testl %esi, %esi
 553         cmove %edx, %ecx
 554         movl %ecx, %eax
 555         ret
 556
 557 and:
 558
 559 int add_zf(int *x, int y, int a, int b) {
 560      if ((*x + y) < 0)
 561           return a;
 562      else
 563           return b;
 564 }
 565
 566 to:
 567
 568 add_zf:
 569         addl    (%rdi), %esi
 570         movl    %edx, %eax
 571         cmovns  %ecx, %eax
 572         ret
 573
 574 instead of:
 575
 576 _add_zf:
 577         addl (%rdi), %esi
 578         testl %esi, %esi
 579         cmovs %edx, %ecx
 580         movl %ecx, %eax
 581         ret
 582
 583 //===---------------------------------------------------------------------===//
 584
 585 These two functions have identical effects:
 586
 587 unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
 588 unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
 589
 590 We currently compile them to:
 591
 592 _f:
 593         movl 4(%esp), %eax
 594         movl %eax, %ecx
 595         incl %ecx
 596         movl 8(%esp), %edx
 597         cmpl %edx, %ecx
 598         jne LBB1_2      #UnifiedReturnBlock
 599 LBB1_1: #cond_true
 600         addl $2, %eax
 601         ret
 602 LBB1_2: #UnifiedReturnBlock
 603         movl %ecx, %eax
 604         ret
 605 _f2:
 606         movl 4(%esp), %eax
 607         movl %eax, %ecx
 608         incl %ecx
 609         cmpl 8(%esp), %ecx
 610         sete %cl
 611         movzbl %cl, %ecx
 612         leal 1(%ecx,%eax), %eax
 613         ret
 614
 615 both of which are inferior to GCC's:
 616
 617 _f:
 618         movl    4(%esp), %edx
 619         leal    1(%edx), %eax
 620         addl    $2, %edx
 621         cmpl    8(%esp), %eax
 622         cmove   %edx, %eax
 623         ret
 624 _f2:
 625         movl    4(%esp), %eax
 626         addl    $1, %eax
 627         xorl    %edx, %edx
 628         cmpl    8(%esp), %eax
 629         sete    %dl
 630         addl    %edx, %eax
 631         ret
 632
 633 //===---------------------------------------------------------------------===//
 634
 635 This code:
 636
 637 void test(int X) {
 638   if (X) abort();
 639 }
 640
 641 is currently compiled to:
 642
 643 _test:
 644         subl $12, %esp
 645         cmpl $0, 16(%esp)
 646         jne LBB1_1
 647         addl $12, %esp
 648         ret
 649 LBB1_1:
 650         call L_abort$stub
 651
 652 It would be better to produce:
 653
 654 _test:
 655         subl $12, %esp
 656         cmpl $0, 16(%esp)
 657         jne L_abort$stub
 658         addl $12, %esp
 659         ret
 660
 661 This can be applied to any no-return function call that takes no arguments etc.
 662 Alternatively, the stack save/restore logic could be shrink-wrapped, producing
 663 something like this:
 664
 665 _test:
 666         cmpl $0, 4(%esp)
 667         jne LBB1_1
 668         ret
 669 LBB1_1:
 670         subl $12, %esp
 671         call L_abort$stub
 672
 673 Both are useful in different situations.  Finally, it could be shrink-wrapped
 674 and tail called, like this:
 675
 676 _test:
 677         cmpl $0, 4(%esp)
 678         jne LBB1_1
 679         ret
 680 LBB1_1:
 681         pop %eax   # realign stack.
 682         call L_abort$stub
 683
 684 Though this probably isn't worth it.
 685
 686 //===---------------------------------------------------------------------===//
 687
 688 We need to teach the codegen to convert two-address INC instructions to LEA
 689 when the flags are dead (likewise dec).  For example, on X86-64, compile:
 690
 691 int foo(int A, int B) {
 692   return A+1;
 693 }
 694
 695 to:
 696
 697 _foo:
 698         leal    1(%edi), %eax
 699         ret
 700
 701 instead of:
 702
 703 _foo:
 704         incl %edi
 705         movl %edi, %eax
 706         ret
 707
 708 Another example is:
 709
 710 ;; X's live range extends beyond the shift, so the register allocator
 711 ;; cannot coalesce it with Y.  Because of this, a copy needs to be
 712 ;; emitted before the shift to save the register value before it is
 713 ;; clobbered.  However, this copy is not needed if the register
 714 ;; allocator turns the shift into an LEA.  This also occurs for ADD.
 715
 716 ; Check that the shift gets turned into an LEA.
 717 ; RUN: llvm-as < %s | llc -march=x86 -x86-asm-syntax=intel | \
 718 ; RUN:   not grep {mov E.X, E.X}
 719
 720 @G = external global i32                ; <i32*> [#uses=3]
 721
 722 define i32 @test1(i32 %X, i32 %Y) {
 723         %Z = add i32 %X, %Y             ; <i32> [#uses=1]
 724         volatile store i32 %Y, i32* @G
 725         volatile store i32 %Z, i32* @G
 726         ret i32 %X
 727 }
 728
 729 define i32 @test2(i32 %X) {
 730         %Z = add i32 %X, 1              ; <i32> [#uses=1]
 731         volatile store i32 %Z, i32* @G
 732         ret i32 %X
 733 }
 734
 735 //===---------------------------------------------------------------------===//
 736
 737 Sometimes it is better to codegen subtractions from a constant (e.g. 7-x) with
 738 a neg instead of a sub instruction.  Consider:
 739
 740 int test(char X) { return 7-X; }
 741
 742 we currently produce:
 743 _test:
 744         movl $7, %eax
 745         movsbl 4(%esp), %ecx
 746         subl %ecx, %eax
 747         ret
 748
 749 We would use one fewer register if codegen'd as:
 750
 751         movsbl 4(%esp), %eax
 752         neg %eax
 753         add $7, %eax
 754         ret
 755
 756 Note that this isn't beneficial if the load can be folded into the sub.  In
 757 this case, we want a sub:
 758
 759 int test(int X) { return 7-X; }
 760 _test:
 761         movl $7, %eax
 762         subl 4(%esp), %eax
 763         ret
 764
 765 //===---------------------------------------------------------------------===//
 766
 767 Leaf functions that require one 4-byte spill slot have a prolog like this:
 768
 769 _foo:
 770         pushl   %esi
 771         subl    $4, %esp
 772 ...
 773 and an epilog like this:
 774         addl    $4, %esp
 775         popl    %esi
 776         ret
 777
 778 It would be smaller, and potentially faster, to push eax on entry and to
 779 pop into a dummy register instead of using addl/subl of esp.  Just don't pop
 780 into any return registers :)
 781
 782 //===---------------------------------------------------------------------===//
 783
 784 The X86 backend should fold (branch (or (setcc, setcc))) into multiple
 785 branches.  We generate really poor code for:
 786
 787 double testf(double a) {
 788        return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
 789 }
 790
 791 For example, the entry BB is:
 792
 793 _testf:
 794         subl    $20, %esp
 795         pxor    %xmm0, %xmm0
 796         movsd   24(%esp), %xmm1
 797         ucomisd %xmm0, %xmm1
 798         setnp   %al
 799         sete    %cl
 800         testb   %cl, %al
 801         jne     LBB1_5  # UnifiedReturnBlock
 802 LBB1_1: # cond_true
 803
 804
 805 it would be better to replace the last four instructions with:
 806
 807         jp LBB1_1
 808         je LBB1_5
 809 LBB1_1:
 810
 811 We also codegen the inner ?: into a diamond:
 812
 813        cvtss2sd        LCPI1_0(%rip), %xmm2
 814         cvtss2sd        LCPI1_1(%rip), %xmm3
 815         ucomisd %xmm1, %xmm0
 816         ja      LBB1_3  # cond_true
 817 LBB1_2: # cond_true
 818         movapd  %xmm3, %xmm2
 819 LBB1_3: # cond_true
 820         movapd  %xmm2, %xmm0
 821         ret
 822
 823 We should sink the load into xmm3 into the LBB1_2 block.  This should
 824 be pretty easy, and will nuke all the copies.
 825
 826 //===---------------------------------------------------------------------===//
 827
 828 This:
 829         #include <algorithm>
 830         inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
 831         { return std::make_pair(a + b, a + b < a); }
 832         bool no_overflow(unsigned a, unsigned b)
 833         { return !full_add(a, b).second; }
 834
 835 Should compile to:
 836
 837
 838         _Z11no_overflowjj:
 839                 addl    %edi, %esi
 840                 setae   %al
 841                 ret
 842
 843 FIXME: That code looks wrong; bool return is normally defined as zext.
 844
 845 on x86-64, not:
 846
 847 __Z11no_overflowjj:
 848         addl    %edi, %esi
 849         cmpl    %edi, %esi
 850         setae   %al
 851         movzbl  %al, %eax
 852         ret
 853
 854
 855 //===---------------------------------------------------------------------===//
 856
 857 Re-materialize MOV32r0 etc. with xor instead of changing them to moves if the
 858 condition register is dead. xor reg reg is shorter than mov reg, #0.
 859
 860 //===---------------------------------------------------------------------===//
 861
 862 The following code:
 863
 864 bb114.preheader:                ; preds = %cond_next94
 865         %tmp231232 = sext i16 %tmp62 to i32             ; <i32> [#uses=1]
 866         %tmp233 = sub i32 32, %tmp231232                ; <i32> [#uses=1]
 867         %tmp245246 = sext i16 %tmp65 to i32             ; <i32> [#uses=1]
 868         %tmp252253 = sext i16 %tmp68 to i32             ; <i32> [#uses=1]
 869         %tmp254 = sub i32 32, %tmp252253                ; <i32> [#uses=1]
 870         %tmp553554 = bitcast i16* %tmp37 to i8*         ; <i8*> [#uses=2]
 871         %tmp583584 = sext i16 %tmp98 to i32             ; <i32> [#uses=1]
 872         %tmp585 = sub i32 32, %tmp583584                ; <i32> [#uses=1]
 873         %tmp614615 = sext i16 %tmp101 to i32            ; <i32> [#uses=1]
 874         %tmp621622 = sext i16 %tmp104 to i32            ; <i32> [#uses=1]
 875         %tmp623 = sub i32 32, %tmp621622                ; <i32> [#uses=1]
 876         br label %bb114
 877
 878 produces:
 879
 880 LBB3_5: # bb114.preheader
 881         movswl  -68(%ebp), %eax
 882         movl    $32, %ecx
 883         movl    %ecx, -80(%ebp)
 884         subl    %eax, -80(%ebp)
 885         movswl  -52(%ebp), %eax
 886         movl    %ecx, -84(%ebp)
 887         subl    %eax, -84(%ebp)
 888         movswl  -70(%ebp), %eax
 889         movl    %ecx, -88(%ebp)
 890         subl    %eax, -88(%ebp)
 891         movswl  -50(%ebp), %eax
 892         subl    %eax, %ecx
 893         movl    %ecx, -76(%ebp)
 894         movswl  -42(%ebp), %eax
 895         movl    %eax, -92(%ebp)
 896         movswl  -66(%ebp), %eax
 897         movl    %eax, -96(%ebp)
 898         movw    $0, -98(%ebp)
 899
 900 This appears to be bad because the RA is not folding the store to the stack
 901 slot into the movl.  The above instructions could be:
 902         movl    $32, -80(%ebp)
 903 ...
 904         movl    $32, -84(%ebp)
 905 ...
 906 This seems like a cross between remat and spill folding.
 907
 908 This has redundant subtractions of %eax from a stack slot. However, %ecx doesn't
 909 change, so we could simply subtract %eax from %ecx first and then use %ecx (or
 910 vice-versa).
 911
 912 //===---------------------------------------------------------------------===//
 913
 914 This code:
 915
 916         %tmp659 = icmp slt i16 %tmp654, 0               ; <i1> [#uses=1]
 917         br i1 %tmp659, label %cond_true662, label %cond_next715
 918
 919 produces this:
 920
 921         testw   %cx, %cx
 922         movswl  %cx, %esi
 923         jns     LBB4_109        # cond_next715
 924
 925 Shark tells us that using %cx in the testw instruction is sub-optimal. It
 926 suggests using the 32-bit register (which is what ICC uses).
 927
 928 //===---------------------------------------------------------------------===//
 929
 930 We compile this:
 931
 932 void compare (long long foo) {
 933   if (foo < 4294967297LL)
 934     abort();
 935 }
 936
 937 to:
 938
 939 compare:
 940         subl    $4, %esp
 941         cmpl    $0, 8(%esp)
 942         setne   %al
 943         movzbw  %al, %ax
 944         cmpl    $1, 12(%esp)
 945         setg    %cl
 946         movzbw  %cl, %cx
 947         cmove   %ax, %cx
 948         testb   $1, %cl
 949         jne     .LBB1_2 # UnifiedReturnBlock
 950 .LBB1_1:        # ifthen
 951         call    abort
 952 .LBB1_2:        # UnifiedReturnBlock
 953         addl    $4, %esp
 954         ret
 955
 956 (also really horrible code on ppc).  This is due to the expand code for 64-bit
 957 compares.  GCC produces multiple branches, which is much nicer:
 958
 959 compare:
 960         subl    $12, %esp
 961         movl    20(%esp), %edx
 962         movl    16(%esp), %eax
 963         decl    %edx
 964         jle     .L7
 965 .L5:
 966         addl    $12, %esp
 967         ret
 968         .p2align 4,,7
 969 .L7:
 970         jl      .L4
 971         cmpl    $0, %eax
 972         .p2align 4,,8
 973         ja      .L5
 974 .L4:
 975         .p2align 4,,9
 976         call    abort
 977
 978 //===---------------------------------------------------------------------===//
 979
 980 Tail call optimization improvements: Tail call optimization currently
 981 pushes all arguments on the top of the stack (their normal place for
 982 non-tail call optimized calls) that source from the callers arguments
 983 or  that source from a virtual register (also possibly sourcing from
 984 callers arguments).
 985 This is done to prevent overwriting of parameters (see example
 986 below) that might be used later.
 987
 988 example:
 989
 990 int callee(int32, int64);
 991 int caller(int32 arg1, int32 arg2) {
 992   int64 local = arg2 * 2;
 993   return callee(arg2, (int64)local);
 994 }
 995
 996 [arg1]          [!arg2 no longer valid since we moved local onto it]
 997 [arg2]      ->  [(int64)
 998 [RETADDR]        local  ]
 999
1000 Moving arg1 onto the stack slot of callee function would overwrite
1001 arg2 of the caller.
1002
1003 Possible optimizations:
1004
1005
1006  - Analyse the actual parameters of the callee to see which would
1007    overwrite a caller parameter which is used by the callee and only
1008    push them onto the top of the stack.
1009
1010    int callee (int32 arg1, int32 arg2);
1011    int caller (int32 arg1, int32 arg2) {
1012        return callee(arg1,arg2);
1013    }
1014
1015    Here we don't need to write any variables to the top of the stack
1016    since they don't overwrite each other.
1017
1018    int callee (int32 arg1, int32 arg2);
1019    int caller (int32 arg1, int32 arg2) {
1020        return callee(arg2,arg1);
1021    }
1022
1023    Here we need to push the arguments because they overwrite each
1024    other.
1025
1026 //===---------------------------------------------------------------------===//
1027
1028 main ()
1029 {
1030   int i = 0;
1031   unsigned long int z = 0;
1032
1033   do {
1034     z -= 0x00004000;
1035     i++;
1036     if (i > 0x00040000)
1037       abort ();
1038   } while (z > 0);
1039   exit (0);
1040 }
1041
1042 gcc compiles this to:
1043
1044 _main:
1045         subl    $28, %esp
1046         xorl    %eax, %eax
1047         jmp     L2
1048 L3:
1049         cmpl    $262144, %eax
1050         je      L10
1051 L2:
1052         addl    $1, %eax
1053         cmpl    $262145, %eax
1054         jne     L3
1055         call    L_abort$stub
1056 L10:
1057         movl    $0, (%esp)
1058         call    L_exit$stub
1059
1060 llvm:
1061
1062 _main:
1063         subl    $12, %esp
1064         movl    $1, %eax
1065         movl    $16384, %ecx
1066 LBB1_1: # bb
1067         cmpl    $262145, %eax
1068         jge     LBB1_4  # cond_true
1069 LBB1_2: # cond_next
1070         incl    %eax
1071         addl    $4294950912, %ecx
1072         cmpl    $16384, %ecx
1073         jne     LBB1_1  # bb
1074 LBB1_3: # bb11
1075         xorl    %eax, %eax
1076         addl    $12, %esp
1077         ret
1078 LBB1_4: # cond_true
1079         call    L_abort$stub
1080
1081 1. LSR should rewrite the first cmp with induction variable %ecx.
1082 2. DAG combiner should fold
1083         leal    1(%eax), %edx
1084         cmpl    $262145, %edx
1085    =>
1086         cmpl    $262144, %eax
1087
1088 //===---------------------------------------------------------------------===//
1089
1090 define i64 @test(double %X) {
1091         %Y = fptosi double %X to i64
1092         ret i64 %Y
1093 }
1094
1095 compiles to:
1096
1097 _test:
1098         subl    $20, %esp
1099         movsd   24(%esp), %xmm0
1100         movsd   %xmm0, 8(%esp)
1101         fldl    8(%esp)
1102         fisttpll        (%esp)
1103         movl    4(%esp), %edx
1104         movl    (%esp), %eax
1105         addl    $20, %esp
1106         #FP_REG_KILL
1107         ret
1108
1109 This should just fldl directly from the input stack slot.
1110
1111 //===---------------------------------------------------------------------===//
1112
1113 This code:
1114 int foo (int x) { return (x & 65535) | 255; }
1115
1116 Should compile into:
1117
1118 _foo:
1119         movzwl  4(%esp), %eax
1120         orl     $255, %eax
1121         ret
1122
1123 instead of:
1124 _foo:
1125         movl    $255, %eax
1126         orl     4(%esp), %eax
1127         andl    $65535, %eax
1128         ret
1129
1130 //===---------------------------------------------------------------------===//
1131
1132 We're codegen'ing multiply of long longs inefficiently:
1133
1134 unsigned long long LLM(unsigned long long arg1, unsigned long long arg2) {
1135   return arg1 *  arg2;
1136 }
1137
1138 We compile to (fomit-frame-pointer):
1139
1140 _LLM:
1141         pushl   %esi
1142         movl    8(%esp), %ecx
1143         movl    16(%esp), %esi
1144         movl    %esi, %eax
1145         mull    %ecx
1146         imull   12(%esp), %esi
1147         addl    %edx, %esi
1148         imull   20(%esp), %ecx
1149         movl    %esi, %edx
1150         addl    %ecx, %edx
1151         popl    %esi
1152         ret
1153
1154 This looks like a scheduling deficiency and lack of remat of the load from
1155 the argument area.  ICC apparently produces:
1156
1157         movl      8(%esp), %ecx
1158         imull     12(%esp), %ecx
1159         movl      16(%esp), %eax
1160         imull     4(%esp), %eax
1161         addl      %eax, %ecx
1162         movl      4(%esp), %eax
1163         mull      12(%esp)
1164         addl      %ecx, %edx
1165         ret
1166
1167 Note that it remat'd loads from 4(esp) and 12(esp).  See this GCC PR:
1168 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17236
1169
1170 //===---------------------------------------------------------------------===//
1171
1172 We can fold a store into "zeroing a reg".  Instead of:
1173
1174 xorl    %eax, %eax
1175 movl    %eax, 124(%esp)
1176
1177 we should get:
1178
1179 movl    $0, 124(%esp)
1180
1181 if the flags of the xor are dead.
1182
1183 Likewise, we isel "x<<1" into "add reg,reg".  If reg is spilled, this should
1184 be folded into: shl [mem], 1
1185
1186 //===---------------------------------------------------------------------===//
1187
1188 This testcase misses a read/modify/write opportunity (from PR1425):
1189
1190 void vertical_decompose97iH1(int *b0, int *b1, int *b2, int width){
1191     int i;
1192     for(i=0; i<width; i++)
1193         b1[i] += (1*(b0[i] + b2[i])+0)>>0;
1194 }
1195
1196 We compile it down to:
1197
1198 LBB1_2: # bb
1199         movl    (%esi,%edi,4), %ebx
1200         addl    (%ecx,%edi,4), %ebx
1201         addl    (%edx,%edi,4), %ebx
1202         movl    %ebx, (%ecx,%edi,4)
1203         incl    %edi
1204         cmpl    %eax, %edi
1205         jne     LBB1_2  # bb
1206
1207 the inner loop should add to the memory location (%ecx,%edi,4), saving
1208 a mov.  Something like:
1209
1210         movl    (%esi,%edi,4), %ebx
1211         addl    (%edx,%edi,4), %ebx
1212         addl    %ebx, (%ecx,%edi,4)
1213
1214 Here is another interesting example:
1215
1216 void vertical_compose97iH1(int *b0, int *b1, int *b2, int width){
1217     int i;
1218     for(i=0; i<width; i++)
1219         b1[i] -= (1*(b0[i] + b2[i])+0)>>0;
1220 }
1221
1222 We miss the r/m/w opportunity here by using 2 subs instead of an add+sub[mem]:
1223
1224 LBB9_2: # bb
1225         movl    (%ecx,%edi,4), %ebx
1226         subl    (%esi,%edi,4), %ebx
1227         subl    (%edx,%edi,4), %ebx
1228         movl    %ebx, (%ecx,%edi,4)
1229         incl    %edi
1230         cmpl    %eax, %edi
1231         jne     LBB9_2  # bb
1232
1233 Additionally, LSR should rewrite the exit condition of these loops to use
1234 a stride-4 IV, would would allow all the scales in the loop to go away.
1235 This would result in smaller code and more efficient microops.
1236
1237 //===---------------------------------------------------------------------===//
1238
1239 In SSE mode, we turn abs and neg into a load from the constant pool plus a xor
1240 or and instruction, for example:
1241
1242         xorpd   LCPI1_0, %xmm2
1243
1244 However, if xmm2 gets spilled, we end up with really ugly code like this:
1245
1246         movsd   (%esp), %xmm0
1247         xorpd   LCPI1_0, %xmm0
1248         movsd   %xmm0, (%esp)
1249
1250 Since we 'know' that this is a 'neg', we can actually "fold" the spill into
1251 the neg/abs instruction, turning it into an *integer* operation, like this:
1252
1253         xorl 2147483648, [mem+4]     ## 2147483648 = (1 << 31)
1254
1255 you could also use xorb, but xorl is less likely to lead to a partial register
1256 stall.  Here is a contrived testcase:
1257
1258 double a, b, c;
1259 void test(double *P) {
1260   double X = *P;
1261   a = X;
1262   bar();
1263   X = -X;
1264   b = X;
1265   bar();
1266   c = X;
1267 }
1268
1269 //===---------------------------------------------------------------------===//
1270
1271 handling llvm.memory.barrier on pre SSE2 cpus
1272
1273 should generate:
1274 lock ; mov %esp, %esp
1275
1276 //===---------------------------------------------------------------------===//
1277
1278 The generated code on x86 for checking for signed overflow on a multiply the
1279 obvious way is much longer than it needs to be.
1280
1281 int x(int a, int b) {
1282   long long prod = (long long)a*b;
1283   return  prod > 0x7FFFFFFF || prod < (-0x7FFFFFFF-1);
1284 }
1285
1286 See PR2053 for more details.
1287
1288 //===---------------------------------------------------------------------===//
1289
1290 We should investigate using cdq/ctld (effect: edx = sar eax, 31)
1291 more aggressively; it should cost the same as a move+shift on any modern
1292 processor, but it's a lot shorter. Downside is that it puts more
1293 pressure on register allocation because it has fixed operands.
1294
1295 Example:
1296 int abs(int x) {return x < 0 ? -x : x;}
1297
1298 gcc compiles this to the following when using march/mtune=pentium2/3/4/m/etc.:
1299 abs:
1300         movl    4(%esp), %eax
1301         cltd
1302         xorl    %edx, %eax
1303         subl    %edx, %eax
1304         ret
1305
1306 //===---------------------------------------------------------------------===//
1307
1308 Consider:
1309 int test(unsigned long a, unsigned long b) { return -(a < b); }
1310
1311 We currently compile this to:
1312
1313 define i32 @test(i32 %a, i32 %b) nounwind  {
1314         %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
1315         %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
1316         %tmp5 = sub i32 0, %tmp34               ; <i32> [#uses=1]
1317         ret i32 %tmp5
1318 }
1319
1320 and
1321
1322 _test:
1323         movl    8(%esp), %eax
1324         cmpl    %eax, 4(%esp)
1325         setb    %al
1326         movzbl  %al, %eax
1327         negl    %eax
1328         ret
1329
1330 Several deficiencies here.  First, we should instcombine zext+neg into sext:
1331
1332 define i32 @test2(i32 %a, i32 %b) nounwind  {
1333         %tmp3 = icmp ult i32 %a, %b             ; <i1> [#uses=1]
1334         %tmp34 = sext i1 %tmp3 to i32           ; <i32> [#uses=1]
1335         ret i32 %tmp34
1336 }
1337
1338 However, before we can do that, we have to fix the bad codegen that we get for
1339 sext from bool:
1340
1341 _test2:
1342         movl    8(%esp), %eax
1343         cmpl    %eax, 4(%esp)
1344         setb    %al
1345         movzbl  %al, %eax
1346         shll    $31, %eax
1347         sarl    $31, %eax
1348         ret
1349
1350 This code should be at least as good as the code above.  Once this is fixed, we
1351 can optimize this specific case even more to:
1352
1353         movl    8(%esp), %eax
1354         xorl    %ecx, %ecx
1355         cmpl    %eax, 4(%esp)
1356         sbbl    %ecx, %ecx
1357
1358 //===---------------------------------------------------------------------===//
1359
1360 Take the following code (from
1361 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=16541):
1362
1363 extern unsigned char first_one[65536];
1364 int FirstOnet(unsigned long long arg1)
1365 {
1366   if (arg1 >> 48)
1367     return (first_one[arg1 >> 48]);
1368   return 0;
1369 }
1370
1371
1372 The following code is currently generated:
1373 FirstOnet:
1374         movl    8(%esp), %eax
1375         cmpl    $65536, %eax
1376         movl    4(%esp), %ecx
1377         jb      .LBB1_2 # UnifiedReturnBlock
1378 .LBB1_1:        # ifthen
1379         shrl    $16, %eax
1380         movzbl  first_one(%eax), %eax
1381         ret
1382 .LBB1_2:        # UnifiedReturnBlock
1383         xorl    %eax, %eax
1384         ret
1385
1386 There are a few possible improvements here:
1387 1. We should be able to eliminate the dead load into %ecx
1388 2. We could change the "movl 8(%esp), %eax" into
1389    "movzwl 10(%esp), %eax"; this lets us change the cmpl
1390    into a testl, which is shorter, and eliminate the shift.
1391
1392 We could also in theory eliminate the branch by using a conditional
1393 for the address of the load, but that seems unlikely to be worthwhile
1394 in general.
1395
1396 //===---------------------------------------------------------------------===//
1397
1398 We compile this function:
1399
1400 define i32 @foo(i32 %a, i32 %b, i32 %c, i8 zeroext  %d) nounwind  {
1401 entry:
1402         %tmp2 = icmp eq i8 %d, 0                ; <i1> [#uses=1]
1403         br i1 %tmp2, label %bb7, label %bb
1404
1405 bb:             ; preds = %entry
1406         %tmp6 = add i32 %b, %a          ; <i32> [#uses=1]
1407         ret i32 %tmp6
1408
1409 bb7:            ; preds = %entry
1410         %tmp10 = sub i32 %a, %c         ; <i32> [#uses=1]
1411         ret i32 %tmp10
1412 }
1413
1414 to:
1415
1416 _foo:
1417         cmpb    $0, 16(%esp)
1418         movl    12(%esp), %ecx
1419         movl    8(%esp), %eax
1420         movl    4(%esp), %edx
1421         je      LBB1_2  # bb7
1422 LBB1_1: # bb
1423         addl    %edx, %eax
1424         ret
1425 LBB1_2: # bb7
1426         movl    %edx, %eax
1427         subl    %ecx, %eax
1428         ret
1429
1430 The coalescer could coalesce "edx" with "eax" to avoid the movl in LBB1_2
1431 if it commuted the addl in LBB1_1.
1432
1433 //===---------------------------------------------------------------------===//
1434
1435 See rdar://4653682.
1436
1437 From flops:
1438
1439 LBB1_15:        # bb310
1440         cvtss2sd        LCPI1_0, %xmm1
1441         addsd   %xmm1, %xmm0
1442         movsd   176(%esp), %xmm2
1443         mulsd   %xmm0, %xmm2
1444         movapd  %xmm2, %xmm3
1445         mulsd   %xmm3, %xmm3
1446         movapd  %xmm3, %xmm4
1447         mulsd   LCPI1_23, %xmm4
1448         addsd   LCPI1_24, %xmm4
1449         mulsd   %xmm3, %xmm4
1450         addsd   LCPI1_25, %xmm4
1451         mulsd   %xmm3, %xmm4
1452         addsd   LCPI1_26, %xmm4
1453         mulsd   %xmm3, %xmm4
1454         addsd   LCPI1_27, %xmm4
1455         mulsd   %xmm3, %xmm4
1456         addsd   LCPI1_28, %xmm4
1457         mulsd   %xmm3, %xmm4
1458         addsd   %xmm1, %xmm4
1459         mulsd   %xmm2, %xmm4
1460         movsd   152(%esp), %xmm1
1461         addsd   %xmm4, %xmm1
1462         movsd   %xmm1, 152(%esp)
1463         incl    %eax
1464         cmpl    %eax, %esi
1465         jge     LBB1_15 # bb310
1466 LBB1_16:        # bb358.loopexit
1467         movsd   152(%esp), %xmm0
1468         addsd   %xmm0, %xmm0
1469         addsd   LCPI1_22, %xmm0
1470         movsd   %xmm0, 152(%esp)
1471
1472 Rather than spilling the result of the last addsd in the loop, we should have
1473 insert a copy to split the interval (one for the duration of the loop, one
1474 extending to the fall through). The register pressure in the loop isn't high
1475 enough to warrant the spill.
1476
1477 Also check why xmm7 is not used at all in the function.
1478
1479 //===---------------------------------------------------------------------===//
1480
1481 Legalize loses track of the fact that bools are always zero extended when in
1482 memory.  This causes us to compile abort_gzip (from 164.gzip) from:
1483
1484 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
1485 target triple = "i386-apple-darwin8"
1486 @in_exit.4870.b = internal global i1 false              ; <i1*> [#uses=2]
1487 define fastcc void @abort_gzip() noreturn nounwind  {
1488 entry:
1489         %tmp.b.i = load i1* @in_exit.4870.b             ; <i1> [#uses=1]
1490         br i1 %tmp.b.i, label %bb.i, label %bb4.i
1491 bb.i:           ; preds = %entry
1492         tail call void @exit( i32 1 ) noreturn nounwind
1493         unreachable
1494 bb4.i:          ; preds = %entry
1495         store i1 true, i1* @in_exit.4870.b
1496         tail call void @exit( i32 1 ) noreturn nounwind
1497         unreachable
1498 }
1499 declare void @exit(i32) noreturn nounwind
1500
1501 into:
1502
1503 _abort_gzip:
1504         subl    $12, %esp
1505         movb    _in_exit.4870.b, %al
1506         notb    %al
1507         testb   $1, %al
1508         jne     LBB1_2  ## bb4.i
1509 LBB1_1: ## bb.i
1510   ...
1511
1512 //===---------------------------------------------------------------------===//
1513
1514 We compile:
1515
1516 int test(int x, int y) {
1517   return x-y-1;
1518 }
1519
1520 into (-m64):
1521
1522 _test:
1523         decl    %edi
1524         movl    %edi, %eax
1525         subl    %esi, %eax
1526         ret
1527
1528 it would be better to codegen as: x+~y  (notl+addl)
1529
1530 //===---------------------------------------------------------------------===//
1531
1532 This code:
1533
1534 int foo(const char *str,...)
1535 {
1536  __builtin_va_list a; int x;
1537  __builtin_va_start(a,str); x = __builtin_va_arg(a,int); __builtin_va_end(a);
1538  return x;
1539 }
1540
1541 gets compiled into this on x86-64:
1542         subq    $200, %rsp
1543         movaps  %xmm7, 160(%rsp)
1544         movaps  %xmm6, 144(%rsp)
1545         movaps  %xmm5, 128(%rsp)
1546         movaps  %xmm4, 112(%rsp)
1547         movaps  %xmm3, 96(%rsp)
1548         movaps  %xmm2, 80(%rsp)
1549         movaps  %xmm1, 64(%rsp)
1550         movaps  %xmm0, 48(%rsp)
1551         movq    %r9, 40(%rsp)
1552         movq    %r8, 32(%rsp)
1553         movq    %rcx, 24(%rsp)
1554         movq    %rdx, 16(%rsp)
1555         movq    %rsi, 8(%rsp)
1556         leaq    (%rsp), %rax
1557         movq    %rax, 192(%rsp)
1558         leaq    208(%rsp), %rax
1559         movq    %rax, 184(%rsp)
1560         movl    $48, 180(%rsp)
1561         movl    $8, 176(%rsp)
1562         movl    176(%rsp), %eax
1563         cmpl    $47, %eax
1564         jbe     .LBB1_3 # bb
1565 .LBB1_1:        # bb3
1566         movq    184(%rsp), %rcx
1567         leaq    8(%rcx), %rax
1568         movq    %rax, 184(%rsp)
1569 .LBB1_2:        # bb4
1570         movl    (%rcx), %eax
1571         addq    $200, %rsp
1572         ret
1573 .LBB1_3:        # bb
1574         movl    %eax, %ecx
1575         addl    $8, %eax
1576         addq    192(%rsp), %rcx
1577         movl    %eax, 176(%rsp)
1578         jmp     .LBB1_2 # bb4
1579
1580 gcc 4.3 generates:
1581         subq    $96, %rsp
1582 .LCFI0:
1583         leaq    104(%rsp), %rax
1584         movq    %rsi, -80(%rsp)
1585         movl    $8, -120(%rsp)
1586         movq    %rax, -112(%rsp)
1587         leaq    -88(%rsp), %rax
1588         movq    %rax, -104(%rsp)
1589         movl    $8, %eax
1590         cmpl    $48, %eax
1591         jb      .L6
1592         movq    -112(%rsp), %rdx
1593         movl    (%rdx), %eax
1594         addq    $96, %rsp
1595         ret
1596         .p2align 4,,10
1597         .p2align 3
1598 .L6:
1599         mov     %eax, %edx
1600         addq    -104(%rsp), %rdx
1601         addl    $8, %eax
1602         movl    %eax, -120(%rsp)
1603         movl    (%rdx), %eax
1604         addq    $96, %rsp
1605         ret
1606
1607 and it gets compiled into this on x86:
1608         pushl   %ebp
1609         movl    %esp, %ebp
1610         subl    $4, %esp
1611         leal    12(%ebp), %eax
1612         movl    %eax, -4(%ebp)
1613         leal    16(%ebp), %eax
1614         movl    %eax, -4(%ebp)
1615         movl    12(%ebp), %eax
1616         addl    $4, %esp
1617         popl    %ebp
1618         ret
1619
1620 gcc 4.3 generates:
1621         pushl   %ebp
1622         movl    %esp, %ebp
1623         movl    12(%ebp), %eax
1624         popl    %ebp
1625         ret
1626
1627 //===---------------------------------------------------------------------===//
1628
1629 Teach tblgen not to check bitconvert source type in some cases. This allows us
1630 to consolidate the following patterns in X86InstrMMX.td:
1631
1632 def : Pat<(v2i32 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1633                                                   (iPTR 0))))),
1634           (v2i32 (MMX_MOVDQ2Qrr VR128:$src))>;
1635 def : Pat<(v4i16 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1636                                                   (iPTR 0))))),
1637           (v4i16 (MMX_MOVDQ2Qrr VR128:$src))>;
1638 def : Pat<(v8i8 (bitconvert (i64 (vector_extract (v2i64 VR128:$src),
1639                                                   (iPTR 0))))),
1640           (v8i8 (MMX_MOVDQ2Qrr VR128:$src))>;
1641
1642 There are other cases in various td files.
1643
1644 //===---------------------------------------------------------------------===//
1645
1646 Take something like the following on x86-32:
1647 unsigned a(unsigned long long x, unsigned y) {return x % y;}
1648
1649 We currently generate a libcall, but we really shouldn't: the expansion is
1650 shorter and likely faster than the libcall.  The expected code is something
1651 like the following:
1652
1653         movl    12(%ebp), %eax
1654         movl    16(%ebp), %ecx
1655         xorl    %edx, %edx
1656         divl    %ecx
1657         movl    8(%ebp), %eax
1658         divl    %ecx
1659         movl    %edx, %eax
1660         ret
1661
1662 A similar code sequence works for division.
1663
1664 //===---------------------------------------------------------------------===//
1665
1666 These should compile to the same code, but the later codegen's to useless
1667 instructions on X86. This may be a trivial dag combine (GCC PR7061):
1668
1669 struct s1 { unsigned char a, b; };
1670 unsigned long f1(struct s1 x) {
1671     return x.a + x.b;
1672 }
1673 struct s2 { unsigned a: 8, b: 8; };
1674 unsigned long f2(struct s2 x) {
1675     return x.a + x.b;
1676 }
1677
1678 //===---------------------------------------------------------------------===//
1679
1680 We currently compile this:
1681
1682 define i32 @func1(i32 %v1, i32 %v2) nounwind {
1683 entry:
1684   %t = call {i32, i1} @llvm.sadd.with.overflow.i32(i32 %v1, i32 %v2)
1685   %sum = extractvalue {i32, i1} %t, 0
1686   %obit = extractvalue {i32, i1} %t, 1
1687   br i1 %obit, label %overflow, label %normal
1688 normal:
1689   ret i32 %sum
1690 overflow:
1691   call void @llvm.trap()
1692   unreachable
1693 }
1694 declare {i32, i1} @llvm.sadd.with.overflow.i32(i32, i32)
1695 declare void @llvm.trap()
1696
1697 to:
1698
1699 _func1:
1700         movl    4(%esp), %eax
1701         addl    8(%esp), %eax
1702         jo      LBB1_2  ## overflow
1703 LBB1_1: ## normal
1704         ret
1705 LBB1_2: ## overflow
1706         ud2
1707
1708 it would be nice to produce "into" someday.
1709
1710 //===---------------------------------------------------------------------===//
1711
1712 This code:
1713
1714 void vec_mpys1(int y[], const int x[], int scaler) {
1715 int i;
1716 for (i = 0; i < 150; i++)
1717  y[i] += (((long long)scaler * (long long)x[i]) >> 31);
1718 }
1719
1720 Compiles to this loop with GCC 3.x:
1721
1722 .L5:
1723         movl    %ebx, %eax
1724         imull   (%edi,%ecx,4)
1725         shrdl   $31, %edx, %eax
1726         addl    %eax, (%esi,%ecx,4)
1727         incl    %ecx
1728         cmpl    $149, %ecx
1729         jle     .L5
1730
1731 llvm-gcc compiles it to the much uglier:
1732
1733 LBB1_1: ## bb1
1734         movl    24(%esp), %eax
1735         movl    (%eax,%edi,4), %ebx
1736         movl    %ebx, %ebp
1737         imull   %esi, %ebp
1738         movl    %ebx, %eax
1739         mull    %ecx
1740         addl    %ebp, %edx
1741         sarl    $31, %ebx
1742         imull   %ecx, %ebx
1743         addl    %edx, %ebx
1744         shldl   $1, %eax, %ebx
1745         movl    20(%esp), %eax
1746         addl    %ebx, (%eax,%edi,4)
1747         incl    %edi
1748         cmpl    $150, %edi
1749         jne     LBB1_1  ## bb1
1750
1751 The issue is that we hoist the cast of "scaler" to long long outside of the
1752 loop, the value comes into the loop as two values, and
1753 RegsForValue::getCopyFromRegs doesn't know how to put an AssertSext on the
1754 constructed BUILD_PAIR which represents the cast value.
1755
1756 //===---------------------------------------------------------------------===//
1757
1758 Test instructions can be eliminated by using EFLAGS values from arithmetic
1759 instructions. This is currently not done for mul, and, or, xor, neg, shl,
1760 sra, srl, shld, shrd, atomic ops, and others. It is also currently not done
1761 for read-modify-write instructions. It is also current not done if the
1762 OF or CF flags are needed.
1763
1764 The shift operators have the complication that when the shift count is
1765 zero, EFLAGS is not set, so they can only subsume a test instruction if
1766 the shift count is known to be non-zero. Also, using the EFLAGS value
1767 from a shift is apparently very slow on some x86 implementations.
1768
1769 In read-modify-write instructions, the root node in the isel match is
1770 the store, and isel has no way for the use of the EFLAGS result of the
1771 arithmetic to be remapped to the new node.
1772
1773 Add and subtract instructions set OF on signed overflow and CF on unsiged
1774 overflow, while test instructions always clear OF and CF. In order to
1775 replace a test with an add or subtract in a situation where OF or CF is
1776 needed, codegen must be able to prove that the operation cannot see
1777 signed or unsigned overflow, respectively.
1778
1779 //===---------------------------------------------------------------------===//
1780
1781 memcpy/memmove do not lower to SSE copies when possible.  A silly example is:
1782 define <16 x float> @foo(<16 x float> %A) nounwind {
1783         %tmp = alloca <16 x float>, align 16
1784         %tmp2 = alloca <16 x float>, align 16
1785         store <16 x float> %A, <16 x float>* %tmp
1786         %s = bitcast <16 x float>* %tmp to i8*
1787         %s2 = bitcast <16 x float>* %tmp2 to i8*
1788         call void @llvm.memcpy.i64(i8* %s, i8* %s2, i64 64, i32 16)
1789         %R = load <16 x float>* %tmp2
1790         ret <16 x float> %R
1791 }
1792
1793 declare void @llvm.memcpy.i64(i8* nocapture, i8* nocapture, i64, i32) nounwind
1794
1795 which compiles to:
1796
1797 _foo:
1798         subl    $140, %esp
1799         movaps  %xmm3, 112(%esp)
1800         movaps  %xmm2, 96(%esp)
1801         movaps  %xmm1, 80(%esp)
1802         movaps  %xmm0, 64(%esp)
1803         movl    60(%esp), %eax
1804         movl    %eax, 124(%esp)
1805         movl    56(%esp), %eax
1806         movl    %eax, 120(%esp)
1807         movl    52(%esp), %eax
1808         <many many more 32-bit copies>
1809         movaps  (%esp), %xmm0
1810         movaps  16(%esp), %xmm1
1811         movaps  32(%esp), %xmm2
1812         movaps  48(%esp), %xmm3
1813         addl    $140, %esp
1814         ret
1815
1816 On Nehalem, it may even be cheaper to just use movups when unaligned than to
1817 fall back to lower-granularity chunks.
1818
1819 //===---------------------------------------------------------------------===//
1820
1821 Implement processor-specific optimizations for parity with GCC on these
1822 processors.  GCC does two optimizations:
1823
1824 1. ix86_pad_returns inserts a noop before ret instructions if immediately
1825    preceeded by a conditional branch or is the target of a jump.
1826 2. ix86_avoid_jump_misspredicts inserts noops in cases where a 16-byte block of
1827    code contains more than 3 branches.
1828
1829 The first one is done for all AMDs, Core2, and "Generic"
1830 The second one is done for: Atom, Pentium Pro, all AMDs, Pentium 4, Nocona,
1831   Core 2, and "Generic"
1832
1833 //===---------------------------------------------------------------------===//
1834
1835 Testcase:
1836 int a(int x) { return (x & 127) > 31; }
1837
1838 Current output:
1839         movl    4(%esp), %eax
1840         andl    $127, %eax
1841         cmpl    $31, %eax
1842         seta    %al
1843         movzbl  %al, %eax
1844         ret
1845
1846 Ideal output:
1847         xorl    %eax, %eax
1848         testl   $96, 4(%esp)
1849         setne   %al
1850         ret
1851
1852 This should definitely be done in instcombine, canonicalizing the range
1853 condition into a != condition.  We get this IR:
1854
1855 define i32 @a(i32 %x) nounwind readnone {
1856 entry:
1857         %0 = and i32 %x, 127            ; <i32> [#uses=1]
1858         %1 = icmp ugt i32 %0, 31                ; <i1> [#uses=1]
1859         %2 = zext i1 %1 to i32          ; <i32> [#uses=1]
1860         ret i32 %2
1861 }
1862
1863 Instcombine prefers to strength reduce relational comparisons to equality
1864 comparisons when possible, this should be another case of that.  This could
1865 be handled pretty easily in InstCombiner::visitICmpInstWithInstAndIntCst, but it
1866 looks like InstCombiner::visitICmpInstWithInstAndIntCst should really already
1867 be redesigned to use ComputeMaskedBits and friends.
1868
1869
1870 //===---------------------------------------------------------------------===//
1871 Testcase:
1872 int x(int a) { return (a&0xf0)>>4; }
1873
1874 Current output:
1875         movl    4(%esp), %eax
1876         shrl    $4, %eax
1877         andl    $15, %eax
1878         ret
1879
1880 Ideal output:
1881         movzbl  4(%esp), %eax
1882         shrl    $4, %eax
1883         ret
1884
1885 //===---------------------------------------------------------------------===//
1886
1887 Testcase:
1888 int x(int a) { return (a & 0x80) ? 0x100 : 0; }
1889 int y(int a) { return (a & 0x80) *2; }
1890
1891 Current:
1892         testl   $128, 4(%esp)
1893         setne   %al
1894         movzbl  %al, %eax
1895         shll    $8, %eax
1896         ret
1897
1898 Better:
1899         movl    4(%esp), %eax
1900         addl    %eax, %eax
1901         andl    $256, %eax
1902         ret
1903
1904 This is another general instcombine transformation that is profitable on all
1905 targets.  In LLVM IR, these functions look like this:
1906
1907 define i32 @x(i32 %a) nounwind readnone {
1908 entry:
1909         %0 = and i32 %a, 128
1910         %1 = icmp eq i32 %0, 0
1911         %iftmp.0.0 = select i1 %1, i32 0, i32 256
1912         ret i32 %iftmp.0.0
1913 }
1914
1915 define i32 @y(i32 %a) nounwind readnone {
1916 entry:
1917         %0 = shl i32 %a, 1
1918         %1 = and i32 %0, 256
1919         ret i32 %1
1920 }
1921
1922 Replacing an icmp+select with a shift should always be considered profitable in
1923 instcombine.
1924
1925 //===---------------------------------------------------------------------===//
1926
1927 Re-implement atomic builtins __sync_add_and_fetch() and __sync_sub_and_fetch
1928 properly.
1929
1930 When the return value is not used (i.e. only care about the value in the
1931 memory), x86 does not have to use add to implement these. Instead, it can use
1932 add, sub, inc, dec instructions with the "lock" prefix.
1933
1934 This is currently implemented using a bit of instruction selection trick. The
1935 issue is the target independent pattern produces one output and a chain and we
1936 want to map it into one that just output a chain. The current trick is to select
1937 it into a MERGE_VALUES with the first definition being an implicit_def. The
1938 proper solution is to add new ISD opcodes for the no-output variant. DAG
1939 combiner can then transform the node before it gets to target node selection.
1940
1941 Problem #2 is we are adding a whole bunch of x86 atomic instructions when in
1942 fact these instructions are identical to the non-lock versions. We need a way to
1943 add target specific information to target nodes and have this information
1944 carried over to machine instructions. Asm printer (or JIT) can use this
1945 information to add the "lock" prefix.
1946
1947 //===---------------------------------------------------------------------===//