X-Git-Url: http://demsky.eecs.uci.edu/git/?a=blobdiff_plain;f=lib%2FTarget%2FX86%2FREADME-X86-64.txt;h=bcfdf0bc56b28ca4853244aa3669c96c8c968e09;hb=7f7bf0da6a1126c77590a433e4a133cd476ead48;hp=efe3b73da5b1d9f4dbbc484c7500d0fec63d98d9;hpb=22eedf4eec990f90988af819c034787019440cbd;p=oota-llvm.git

diff --git a/lib/Target/X86/README-X86-64.txt b/lib/Target/X86/README-X86-64.txt
index efe3b73da5b..bcfdf0bc56b 100644
--- a/lib/Target/X86/README-X86-64.txt
+++ b/lib/Target/X86/README-X86-64.txt
@@ -1,35 +1,5 @@
 //===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===//
-
-Implement different PIC models? Right now we only support Mac OS X with small
-PIC code model.
-
-//===---------------------------------------------------------------------===//
-
-Make use of "Red Zone".
-
-//===---------------------------------------------------------------------===//
-
-Implement __int128 and long double support.
-
-//===---------------------------------------------------------------------===//
-
-For this:
-
-extern void xx(void);
-void bar(void) {
-  xx();
-}
-
-gcc compiles to:
-
-.globl _bar
-_bar:
-	jmp	_xx
-
-We need to do the tailcall optimization as well.
-
-//===---------------------------------------------------------------------===//
 
 AMD64 Optimization Manual 8.2 has some nice information about optimizing integer
 multiplication by a constant. How much of it applies to Intel's X86-64
 implementation? There are definite trade-offs to consider: latency vs. register
@@ -66,188 +36,149 @@ _conv:
 	cmovb	%rcx, %rax
 	ret
 
-Seems like the jb branch has high likelyhood of being taken. It would have
+Seems like the jb branch has high likelihood of being taken. It would have
 saved a few instructions.
 
 //===---------------------------------------------------------------------===//
 
-Poor codegen:
-
-int X[2];
-int b;
-void test(void) {
-  memset(X, b, 2*sizeof(X[0]));
-}
-
-llc:
-	movq	_b@GOTPCREL(%rip), %rax
-	movzbq	(%rax), %rax
-	movq	%rax, %rcx
-	shlq	$8, %rcx
-	orq	%rax, %rcx
-	movq	%rcx, %rax
-	shlq	$16, %rax
-	orq	%rcx, %rax
-	movq	%rax, %rcx
-	shlq	$32, %rcx
-	movq	_X@GOTPCREL(%rip), %rdx
-	orq	%rax, %rcx
-	movq	%rcx, (%rdx)
-	ret
+It's not possible to reference AH, BH, CH, and DH registers in an instruction
+requiring a REX prefix. However, divb and mulb both produce results in AH. If
+isel emits a CopyFromReg which gets turned into a movb, that movb can be
+allocated to r8b - r15b, which conflicts with the REX-prefix restriction.
 
-gcc:
-	movq	_b@GOTPCREL(%rip), %rax
-	movabsq	$72340172838076673, %rdx
-	movzbq	(%rax), %rax
-	imulq	%rdx, %rax
-	movq	_X@GOTPCREL(%rip), %rdx
-	movq	%rax, (%rdx)
-	ret
+To get around this, isel emits a CopyFromReg from AX and then right-shifts it
+down by 8 and truncates it. It's not pretty, but it works. We need some
+register allocation magic to make the hack go away (e.g. putting additional
+constraints on the result of the movb).
 
 //===---------------------------------------------------------------------===//
 
-Vararg function prologue can be further optimized. Currently all XMM registers
-are stored into the register save area. Most of them can be eliminated since
-the upper bound of the number of XMM registers used is passed in %al.
-gcc produces something like the following:
-
-	movzbl	%al, %edx
-	leaq	0(,%rdx,4), %rax
-	leaq	4+L2(%rip), %rdx
-	leaq	239(%rsp), %rax
-	jmp	*%rdx
-	movaps	%xmm7, -15(%rax)
-	movaps	%xmm6, -31(%rax)
-	movaps	%xmm5, -47(%rax)
-	movaps	%xmm4, -63(%rax)
-	movaps	%xmm3, -79(%rax)
-	movaps	%xmm2, -95(%rax)
-	movaps	%xmm1, -111(%rax)
-	movaps	%xmm0, -127(%rax)
-L2:
-
-It jumps over the movaps that do not need to be stored. Hard to see this being
-significant as it added 5 instructions (including an indirect branch) to avoid
-executing 0 to 8 stores in the function prologue.
-
-Perhaps we can optimize for the common case where no XMM registers are used for
-parameter passing, i.e. if %al == 0, jump over all stores. Or in the case of a
-leaf function where we can determine that no XMM input parameter is needed,
-avoid emitting the stores at all.
+The x86-64 ABI for hidden-argument struct returns requires that the
+incoming value of %rdi be copied into %rax by the callee upon return.
 
-//===---------------------------------------------------------------------===//
+The idea is that it saves callers from having to remember this value,
+which would often require a callee-saved register. Callees usually
+need to keep this value live for most of their body anyway, so it
+doesn't add a significant burden on them.
+
+We currently implement this in codegen; however, this is suboptimal
+because it makes it quite awkward to implement the optimization for
+callers.
 
-AMD64 has a complex calling convention for aggregate passing by value:
-
-1. If the size of an object is larger than two eightbytes, or in C++, is a non-
-   POD structure or union type, or contains unaligned fields, it has class
-   MEMORY.
-2. Both eightbytes get initialized to class NO_CLASS.
-3. Each field of an object is classified recursively so that always two fields
-   are considered. The resulting class is calculated according to the classes
-   of the fields in the eightbyte:
-   (a) If both classes are equal, this is the resulting class.
-   (b) If one of the classes is NO_CLASS, the resulting class is the other
-       class.
-   (c) If one of the classes is MEMORY, the result is the MEMORY class.
-   (d) If one of the classes is INTEGER, the result is the INTEGER.
-   (e) If one of the classes is X87, X87UP, COMPLEX_X87 class, MEMORY is used as
-       class.
-   (f) Otherwise class SSE is used.
-4. Then a post merger cleanup is done:
-   (a) If one of the classes is MEMORY, the whole argument is passed in memory.
-   (b) If SSEUP is not preceded by SSE, it is converted to SSE.
-
-Currently the llvm frontend does not handle this correctly.
-
-Problem 1:
-   typedef struct { int i; double d; } QuadWordS;
-It is currently passed in two i64 integer registers. However, a gcc-compiled
-callee expects the second element 'd' to be passed in XMM0.
-
-Problem 2:
-   typedef struct { int32_t i; float j; double d; } QuadWordS;
-The size of the first two fields == i64 so they will be combined and passed in
-an integer register, RDI. The third field is still passed in XMM0.
-
-Problem 3:
-   typedef struct { int64_t i; int8_t j; int64_t d; } S;
-   void test(S s)
-The size of this aggregate is greater than two i64 so it should be passed in
-memory. Currently llvm breaks this down and passes it in three integer
-registers.
-
-Problem 4:
-Taking problem 3 one step further, where a function expects an aggregate value
-in memory followed by more parameter(s) passed in register(s):
-   void test(S s, int b)
-
-LLVM IR does not allow parameter passing by aggregates, therefore it must break
-the aggregate value (in problems 3 and 4) into a number of scalar values:
-   void %test(long %s.i, byte %s.j, long %s.d);
-
-However, if the backend were to lower this code literally it would pass the 3
-values in integer registers. To force it to be passed in memory, the frontend
-should change the function signature to:
-   void %test(long %undef1, long %undef2, long %undef3, long %undef4,
-              long %undef5, long %undef6,
-              long %s.i, byte %s.j, long %s.d);
-And the callee would look something like this:
-   call void %test( undef, undef, undef, undef, undef, undef,
-                    %tmp.s.i, %tmp.s.j, %tmp.s.d );
-The first 6 undef parameters would exhaust the 6 integer registers used for
-parameter passing. The following three integer values would then be forced into
-memory.
-
-For problem 4, the parameter 'd' would be moved to the front of the parameter
-list so it will be passed in a register:
-   void %test(int %d,
-              long %undef1, long %undef2, long %undef3, long %undef4,
-              long %undef5, long %undef6,
-              long %s.i, byte %s.j, long %s.d);
+A better implementation would be to relax the LLVM IR rules for sret
+arguments to allow a function with an sret argument to have a non-void
+return type, and to have the front-end set up the sret argument value
+as the return value of the function. The front-end could then emit
+uses of the returned struct value in terms of the function's lowered
+return value, and it would free non-C frontends from a complication
+only required by a C-based ABI.
 
 //===---------------------------------------------------------------------===//
 
-Right now the asm printer assumes GlobalAddresses are accessed via RIP-relative
-addressing. Therefore, it is not possible to generate this:
-        movabsq $__ZTV10polynomialIdE+16, %rax
+We get a redundant zero extension for code like this:
 
-That is ok for now since we currently only support the small code model. So
-the above is selected as
-        leaq __ZTV10polynomialIdE+16(%rip), %rax
+int mask[1000];
+int foo(unsigned x) {
+  if (x < 10)
+    x = x * 45;
+  else
+    x = x * 78;
+  return mask[x];
+}
 
-This is probably slightly slower but is much shorter than movabsq. However, if
-we were to support medium or larger code models, we would need to use the
-movabs instruction. We should probably introduce something like AbsoluteAddress
-to distinguish it from GlobalAddress so the asm printer and JIT code emitter
-can do the right thing.
+_foo:
+LBB1_0:	## entry
+	cmpl	$9, %edi
+	jbe	LBB1_3	## bb
+LBB1_1:	## bb1
+	imull	$78, %edi, %eax
+LBB1_2:	## bb2
+	movl	%eax, %eax	<----
+	movq	_mask@GOTPCREL(%rip), %rcx
+	movl	(%rcx,%rax,4), %eax
+	ret
+LBB1_3:	## bb
+	imull	$45, %edi, %eax
+	jmp	LBB1_2	## bb2
+
+Before regalloc, we have:
+
+	%reg1025 = IMUL32rri8 %reg1024, 45, %EFLAGS
+	JMP mbb
+	Successors according to CFG: 0x203afb0 (#3)
+
+bb1: 0x203af60, LLVM BB @0x1e02310, ID#2:
+	Predecessors according to CFG: 0x203aec0 (#0)
+	%reg1026 = IMUL32rri8 %reg1024, 78, %EFLAGS
+	Successors according to CFG: 0x203afb0 (#3)
+
+bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3:
+	Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2)
+	%reg1027 = PHI %reg1025, mbb,
+	           %reg1026, mbb
+	%reg1029 = MOVZX64rr32 %reg1027
+
+so we'd have to know that IMUL32rri8 leaves the high word zero extended and be
+able to recognize the zero extend. This could also presumably be implemented
+if we have whole-function selectiondags.
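
As a rough C-level rendering of the pattern above (an editor's sketch, not part
of the original note; the helper name mask_index is made up): the MOVZX64rr32
corresponds to the final cast below, and since a 32-bit operation such as imull
already clears bits 63:32 of its destination register, that cast needs no extra
code when every predecessor block ends in a 32-bit multiply.

unsigned long mask_index(unsigned x) {
  /* each arm is a 32-bit imull, which already zero-extends into the
     64-bit register on x86-64 */
  unsigned scaled = x < 10 ? x * 45u : x * 78u;
  /* this widening cast is the "movl %eax, %eax" the backend emits */
  return (unsigned long)scaled;
}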
 //===---------------------------------------------------------------------===//
 
-It's not possible to reference AH, BH, CH, and DH registers in an instruction
-requiring a REX prefix. However, divb and mulb both produce results in AH. If
-isel emits a CopyFromReg which gets turned into a movb, that movb can be
-allocated to r8b - r15b, which conflicts with the REX-prefix restriction.
+Take the following code
+(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):
+extern unsigned long table[];
+unsigned long foo(unsigned char *p) {
+  unsigned long tag = *p;
+  return table[tag >> 4] + table[tag & 0xf];
+}
 
-To get around this, isel emits a CopyFromReg from AX and then right-shifts it
-down by 8 and truncates it. It's not pretty, but it works. We need some
-register allocation magic to make the hack go away (e.g. putting additional
-constraints on the result of the movb).
+Current code generated:
+	movzbl	(%rdi), %eax
+	movq	%rax, %rcx
+	andq	$240, %rcx
+	shrq	%rcx
+	andq	$15, %rax
+	movq	table(,%rax,8), %rax
+	addq	table(%rcx), %rax
+	ret
 
-//===---------------------------------------------------------------------===//
+Issues:
+1. First movq should be movl; saves a byte.
+2. Both andq's should be andl; saves another two bytes. I think this was
+   implemented at one point, but subsequently regressed.
+3. shrq should be shrl; saves another byte.
+4. The first andq can be completely eliminated by using a slightly more
+   expensive addressing mode.
 
-This function:
-double a(double b) {return (unsigned)b;}
-compiles to this code:
+//===---------------------------------------------------------------------===//
 
-_a:
-	subq	$8, %rsp
-	cvttsd2siq	%xmm0, %rax
-	movl	%eax, %eax
-	cvtsi2sdq	%rax, %xmm0
-	addq	$8, %rsp
-	ret
+Consider the following (contrived testcase, but contains common factors):
+
+#include <stdarg.h>
+int test(int x, ...) {
+  int sum = 0, i;
+  va_list l;
+  va_start(l, x);
+  for (i = 0; i < x; i++)
+    sum += va_arg(l, int);
+  va_end(l);
+  return sum;
+}
 
-note the dead rsp adjustments.
+Testcase given in C because fixing it will likely involve changing the IR
+generated for it. The primary issue with the result is that it doesn't do any
+of the optimizations which are possible if we know the address of a va_list
+in the current function is never taken:
+1. We shouldn't spill the XMM registers because we only call va_arg with "int".
+2. It would be nice if we could scalarrepl the va_list.
+3. Probably overkill, but it'd be cool if we could peel off the first five
+   iterations of the loop.
+
+Other optimizations involving functions which use va_arg on floats and don't
+have the address of a va_list taken:
+1. Conversely to the above, we shouldn't spill general registers if we only
+   call va_arg on "double".
+2. If we know nothing more than 64 bits wide is read from the XMM registers,
+   we can change the spilling code to reduce the amount of stack used by half.
 
 //===---------------------------------------------------------------------===//
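
As an illustrative counterpart to the last list above (an editor's sketch, not
from the original note; the function name test_fp is made up), a variadic
function that only ever reads doubles is the case where, by the same reasoning,
the general-purpose register stores in the prologue are the wasted ones:

#include <stdarg.h>

double test_fp(int x, ...) {
  double sum = 0.0;
  int i;
  va_list l;
  va_start(l, x);
  for (i = 0; i < x; i++)
    sum += va_arg(l, double);  /* only "double" is ever read from the va_list */
  va_end(l);
  return sum;
}

The same address-not-taken condition applies here as in the "int" case above.

//===---------------------------------------------------------------------===//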