//===---------------------------------------------------------------------===//
// Random ideas for the ARM backend (Thumb specific).
//===---------------------------------------------------------------------===//

* Add support for compiling functions in both ARM and Thumb mode, then taking
  the smallest.

* Add support for compiling individual basic blocks in thumb mode, when in a
  larger ARM function. This can be used for presumed cold code, like paths
  to abort (failure path of asserts), EH handling code, etc.

* Thumb doesn't have normal pre/post increment addressing modes, but you can
  load/store 32-bit integers with pre/postinc by using load/store multiple
  instrs with a single register.
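
  For instance (a sketch), a single-register load/store multiple with
  writeback behaves like a post-increment word access:

        ldmia r0!, {r1}         @ r1 = *r0, then r0 += 4
        stmia r0!, {r1}         @ *r0 = r1, then r0 += 4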

* Make better use of high registers r8, r10, r11, r12 (ip). Some variants of add
  and cmp instructions can use high registers. Also, we can use them as
  temporaries to spill values into.
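
  For instance (a sketch), Thumb1 already provides hi-register forms such as:

        add r0, r8              @ ADD (register), high-register variant
        cmp r0, r8              @ CMP (register), high-register variant
        mov r10, r4             @ park a value in r10 instead of spilling...
        mov r4, r10             @ ...and bring it back without touching memory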

* In thumb mode, short, byte, and bool preferred alignments are currently set
  to 4 to accommodate an ISA restriction (i.e. for add sp, #imm, the immediate
  must be a multiple of 4; see the example below).
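
  For example, the Thumb1 SP adjustment encodes its immediate scaled by 4:

        add sp, #8              @ encodable (imm7 field = 2)
        add sp, #6              @ not encodable: must be a multiple of 4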

//===---------------------------------------------------------------------===//

Potential jumptable improvements:

* If we know function size is less than (1 << 16) * 2 bytes, we can use 16-bit
  jumptable entries (e.g. (L1 - L2) >> 1), or even smaller entries if the
  function is even smaller. This also applies to ARM. (A sketch of 16-bit
  entries appears at the end of this entry.)

* Thumb jumptable codegen can be improved given some help from the assembler.
  This is what we generate right now:

        .set PCRELV0, (LJTI1_0_0-(LPCRELL0+4))
LPCRELL0:
        mov r1, #PCRELV0
        add r1, pc
        ldr r0, [r0, r1]
        mov pc, r0
        .align 2
LJTI1_0_0:
        .long LBB1_3
        ...

  Note there is another pc-relative add that we can take advantage of:
        add r1, pc, #imm_8 * 4

  We should be able to generate:

LPCRELL0:
        add r1, LJTI1_0_0
        ldr r0, [r0, r1]
        mov pc, r0
        .align 2
LJTI1_0_0:
        .long LBB1_3

  if the assembler can translate the add to:
        add r1, pc, #((LJTI1_0_0-(LPCRELL0+4))&0xfffffffc)

  Note the assembler also does something similar for constpool loads:
LPCRELL0:
        ldr r0, LCPI1_0
  =>
        ldr r0, pc, #((LCPI1_0-(LPCRELL0+4))&0xfffffffc)
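
  A sketch of the 16-bit-entry idea from the first bullet (label names are
  hypothetical, and assembler support for the shifted label difference is
  assumed):

        .align 1
LJTI1_0_0:
        .short (LBB1_3-LJTI1_0_0) >> 1  @ halfword-scaled 16-bit entry
        .short (LBB1_7-LJTI1_0_0) >> 1
        ...

  At runtime the entry would be loaded with ldrh, scaled back up by 2, and
  added to the table base before the indirect branch.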

//===---------------------------------------------------------------------===//

We compile the following:

define i16 @func_entry_2E_ce(i32 %i) {
  switch i32 %i, label %bb12.exitStub [
       i32 0, label %bb4.exitStub
       i32 1, label %bb9.exitStub
       i32 2, label %bb4.exitStub
       i32 3, label %bb4.exitStub
       i32 7, label %bb9.exitStub
       i32 8, label %bb.exitStub
       i32 9, label %bb9.exitStub
  ]
bb12.exitStub:
  ret i16 0
bb4.exitStub:
  ret i16 1
bb9.exitStub:
  ret i16 2
bb.exitStub:
  ret i16 3
}

into:

_func_entry_2E_ce:
        mov r2, #1
        lsl r2, r0
        cmp r0, #9
        bhi LBB1_4 @bb12.exitStub
LBB1_1: @newFuncRoot
        mov r1, #13
        tst r2, r1
        bne LBB1_5 @bb4.exitStub
LBB1_2: @newFuncRoot
        ldr r1, LCPI1_0
        tst r2, r1
        bne LBB1_6 @bb9.exitStub
LBB1_3: @newFuncRoot
        mov r1, #1
        lsl r1, r1, #8
        tst r2, r1
        bne LBB1_7 @bb.exitStub
LBB1_4: @bb12.exitStub
        mov r0, #0
        bx lr
LBB1_5: @bb4.exitStub
        mov r0, #1
        bx lr
LBB1_6: @bb9.exitStub
        mov r0, #2
        bx lr
LBB1_7: @bb.exitStub
        mov r0, #3
        bx lr
LBB1_8:
        .align 2
LCPI1_0:
        .long 642

gcc compiles to:

        cmp r0, #9
        @ lr needed for prologue
        bhi L2
        ldr r3, L11
        mov r2, #1
        mov r1, r2, asl r0
        ands r0, r3, r2, asl r0
        movne r0, #2
        bxne lr
        tst r1, #13
        beq L9
L3:
        mov r0, r2
        bx lr
L9:
        tst r1, #256
        movne r0, #3
        bxne lr
L2:
        mov r0, #0
        bx lr
L12:
        .align 2
L11:
        .long 642

GCC is doing a few clever things here:

1. It is predicating one of the returns. This isn't a clear win, though: in
   cases where that return isn't taken, it replaces one condbranch with two
   'ne'-predicated instructions.
2. It is sinking the shift of "1 << i" into the tst, and using ands instead of
   tst. This will probably require whole-function isel.
3. To test bit 8, GCC emits:
        tst r1, #256
   while we emit:
        mov r1, #1
        lsl r1, r1, #8
        tst r2, r1

//===---------------------------------------------------------------------===//

When spilling in thumb mode and the sp offset is too large to fit in the ldr /
str offset field, we load the offset from a constpool entry and add it to sp:

        ldr r2, LCPI
        add r2, sp
        ldr r2, [r2]

These instructions preserve the condition codes, which is important if the
spill is between a cmp and a bcc instruction. However, we can use the
(potentially) cheaper sequence if we know it's ok to clobber the condition
register:

        add r2, sp, #255 * 4
        add r2, #132
        ldr r2, [r2, #7 * 4]
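
(Here the first add uses the largest SP-relative immediate Thumb1 can encode,
255 * 4 = 1020 bytes; together with the second add and the scaled load offset
the sequence reaches 1020 + 132 + 7 * 4 = 1180 bytes. The exact split is
illustrative.)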

This is especially bad when dynamic alloca is used: all fixed-size stack
objects are then referenced off the frame pointer with negative offsets. See
oggenc for an example.

//===---------------------------------------------------------------------===//

Poor codegen in test/CodeGen/ARM/select.ll f7:

        ldr r5, LCPI1_0
LPC0:
        add r5, pc
        ldr r6, LCPI1_1
        ldr r2, LCPI1_2
        mov r3, r6
        mov lr, pc
        bx r5

//===---------------------------------------------------------------------===//

Make the register allocator / spiller smarter so we can re-materialize
"mov r, imm", etc. Almost all Thumb instructions clobber the condition codes.
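
For example (a sketch of the hazard; labels are illustrative): in Thumb1 the
natural rematerialization of a constant is a flag-setting move, so it cannot
simply be placed between a compare and its branch:

        cmp  r1, r2
        movs r0, #5             @ rematerialized constant clobbers CPSR...
        blt  LBB0_2             @ ...so this branch now tests the wrong flags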

//===---------------------------------------------------------------------===//

Thumb load / store address mode offsets are scaled. The values kept in the
instruction operands are pre-scale values. This probably ought to be changed
to avoid extra work when we convert Thumb2 instructions to Thumb1 instructions.
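
For example, the Thumb1 hardware encodings scale the immediate by the access
size:

        ldr  r0, [r1, #4]       @ imm5 field holds 1 (word offsets scaled by 4)
        ldrh r0, [r1, #4]       @ imm5 field holds 2 (halfword offsets scaled by 2)
        ldrb r0, [r1, #4]       @ imm5 field holds 4 (byte offsets are unscaled)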

//===---------------------------------------------------------------------===//

We need to make (some of the) Thumb1 instructions predicable. That will allow
shrinking of predicated Thumb2 instructions. To allow this, we need to be able
to toggle the 's' bit since they do not set CPSR when they are inside IT blocks.
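
For example (a sketch): inside an IT block the 16-bit MOV immediate encoding
does not set CPSR (outside one it does), so the 32-bit instruction below could
be shrunk if we can model that behavior:

        it    eq
        moveq r0, #1            @ currently 32-bit; could use the 16-bit
                                @ encoding inside the IT block without
                                @ setting flags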

//===---------------------------------------------------------------------===//

Make use of the hi register variants of cmp: tCMPhir / tCMPZhir.

//===---------------------------------------------------------------------===//

Thumb1 immediate fields sometimes keep pre-scaled values. See
ThumbRegisterInfo::eliminateFrameIndex. This is inconsistent with ARM and
Thumb2.

//===---------------------------------------------------------------------===//

Rather than having tBR_JTr print a ".align 2" and having the constant island
pass pad it, add a target-specific ALIGN instruction instead. That way,
getInstSizeInBytes won't have to over-estimate. It can also be used by the
loop alignment pass.

//===---------------------------------------------------------------------===//

We generate conditional code for icmp when we don't need to. This code:

int foo(int s) {
    return s == 1;
}

produces:

foo:
        cmp r0, #1
        mov.w r0, #0
        it eq
        moveq r0, #1
        bx lr

when it could use subs + adcs. This is GCC PR46975.
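
A branch-free sketch of that idea (illustrative; the exact sequence discussed
in the PR may differ):

        subs r0, r0, #1         @ r0 = s - 1
        rsbs r1, r0, #0         @ r1 = -(s - 1); carry set iff s - 1 == 0
        adcs r0, r0, r1         @ r0 = (s - 1) + -(s - 1) + carry = (s == 1)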