From fd468043d4d87da49d717d7747dba9f21bf13ed7 Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Fri, 23 Feb 2018 11:35:10 -0800
Subject: x86: avoid per-cpu system call trampoline

The per-cpu system call trampoline was a clever trick, and allows us to
have percpu data even before swapgs is done by just doing %rip-relative
addressing. And that was important, because syscall doesn't have a
kernel stack, so we needed that percpu data very very early, just to get
a temporary register to switch the page tables around.

However, it turns out to be unnecessary. Because we actually have a
temporary register that we can use: %r11 is destroyed by the 'syscall'
instruction anyway.

Ok, technically it contains the user mode flags register, but we *have*
that information anyway: it's still in %rflags, we've just masked off a
few unimportant bits. We'll destroy the rest too when we do the "and"
of the CR3 value, but who cares? It's a system call.

Btw, there are a few bits in eflags that might matter to user space: DF
and AC. Right now this clears them, but that is fixable by just
changing the MSR_SYSCALL_MASK value to not include them, and clearing
them by hand the way we do for all other kernel entry points anyway.

So the only _real_ flags we'd destroy are IF and the arithmetic flags
that get trampled on by the arithmetic instructions that are part of the
%cr3 reload logic.

However, if we really end up caring, we can save off even those: we'd
take advantage of the fact that %rcx - which contains the returning IP
of the system call - also has 8 bits free.

Why 8? Even with 5-level paging, we only have 57 bits of virtual address
space, and the high address space is for the kernel (and vsyscall, but
we'd just disable native vsyscall). So the %rip value saved in %rcx can
have only 56 valid bits, which means that we have 8 bits free.

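As a rough illustration of the bit budget described above, the packing
can be modeled in plain C. This is only a sketch: pack_flags() and
unpack_flags() are hypothetical names, not kernel functions, and the
unpack step is one possible way to undo the packing. It relies on bit 3
of the low flags byte being a reserved, always-zero bit, so bit 11 (OF)
can be parked there.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: fold the saved flags (%r11) into the 8 free
 * bits of the saved user %rip (%rcx).
 */
static uint64_t pack_flags(uint64_t rcx, uint64_t r11)
{
	uint64_t packed;
	uint32_t hi;

	/* A user %rip has at most 56 valid bits, so the top 8 are zero. */
	assert((rcx >> 56) == 0);

	packed = rcx << 8;		/* make room in the low byte */
	packed |= r11 & 0xff;		/* low flags byte: CF, PF, AF, ZF, SF */
	hi = (uint32_t)r11;
	hi >>= 8;			/* original bit 11 (OF) is now bit 3 */
	hi &= 8;			/* keep only that bit */
	packed |= hi;			/* park OF in reserved bit 3 */
	return packed;
}

/* Hypothetical inverse, for when user state is saved on the stack. */
static void unpack_flags(uint64_t packed, uint64_t *rip, uint64_t *flags)
{
	uint64_t low = packed & 0xff;

	*rip = packed >> 8;
	/* Move bit 3 back up to bit 11; all other bits stay in place. */
	*flags = (low & ~(uint64_t)8) | ((low & 8) << 8);
}
```

In this model only the low flags byte and bit 11 survive the round
trip. For example, packing %rcx = 0x00007fffdeadbeef with flags 0xa93
and unpacking again returns the same %rip and flags 0x893.
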
So *if* we care about IF and the arithmetic flags being saved over a
system call, we'd do:

	shlq	$8,%rcx
	movb	%r11b,%cl
	shrl	$8,%r11d
	andl	$8,%r11d
	orb	%r11b,%cl

to save those bits off before we then use %r11 as a temporary register
(we'd obviously need to then undo that as we save the user space state
on the stack).

Signed-off-by: Linus Torvalds
---
 arch/x86/include/asm/cpu_entry_area.h | 2 --
 1 file changed, 2 deletions(-)

(limited to 'arch/x86/include/asm/cpu_entry_area.h')

diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index 4a7884b8dca5..29c706415443 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -30,8 +30,6 @@ struct cpu_entry_area {
 	 */
 	struct tss_struct tss;
 
-	char entry_trampoline[PAGE_SIZE];
-
 #ifdef CONFIG_X86_64
 	/*
 	 * Exception stacks used for IST entries.
-- 
cgit v1.2.3