From b956575bed91ecfb136a8300742ecbbf451471ab Mon Sep 17 00:00:00 2001
From: Andy Lutomirski
Date: Mon, 9 Oct 2017 09:50:49 -0700
Subject: x86/mm: Flush more aggressively in lazy TLB mode

Since commit:

  94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread (including the idle thread), the kernel keeps using the last user mm's page tables without attempting to maintain user TLB coherence at all.  From a pure semantic perspective, this is fine -- kernel threads won't attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes, we also allow any paging-structure caches that may exist on the CPU to become incoherent.  This means that we can have a paging-structure cache entry that references a freed page table, and the CPU is within its rights to do a speculative page walk starting at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install garbage in the TLB.  Such garbage would always be at a user VA, but some AMD CPUs have logic that triggers a machine check when it notices these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables performance
> optimization that assumes PML4, PDP, PDE, and PTE entries are in cacheable
> WB-DRAM; memory type checks may be bypassed, and addresses outside of
> WB-DRAM may result in undefined behavior or NB protocol errors. 1=Disables
> performance optimization and allows PML4, PDP, PDE and PTE entries to be in
> any memory type. Operating systems that maintain page tables in memory types
> other than WB-DRAM must set TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an IO link.
> This may be caused by software which addresses page table structures in a
> memory type other than cacheable WB-DRAM without properly configuring
> MSRC001_0015[TlbCacheDis]. This may occur, for example, when page table
> structure addresses are above top of memory. In such cases, the NB will
> generate an MCE if it sees a mismatch between the memory operation generated
> by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables when we enter a kernel thread -- this seems to be quite cheap except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in lazy CPUs when a page table is freed.
Doing that would require auditing the mm code to make sure that all page table freeing goes through tlb_remove_page() as well as reworking some data structures to implement the improved flush logic.

Reported-by: Markus Trippelsdorf
Reported-by: Adam Borowski
Signed-off-by: Andy Lutomirski
Signed-off-by: Borislav Petkov
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Daniel Borkmann
Cc: Eric Biggers
Cc: Johannes Hirte
Cc: Kees Cook
Cc: Kirill A. Shutemov
Cc: Linus Torvalds
Cc: Nadav Amit
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Roman Kagan
Cc: Thomas Gleixner
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/tlbflush.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

(limited to 'arch/x86/include/asm/tlbflush.h')

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4893abf7f74f..d362161d3291 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
+ * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
+ * it's false, then we immediately switch CR3 when entering a kernel thread.
+ */
+DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
  * two cache lines.
@@ -104,6 +111,23 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
--
cgit v1.2.3


From 4e57b94664fef55aa71cac33b4632fdfdd52b695 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski
Date: Sat, 14 Oct 2017 09:59:50 -0700
Subject: x86/mm: Tidy up "x86/mm: Flush more aggressively in lazy TLB mode"

Due to timezones, commit:

  b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")

was an outdated patch that was well tested and fixed the bug, but didn't address Borislav's review comments.

Tidy it up:

 - The name "tlb_use_lazy_mode()" was highly confusing.  Change it to "tlb_defer_switch_to_init_mm()", which describes what it actually means.

 - Move the static_branch crap into a helper.

 - Improve comments.

Actually removing the debugfs option is in the next patch.
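For reference, with the rename in place the kernel-thread entry path reads roughly as sketched below.  This is an illustration based on the descriptions in these commit messages and on the fragments visible in the tlb.c hunks that follow (tlb_defer_switch_to_init_mm(), cpu_tlbstate.is_lazy, switch_mm()); it is a sketch, not a verbatim quote from the tree.

void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	/* Already running on init_mm's page tables: nothing to defer. */
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	if (tlb_defer_switch_to_init_mm()) {
		/*
		 * Keep the old mm's page tables loaded and just mark this
		 * CPU as lazy; the TLB flush code switches to init_mm the
		 * first time a flush would otherwise be needed.
		 */
		this_cpu_write(cpu_tlbstate.is_lazy, true);
	} else {
		/* Switch to init_mm immediately (cheap when PCID is on). */
		switch_mm(NULL, &init_mm, NULL);
	}
}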
Reported-by: Borislav Petkov
Signed-off-by: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/154ef95428d4592596b6e98b0af1d2747d6cfbf8.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/tlbflush.h |  7 ++++++-
 arch/x86/mm/tlb.c               | 30 ++++++++++++++++++------------
 2 files changed, 24 insertions(+), 13 deletions(-)

(limited to 'arch/x86/include/asm/tlbflush.h')

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index d362161d3291..0d4a1bb7e303 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -87,7 +87,12 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
  * it's false, then we immediately switch CR3 when entering a kernel thread.
  */
-DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+DECLARE_STATIC_KEY_TRUE(__tlb_defer_switch_to_init_mm);
+
+static inline bool tlb_defer_switch_to_init_mm(void)
+{
+	return static_branch_unlikely(&__tlb_defer_switch_to_init_mm);
+}
 
 /*
  * 6 because 6 should be plenty and struct tlb_state will fit in
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7db23f9f804e..5ee3b59baa85 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,7 +30,7 @@ atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
-DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
+DEFINE_STATIC_KEY_TRUE(__tlb_defer_switch_to_init_mm);
 
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
@@ -213,6 +213,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 }
 
 /*
+ * Please ignore the name of this function.  It should be called
+ * switch_to_kernel_thread().
+ *
  * enter_lazy_tlb() is a hint from the scheduler that we are entering a
  * kernel thread or other context without an mm.  Acceptable implementations
  * include doing nothing whatsoever, switching to init_mm, or various clever
@@ -227,7 +230,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	if (static_branch_unlikely(&tlb_use_lazy_mode)) {
+	if (tlb_defer_switch_to_init_mm()) {
 		/*
 		 * There's a significant optimization that may be possible
 		 * here.  We have accurate enough TLB flush tracking that we
@@ -632,7 +635,8 @@ static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
 {
 	char buf[2];
 
-	buf[0] = static_branch_likely(&tlb_use_lazy_mode) ? '1' : '0';
+	buf[0] = static_branch_likely(&__tlb_defer_switch_to_init_mm)
+		? '1' : '0';
 	buf[1] = '\n';
 
 	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
@@ -647,9 +651,9 @@ static ssize_t tlblazy_write_file(struct file *file,
 		return -EINVAL;
 
 	if (val)
-		static_branch_enable(&tlb_use_lazy_mode);
+		static_branch_enable(&__tlb_defer_switch_to_init_mm);
 	else
-		static_branch_disable(&tlb_use_lazy_mode);
+		static_branch_disable(&__tlb_defer_switch_to_init_mm);
 
 	return count;
 }
@@ -660,23 +664,25 @@ static const struct file_operations fops_tlblazy = {
 	.llseek = default_llseek,
 };
 
-static int __init init_tlb_use_lazy_mode(void)
+static int __init init_tlblazy(void)
 {
 	if (boot_cpu_has(X86_FEATURE_PCID)) {
 		/*
-		 * Heuristic: with PCID on, switching to and from
-		 * init_mm is reasonably fast, but remote flush IPIs
-		 * as expensive as ever, so turn off lazy TLB mode.
+		 * If we have PCID, then switching to init_mm is reasonably
+		 * fast.
+		 * If we don't have PCID, then switching to init_mm is
+		 * quite slow, so we default to trying to defer it in the
+		 * hopes that we can avoid it entirely.  The latter approach
+		 * runs the risk of receiving otherwise unnecessary IPIs.
 		 *
 		 * We can't do this in setup_pcid() because static keys
 		 * haven't been initialized yet, and it would blow up
 		 * badly.
 		 */
-		static_branch_disable(&tlb_use_lazy_mode);
+		static_branch_disable(&__tlb_defer_switch_to_init_mm);
 	}
 
-	debugfs_create_file("tlb_use_lazy_mode", S_IRUSR | S_IWUSR,
+	debugfs_create_file("tlb_defer_switch_to_init_mm", S_IRUSR | S_IWUSR,
 			    arch_debugfs_dir, NULL, &fops_tlblazy);
 	return 0;
 }
-late_initcall(init_tlb_use_lazy_mode);
+late_initcall(init_tlblazy);
--
cgit v1.2.3


From 7ac7f2c315ef76437f5119df354d334448534fb5 Mon Sep 17 00:00:00 2001
From: Andy Lutomirski
Date: Sat, 14 Oct 2017 09:59:51 -0700
Subject: x86/mm: Remove debug/x86/tlb_defer_switch_to_init_mm

Borislav thinks that we don't need this knob in a released kernel.  Get rid of it.

Requested-by: Borislav Petkov
Signed-off-by: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/1fa72431924e81e86c164ff7881bf9240d1f1a6c.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/include/asm/tlbflush.h | 20 ++++++++------
 arch/x86/mm/tlb.c               | 58 -----------------------------------------
 2 files changed, 12 insertions(+), 66 deletions(-)

(limited to 'arch/x86/include/asm/tlbflush.h')

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0d4a1bb7e303..c4aed0de565e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,16 +82,20 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
-/*
- * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
- * to init_mm when we switch to a kernel thread (e.g. the idle thread).  If
- * it's false, then we immediately switch CR3 when entering a kernel thread.
- */
-DECLARE_STATIC_KEY_TRUE(__tlb_defer_switch_to_init_mm);
-
 static inline bool tlb_defer_switch_to_init_mm(void)
 {
-	return static_branch_unlikely(&__tlb_defer_switch_to_init_mm);
+	/*
+	 * If we have PCID, then switching to init_mm is reasonably
+	 * fast.  If we don't have PCID, then switching to init_mm is
+	 * quite slow, so we try to defer it in the hopes that we can
+	 * avoid it entirely.  The latter approach runs the risk of
+	 * receiving otherwise unnecessary IPIs.
+	 *
+	 * This choice is just a heuristic.  The tlb code can handle this
+	 * function returning true or false regardless of whether we have
+	 * PCID.
+	 */
+	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
 /*
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5ee3b59baa85..0f3d0cea4d00 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -30,7 +30,6 @@ atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
-DEFINE_STATIC_KEY_TRUE(__tlb_defer_switch_to_init_mm);
 
 static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 			    u16 *new_asid, bool *need_flush)
@@ -629,60 +628,3 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
-
-static ssize_t tlblazy_read_file(struct file *file, char __user *user_buf,
-				 size_t count, loff_t *ppos)
-{
-	char buf[2];
-
-	buf[0] = static_branch_likely(&__tlb_defer_switch_to_init_mm)
-		? '1' : '0';
-	buf[1] = '\n';
-
-	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
-}
-
-static ssize_t tlblazy_write_file(struct file *file,
-		 const char __user *user_buf, size_t count, loff_t *ppos)
-{
-	bool val;
-
-	if (kstrtobool_from_user(user_buf, count, &val))
-		return -EINVAL;
-
-	if (val)
-		static_branch_enable(&__tlb_defer_switch_to_init_mm);
-	else
-		static_branch_disable(&__tlb_defer_switch_to_init_mm);
-
-	return count;
-}
-
-static const struct file_operations fops_tlblazy = {
-	.read = tlblazy_read_file,
-	.write = tlblazy_write_file,
-	.llseek = default_llseek,
-};
-
-static int __init init_tlblazy(void)
-{
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		/*
-		 * If we have PCID, then switching to init_mm is reasonably
-		 * fast.  If we don't have PCID, then switching to init_mm is
-		 * quite slow, so we default to trying to defer it in the
-		 * hopes that we can avoid it entirely.  The latter approach
-		 * runs the risk of receiving otherwise unnecessary IPIs.
-		 *
-		 * We can't do this in setup_pcid() because static keys
-		 * haven't been initialized yet, and it would blow up
-		 * badly.
-		 */
-		static_branch_disable(&__tlb_defer_switch_to_init_mm);
-	}
-
-	debugfs_create_file("tlb_defer_switch_to_init_mm", S_IRUSR | S_IWUSR,
-			    arch_debugfs_dir, NULL, &fops_tlblazy);
-	return 0;
-}
-late_initcall(init_tlblazy);
--
cgit v1.2.3
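Because this view is limited to arch/x86/include/asm/tlbflush.h, the arch/x86/mm/tlb.c side of the original fix -- "set a flag and switch to init_mm the first time we would otherwise need to flush the TLB" -- never appears above.  The sketch below illustrates that non-PCID path.  It reuses the names visible in the hunks above (cpu_tlbstate.loaded_mm, cpu_tlbstate.is_lazy, switch_mm_irqs_off), while the surrounding flush-handler shape (flush_tlb_func_common and its arguments) is an assumption for illustration, not a quote from the tree.

/*
 * Illustrative sketch only: how the per-CPU flush path can honor lazy mode.
 * When a flush request targets an mm that this CPU is merely borrowing
 * lazily, switching to init_mm both skips the pointless flush and restores
 * paging-structure-cache coherence for any freed page tables.
 */
static void flush_tlb_func_common(const struct flush_tlb_info *f,
				  bool local, enum tlb_flush_reason reason)
{
	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

	/* Kernel threads already running on init_mm have nothing to flush. */
	if (unlikely(loaded_mm == &init_mm))
		return;

	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * Lazy mode: leave it by switching to init_mm instead of
		 * flushing a TLB we are not really using.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}

	/* ... normal flush handling for the actively-used mm follows ... */
}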