x86/asm/64: Align start of __clear_user() loop to 16-bytes

x86 CPUs can suffer severe performance drops if a tight loop, such as the ones in __clear_user(), straddles a 16-byte instruction fetch window, or worse, a 64-byte cacheline. This issues was discovered in the SUSE kernel with the following commit, 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants") which increased the code object size from 10 bytes to 15 bytes and caused the 8-byte copy loop in __clear_user() to be split across a 64-byte cacheline. Aligning the start of the loop to 16-bytes makes this fit neatly inside a single instruction fetch window again and restores the performance of __clear_user() which is used heavily when reading from /dev/zero. Here are some numbers from running libmicro's read_z* and pread_z* microbenchmarks which read from /dev/zero: Zen 1 (Naples) libmicro-file 5.7.0-rc6 5.7.0-rc6 5.7.0-rc6 revert-1153933703d9+ align16+ Time mean95-pread_z100k 9.9195 ( 0.00%) 5.9856 ( 39.66%) 5.9938 ( 39.58%) Time mean95-pread_z10k 1.1378 ( 0.00%) 0.7450 ( 34.52%) 0.7467 ( 34.38%) Time mean95-pread_z1k 0.2623 ( 0.00%) 0.2251 ( 14.18%) 0.2252 ( 14.15%) Time mean95-pread_zw100k 9.9974 ( 0.00%) 6.0648 ( 39.34%) 6.0756 ( 39.23%) Time mean95-read_z100k 9.8940 ( 0.00%) 5.9885 ( 39.47%) 5.9994 ( 39.36%) Time mean95-read_z10k 1.1394 ( 0.00%) 0.7483 ( 34.33%) 0.7482 ( 34.33%) Note that this doesn't affect Haswell or Broadwell microarchitectures which seem to avoid the alignment issue by executing the loop straight out of the Loop Stream Detector (verified using perf events). Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants") Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # v4.19+ Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk
author: Matt Fleming <matt@codeblueprint.co.uk> 2020-06-18 11:20:02 +0100
committer: Borislav Petkov <bp@suse.de> 2020-06-19 18:32:11 +0200
commit: bb5570ad3b54e7930997aec76ab68256d5236d94 (patch)
tree: ec6baa21752c942f30902609460f8c9d52cedcc9
parent: a13b9d0b97211579ea63b96c606de79b963c0f47 (diff)
download: linux-bb5570ad3b54e7930997aec76ab68256d5236d94.tar.bz2
1 files changed, 1 insertions, 0 deletions
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..b0dfac3d3df7 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
 	asm volatile(
 		"	testq  %[size8],%[size8]\n"
 		"	jz     4f\n"
+		"	.align 16\n"
 		"0:	movq $0,(%[dst])\n"
 		"	addq   $8,%[dst]\n"
 		"	decl %%ecx ; jnz   0b\n"
author	Matt Fleming <matt@codeblueprint.co.uk>	2020-06-18 11:20:02 +0100
committer	Borislav Petkov <bp@suse.de>	2020-06-19 18:32:11 +0200
commit	bb5570ad3b54e7930997aec76ab68256d5236d94 (patch)
tree	ec6baa21752c942f30902609460f8c9d52cedcc9
parent	a13b9d0b97211579ea63b96c606de79b963c0f47 (diff)
download	linux-bb5570ad3b54e7930997aec76ab68256d5236d94.tar.bz2