From fe4ba3c34352b7e8068b7f18eb233444aed17011 Mon Sep 17 00:00:00 2001 From: Chris Metcalf Date: Wed, 24 Jun 2015 16:55:45 -0700 Subject: watchdog: add watchdog_cpumask sysctl to assist nohz Change the default behavior of watchdog so it only runs on the housekeeping cores when nohz_full is enabled at build and boot time. Allow modifying the set of cores the watchdog is currently running on with a new kernel.watchdog_cpumask sysctl. In the current system, the watchdog subsystem runs a periodic timer that schedules the watchdog kthread to run. However, nohz_full cores are designed to allow userspace application code running on those cores to have 100% access to the CPU. So the watchdog system prevents the nohz_full application code from being able to run the way it wants to, thus the motivation to suppress the watchdog on nohz_full cores, which this patchset provides by default. However, if we disable the watchdog globally, then the housekeeping cores can't benefit from the watchdog functionality. So we allow disabling it only on some cores. See Documentation/lockup-watchdogs.txt for more information. [jhubbard@nvidia.com: fix a watchdog crash in some configurations] Signed-off-by: Chris Metcalf Acked-by: Don Zickus Cc: Ingo Molnar Cc: Ulrich Obergfell Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Frederic Weisbecker Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/lockup-watchdogs.txt | 18 ++++++++++++++++++ Documentation/sysctl/kernel.txt | 21 +++++++++++++++++++++ 2 files changed, 39 insertions(+) (limited to 'Documentation') diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt index ab0baa692c13..22dd6af2e4bd 100644 --- a/Documentation/lockup-watchdogs.txt +++ b/Documentation/lockup-watchdogs.txt @@ -61,3 +61,21 @@ As explained above, a kernel knob is provided that allows administrators to configure the period of the hrtimer and the perf event. The right value for a particular environment is a trade-off between fast response to lockups and detection overhead. + +By default, the watchdog runs on all online cores. However, on a +kernel configured with NO_HZ_FULL, by default the watchdog runs only +on the housekeeping cores, not the cores specified in the "nohz_full" +boot argument. If we allowed the watchdog to run by default on +the "nohz_full" cores, we would have to run timer ticks to activate +the scheduler, which would prevent the "nohz_full" functionality +from protecting the user code on those cores from the kernel. +Of course, disabling it by default on the nohz_full cores means that +when those cores do enter the kernel, by default we will not be +able to detect if they lock up. However, allowing the watchdog +to continue to run on the housekeeping (non-tickless) cores means +that we will continue to detect lockups properly on those cores. + +In either case, the set of cores excluded from running the watchdog +may be adjusted via the kernel.watchdog_cpumask sysctl. For +nohz_full cores, this may be useful for debugging a case where the +kernel seems to be hanging on the nohz_full cores. diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index c831001c45f1..e5d528e0c46e 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -923,6 +923,27 @@ and nmi_watchdog. ============================================================== +watchdog_cpumask: + +This value can be used to control on which cpus the watchdog may run. +The default cpumask is all possible cores, but if NO_HZ_FULL is +enabled in the kernel config, and cores are specified with the +nohz_full= boot argument, those cores are excluded by default. +Offline cores can be included in this mask, and if the core is later +brought online, the watchdog will be started based on the mask value. + +Typically this value would only be touched in the nohz_full case +to re-enable cores that by default were not running the watchdog, +if a kernel lockup was suspected on those cores. + +The argument value is the standard cpulist format for cpumasks, +so for example to enable the watchdog on cores 0, 2, 3, and 4 you +might say: + + echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask + +============================================================== + watchdog_thresh: This value can be used to control the frequency of hrtimer and NMI -- cgit v1.2.3 From 9b012a29a300ea780d919205906d00d15cc6286e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 24 Jun 2015 16:57:50 -0700 Subject: Documentation/vm/unevictable-lru.txt: clarify MAP_LOCKED behavior There is a very subtle difference between mmap()+mlock() vs mmap(MAP_LOCKED) semantic. The former one fails if the population of the area fails while the later one doesn't. This basically means that mmap(MAPLOCKED) areas might see major fault after mmap syscall returns which is not the case for mlock. mmap man page has already been altered but Documentation/vm/unevictable-lru.txt deserves a clarification as well. Signed-off-by: Michal Hocko Reported-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 3be0bfc4738d..32ee3a67dba2 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -467,7 +467,13 @@ mmap(MAP_LOCKED) SYSTEM CALL HANDLING In addition the mlock()/mlockall() system calls, an application can request that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap() -call. Furthermore, any mmap() call or brk() call that expands the heap by a +call. There is one important and subtle difference here, though. mmap() + mlock() +will fail if the range cannot be faulted in (e.g. because mm_populate fails) +and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmaped +area will still have properties of the locked area - aka. pages will not get +swapped out - but major page faults to fault memory in might still happen. + +Furthermore, any mmap() call or brk() call that expands the heap by a task that has previously called mlockall() with the MCL_FUTURE flag will result in the newly mapped memory being mlocked. Before the unevictable/mlock changes, the kernel simply called make_pages_present() to allocate pages and -- cgit v1.2.3