summaryrefslogtreecommitdiffstats
path: root/kernel/cgroup/cgroup.c
AgeCommit message (Collapse)AuthorFilesLines
2019-07-01Merge tag 'v5.2-rc6' into for-5.3/blockJens Axboe1-30/+76
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak fix. Otherwise we end up having conflicts with future patches between for-5.3/block and master that touch this area. In particular, it makes the bio_full() fix hard to backport to stable. * tag 'v5.2-rc6': (482 commits) Linux 5.2-rc6 Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock" Bluetooth: Fix regression with minimum encryption key size alignment tcp: refine memory limit test in tcp_fragment() x86/vdso: Prevent segfaults due to hoisted vclock reads SUNRPC: Fix a credential refcount leak Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE" net :sunrpc :clnt :Fix xps refcount imbalance on the error path NFS4: Only set creation opendata if O_CREAT ARM: 8867/1: vdso: pass --be8 to linker if necessary KVM: nVMX: reorganize initial steps of vmx_set_nested_state KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries habanalabs: use u64_to_user_ptr() for reading user pointers nfsd: replace Jeff by Chuck as nfsd co-maintainer inet: clear num_timeout reqsk_alloc() PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier ...
2019-06-28Merge branch 'for-mingo' of ↵Ingo Molnar1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull rcu/next + tools/memory-model changes from Paul E. McKenney: - RCU flavor consolidation cleanups and optmizations - Documentation updates - Miscellaneous fixes - SRCU updates - RCU-sync flavor consolidation - Torture-test updates - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-21cgroup: export css_next_descendant_pre for bfqChristoph Hellwig1-0/+1
The bfq schedule now uses css_next_descendant_pre directly after the stats functionality depending on it has been from the core blk-cgroup code to bfq. Export the symbol so that bfq can still be build modular. Fixes: d6258980daf2 ("bfq-iosched: move bfq_stat_recursive_sum into the only caller") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-30/+76
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14Merge branch 'for-5.2-fixes' of ↵Linus Torvalds1-30/+76
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This has an unusually high density of tricky fixes: - task_get_css() could deadlock when it races against a dying cgroup. - cgroup.procs didn't list thread group leaders with live threads. This could mislead readers to think that a cgroup is empty when it's not. Fixed by making PROCS iterator include dead tasks. I made a couple mistakes making this change and this pull request contains a couple follow-up patches. - When cpusets run out of online cpus, it updates cpusmasks of member tasks in bizarre ways. Joel improved the behavior significantly" * 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: restore sanity to cpuset_cpus_allowed_fallback() cgroup: Fix css_task_iter_advance_css_set() cset skip condition cgroup: css_task_iter_skip()'d iterators must be advanced before accessed cgroup: Include dying leaders with live threads in PROCS iterations cgroup: Implement css_task_iter_skip() cgroup: Call cgroup_release() before __exit_signal() docs cgroups: add another example size for hugetlb cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
2019-06-14cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFSTejun Heo1-42/+42
a5e112e6424a ("cgroup: add cgroup_parse_float()") accidentally added cgroup_parse_float() inside CONFIG_SYSFS block. Move it outside so that it doesn't cause failures on !CONFIG_SYSFS builds. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a5e112e6424a ("cgroup: add cgroup_parse_float()")
2019-06-10Merge branch 'for-5.2-fixes' into for-5.3Tejun Heo1-1/+5
2019-06-10cgroup: Fix css_task_iter_advance_css_set() cset skip conditionTejun Heo1-1/+1
While adding handling for dying task group leaders c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") added an inverted cset skip condition to css_task_iter_advance_css_set(). It should skip cset if it's completely empty but was incorrectly testing for the inverse condition for the dying_tasks list. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
2019-06-10cgroup/bfq: revert bfq.weight symlink changeJens Axboe1-29/+4
There's some discussion on how to do this the best, and Tejun prefers that BFQ just create the file itself instead of having cgroups support a symlink feature. Hence revert commit 54b7b868e826 and 19e9da9e86c4 for 5.2, and this can be done properly for 5.3. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-2/+14
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07cgroup: let a symlink too be created with a cftype fileAngelo Ruocco1-4/+29
This commit enables a cftype to have a symlink (of any name) that points to the file associated with the cftype. Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05cgroup: css_task_iter_skip()'d iterators must be advanced before accessedTejun Heo1-0/+4
b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") introduced css_task_iter_skip() which is used to fix task iterations skipping dying threadgroup leaders with live threads. Skipping is implemented as a subportion of full advancing but css_task_iter_next() forgot to fully advance a skipped iterator before determining the next task to visit causing it to return invalid task pointers. Fix it by making css_task_iter_next() fully advance the iterator if it has been skipped since the previous iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: syzbot Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
2019-06-01mm, memcg: consider subtrees in memory.eventsChris Down1-2/+14
memory.stat and other files already consider subtrees in their output, and we should too in order to not present an inconsistent interface. The current situation is fairly confusing, because people interacting with cgroups expect hierarchical behaviour in the vein of memory.stat, cgroup.events, and other files. For example, this causes confusion when debugging reclaim events under low, as currently these always read "0" at non-leaf memcg nodes, which frequently causes people to misdiagnose breach behaviour. The same confusion applies to other counters in this file when debugging issues. Aggregation is done at write time instead of at read-time since these counters aren't hot (unlike memory.stat which is per-page, so it does it at read time), and it makes sense to bundle this with the file notifications. After this patch, events are propagated up the hierarchy: [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events low 0 high 0 max 0 oom 0 oom_kill 0 [root@ktst ~]# systemd-run -p MemoryMax=1 true Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events low 0 high 0 max 7 oom 1 oom_kill 1 As this is a change in behaviour, this can be reverted to the old behaviour by mounting with the `memory_localevents' flag set. However, we use the new behaviour by default as there's a lack of evidence that there are any current users of memory.events that would find this change undesirable. akpm: this is a behaviour change, so Cc:stable. THis is so that forthcoming distros which use cgroup v2 are more likely to pick up the revised behaviour. Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-31cgroup: add cgroup_parse_float()Tejun Heo1-0/+43
cgroup already uses floating point for percent[ile] numbers and there are several controllers which want to take them as input. Add a generic parse helper to handle inputs. Update the interface convention documentation about the use of percentage numbers. While at it, also clarify the default time unit. Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-31cgroup: Include dying leaders with live threads in PROCS iterationsTejun Heo1-7/+37
CSS_TASK_ITER_PROCS currently iterates live group leaders; however, this means that a process with dying leader and live threads will be skipped. IOW, cgroup.procs might be empty while cgroup.threads isn't, which is confusing to say the least. Fix it by making cset track dying tasks and include dying leaders with live threads in PROCS iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>
2019-05-31cgroup: Implement css_task_iter_skip()Tejun Heo1-24/+36
When a task is moved out of a cset, task iterators pointing to the task are advanced using the normal css_task_iter_advance() call. This is fine but we'll be tracking dying tasks on csets and thus moving tasks from cset->tasks to (to be added) cset->dying_tasks. When we remove a task from cset->tasks, if we advance the iterators, they may move over to the next cset before we had the chance to add the task back on the dying list, which can allow the task to escape iteration. This patch separates out skipping from advancing. Skipping only moves the affected iterators to the next pointer rather than fully advancing it and the following advancing will recognize that the cursor has already been moved forward and do the rest of advancing. This ensures that when a task moves from one list to another in its cset, as long as it moves in the right direction, it's always visible to iteration. This doesn't cause any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>
2019-05-28bpf: decouple the lifetime of cgroup_bpf from cgroup itselfRoman Gushchin1-3/+8
Currently the lifetime of bpf programs attached to a cgroup is bound to the lifetime of the cgroup itself. It means that if a user forgets (or intentionally avoids) to detach a bpf program before removing the cgroup, it will stay attached up to the release of the cgroup. Since the cgroup can stay in the dying state (the state between being rmdir()'ed and being released) for a very long time, it leads to a waste of memory. Also, it blocks a possibility to implement the memcg-based memory accounting for bpf objects, because a circular reference dependency will occur. Charged memory pages are pinning the corresponding memory cgroup, and if the memory cgroup is pinning the attached bpf program, nothing will be ever released. A dying cgroup can not contain any processes, so the only chance for an attached bpf program to be executed is a live socket associated with the cgroup. So in order to release all bpf data early, let's count associated sockets using a new percpu refcounter. On cgroup removal the counter is transitioned to the atomic mode, and as soon as it reaches 0, all bpf programs are detached. Because cgroup_bpf_release() can block, it can't be called from the percpu ref counter callback directly, so instead an asynchronous work is scheduled. The reference counter is not socket specific, and can be used for any other types of programs, which can be executed from a cgroup-bpf hook outside of the process context, had such a need arise in the future. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: jolsa@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28locking/percpu-rwsem: Add DEFINE_PERCPU_RWSEM(), use it to initialize ↵Oleg Nesterov1-2/+1
cgroup_threadgroup_rwsem Turn DEFINE_STATIC_PERCPU_RWSEM() into __DEFINE_PERCPU_RWSEM() with the additional "is_static" argument to introduce DEFINE_PERCPU_RWSEM(). Change cgroup.c to use DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25cpuset: move mount -t cpuset logics into cgroup.cAl Viro1-0/+47
... and get rid of the weird dances in ->get_tree() - that logics can be easily handled in ->init_fs_context(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25no need to protect against put_user_ns(NULL)Al Viro1-2/+1
it's a no-op Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-14kernel/sched/psi.c: expose pressure metrics on root cgroupDan Schatzberg1-6/+12
Pressure metrics are already recorded and exposed in procfs for the entire system, but any tool which monitors cgroup pressure has to special case the root cgroup to read from procfs. This patch exposes the already recorded pressure metrics on the root cgroup. Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com Signed-off-by: Dan Schatzberg <dschatzberg@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14psi: introduce psi monitorSuren Baghdasaryan1-2/+69
Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric raises above user-defined threshold within user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when system enters stall state for the monitored psi metric and deactivate upon exit from the stall state. While system is in the stall state psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-06cgroup: get rid of cgroup_freezer_frozen_exit()Roman Gushchin1-3/+2
A task should never enter the exit path with the task->frozen bit set. Any frozen task must enter the signal handling loop and the only way to escape is through cgroup_leave_frozen(true), which unconditionally drops the task->frozen bit. So it means that cgroyp_freezer_frozen_exit() has zero chances to be called and has to be removed. Let's put a WARN_ON_ONCE() instead of the cgroup_freezer_frozen_exit() call to catch any potential leak of the task's frozen bit. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-06cgroup: Remove unused cgrp variableShaokun Zhang1-3/+0
The 'cgrp' is set but not used in commit <76f969e8948d8> ("cgroup: cgroup v2 freezer"). Remove it to avoid [-Wunused-but-set-variable] warning. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-04-19cgroup: add tracing points for cgroup v2 freezerRoman Gushchin1-0/+2
Add cgroup:cgroup_freeze and cgroup:cgroup_unfreeze events, which are using the existing cgroup tracing infrastructure. Add the cgroup_event event class, which is similar to the cgroup class, but contains an additional integer field to store a new value (the level field is dropped). Also add two tracing events: cgroup_notify_populated and cgroup_notify_frozen, which are raised in a generic way using the TRACE_CGROUP_PATH() macro. This allows to trace cgroup state transitions and is generally helpful for debugging the cgroup freezer code. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-04-19cgroup: cgroup v2 freezerRoman Gushchin1-4/+106
Cgroup v1 implements the freezer controller, which provides an ability to stop the workload in a cgroup and temporarily free up some resources (cpu, io, network bandwidth and, potentially, memory) for some other tasks. Cgroup v2 lacks this functionality. This patch implements freezer for cgroup v2. Cgroup v2 freezer tries to put tasks into a state similar to jobctl stop. This means that tasks can be killed, ptraced (using PTRACE_SEIZE*), and interrupted. It is possible to attach to a frozen task, get some information (e.g. read registers) and detach. It's also possible to migrate a frozen tasks to another cgroup. This differs cgroup v2 freezer from cgroup v1 freezer, which mostly tried to imitate the system-wide freezer. However uninterruptible sleep is fine when all tasks are going to be frozen (hibernation case), it's not the acceptable state for some subset of the system. Cgroup v2 freezer is not supporting freezing kthreads. If a non-root cgroup contains kthread, the cgroup still can be frozen, but the kthread will remain running, the cgroup will be shown as non-frozen, and the notification will not be delivered. * PTRACE_ATTACH is not working because non-fatal signal delivery is blocked in frozen state. There are some interface differences between cgroup v1 and cgroup v2 freezer too, which are required to conform the cgroup v2 interface design principles: 1) There is no separate controller, which has to be turned on: the functionality is always available and is represented by cgroup.freeze and cgroup.events cgroup control files. 2) The desired state is defined by the cgroup.freeze control file. Any hierarchical configuration is allowed. 3) The interface is asynchronous. The actual state is available using cgroup.events control file ("frozen" field). There are no dedicated transitional states. 4) It's allowed to make any changes with the cgroup hierarchy (create new cgroups, remove old cgroups, move tasks between cgroups) no matter if some cgroups are frozen. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com> Cc: kernel-team@fb.com
2019-04-19cgroup: protect cgroup->nr_(dying_)descendants by css_set_lockRoman Gushchin1-0/+6
The number of descendant cgroups and the number of dying descendant cgroups are currently synchronized using the cgroup_mutex. The number of descendant cgroups will be required by the cgroup v2 freezer, which will use it to determine if a cgroup is frozen (depending on total number of descendants and number of frozen descendants). It's not always acceptable to grab the cgroup_mutex, especially from quite hot paths (e.g. exit()). To avoid this, let's additionally synchronize these counters using the css_set_lock. So, it's safe to read these counters with either cgroup_mutex or css_set_lock locked, and for changing both locks should be acquired. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com
2019-04-19cgroup: implement __cgroup_task_count() helperRoman Gushchin1-0/+33
The helper is identical to the existing cgroup_task_count() except it doesn't take the css_set_lock by itself, assuming that the caller does. Also, move cgroup_task_count() implementation into kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: kernel-team@fb.com
2019-04-04cgroup: remove extra cgroup_migrate_finish() callShakeel Butt1-4/+1
The callers of cgroup_migrate_prepare_dst() correctly call cgroup_migrate_finish() for success and failure cases both. No need to call it in cgroup_migrate_prepare_dst() in failure case. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-03-12Merge branch 'work.mount' of ↵Linus Torvalds1-90/+133
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount infrastructure updates from Al Viro: "The rest of core infrastructure; no new syscalls in that pile, but the old parts are switched to new infrastructure. At that point conversions of individual filesystems can happen independently; some are done here (afs, cgroup, procfs, etc.), there's also a large series outside of that pile dealing with NFS (quite a bit of option-parsing stuff is getting used there - it's one of the most convoluted filesystems in terms of mount-related logics), but NFS bits are the next cycle fodder. It got seriously simplified since the last cycle; documentation is probably the weakest bit at the moment - I considered dropping the commit introducing Documentation/filesystems/mount_api.txt (cutting the size increase by quarter ;-), but decided that it would be better to fix it up after -rc1 instead. That pile allows to do followup work in independent branches, which should make life much easier for the next cycle. fs/super.c size increase is unpleasant; there's a followup series that allows to shrink it considerably, but I decided to leave that until the next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits) afs: Use fs_context to pass parameters over automount afs: Add fs_context support vfs: Add some logging to the core users of the fs_context log vfs: Implement logging through fs_context vfs: Provide documentation for new mount API vfs: Remove kern_mount_data() hugetlbfs: Convert to fs_context cpuset: Use fs_context kernfs, sysfs, cgroup, intel_rdt: Support fs_context cgroup: store a reference to cgroup_ns into cgroup_fs_context cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper cgroup_do_mount(): massage calling conventions cgroup: stash cgroup_root reference into cgroup_fs_context cgroup2: switch to option-by-option parsing cgroup1: switch to option-by-option parsing cgroup: take options parsing into ->parse_monolithic() cgroup: fold cgroup1_mount() into cgroup1_get_tree() cgroup: start switching to fs_context ipc: Convert mqueue fs to fs_context proc: Add fs_context support to procfs ...
2019-03-07Merge branch 'for-5.1' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Oleg's pids controller accounting update which gets rid of rcu delay in pids accounting updates - rstat (cgroup hierarchical stat collection mechanism) optimization - Doc updates * 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: remove unused task_has_mempolicy() cgroup, rstat: Don't flush subtree root unless necessary cgroup: add documentation for pids.events file Documentation: cgroup-v2: eliminate markup warnings MAINTAINERS: Update cgroup entry cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting
2019-03-05kernel: cgroup: add poll file operationJohannes Weiner1-0/+12
Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-1/+1
Pull networking updates from David Miller: "Here we go, another merge window full of networking and #ebpf changes: 1) Snoop DHCPACKS in batman-adv to learn MAC/IP pairs in the DHCP range without dealing with floods of ARP traffic, from Linus Lüssing. 2) Throttle buffered multicast packet transmission in mt76, from Felix Fietkau. 3) Support adaptive interrupt moderation in ice, from Brett Creeley. 4) A lot of struct_size conversions, from Gustavo A. R. Silva. 5) Add peek/push/pop commands to bpftool, as well as bash completion, from Stanislav Fomichev. 6) Optimize sk_msg_clone(), from Vakul Garg. 7) Add SO_BINDTOIFINDEX, from David Herrmann. 8) Be more conservative with local resends due to local congestion, from Yuchung Cheng. 9) Allow vetoing of unsupported VXLAN FDBs, from Petr Machata. 10) Add health buffer support to devlink, from Eran Ben Elisha. 11) Add TXQ scheduling API to mac80211, from Toke Høiland-Jørgensen. 12) Add statistics to basic packet scheduler filter, from Cong Wang. 13) Add GRE tunnel support for mlxsw Spectrum-2, from Nir Dotan. 14) Lots of new IP tunneling forwarding tests, also from Nir Dotan. 15) Add 3ad stats to bonding, from Nikolay Aleksandrov. 16) Lots of probing improvements for bpftool, from Quentin Monnet. 17) Various nfp drive #ebpf JIT improvements from Jakub Kicinski. 18) Allow #ebpf programs to access gso_segs from skb shared info, from Eric Dumazet. 19) Add sock_diag support for AF_XDP sockets, from Björn Töpel. 20) Support 22260 iwlwifi devices, from Luca Coelho. 21) Use rbtree for ipv6 defragmentation, from Peter Oskolkov. 22) Add JMP32 instruction class support to #ebpf, from Jiong Wang. 23) Add spinlock support to #ebpf, from Alexei Starovoitov. 24) Support 256-bit keys and TLS 1.3 in ktls, from Dave Watson. 25) Add device infomation API to devlink, from Jakub Kicinski. 26) Add new timestamping socket options which are y2038 safe, from Deepa Dinamani. 27) Add RX checksum offloading for various sh_eth chips, from Sergei Shtylyov. 28) Flow offload infrastructure, from Pablo Neira Ayuso. 29) Numerous cleanups, improvements, and bug fixes to the PHY layer and many drivers from Heiner Kallweit. 30) Lots of changes to try and make packet scheduler classifiers run lockless as much as possible, from Vlad Buslov. 31) Support BCM957504 chip in bnxt_en driver, from Erik Burrows. 32) Add concurrency tests to tc-tests infrastructure, from Vlad Buslov. 33) Add hwmon support to aquantia, from Heiner Kallweit. 34) Allow 64-bit values for SO_MAX_PACING_RATE, from Eric Dumazet. And I would be remiss if I didn't thank the various major networking subsystem maintainers for integrating much of this work before I even saw it. Alexei Starovoitov, Daniel Borkmann, Pablo Neira Ayuso, Johannes Berg, Kalle Valo, and many others. Thank you!" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2207 commits) net/sched: avoid unused-label warning net: ignore sysctl_devconf_inherit_init_net without SYSCTL phy: mdio-mux: fix Kconfig dependencies net: phy: use phy_modify_mmd_changed in genphy_c45_an_config_aneg net: dsa: mv88e6xxx: add call to mv88e6xxx_ports_cmode_init to probe for new DSA framework selftest/net: Remove duplicate header sky2: Disable MSI on Dell Inspiron 1545 and Gateway P-79 net/mlx5e: Update tx reporter status in case channels were successfully opened devlink: Add support for direct reporter health state update devlink: Update reporter state to error even if recover aborted sctp: call iov_iter_revert() after sending ABORT team: Free BPF filter when unregistering netdev ip6mr: Do not call __IP6_INC_STATS() from preemptible context isdn: mISDN: Fix potential NULL pointer dereference of kzalloc net: dsa: mv88e6xxx: support in-band signalling on SGMII ports with external PHYs cxgb4/chtls: Prefix adapter flags with CXGB4 net-sysfs: Switch to bitmap_zalloc() mellanox: Switch to bitmap_zalloc() bpf: add test cases for non-pointer sanitiation logic mlxsw: i2c: Extend initialization by querying resources data ...
2019-02-28kernfs, sysfs, cgroup, intel_rdt: Support fs_contextDavid Howells1-17/+14
Make kernfs support superblock creation/mount/remount with fs_context. This requires that sysfs, cgroup and intel_rdt, which are built on kernfs, be made to support fs_context also. Notes: (1) A kernfs_fs_context struct is created to wrap fs_context and the kernfs mount parameters are moved in here (or are in fs_context). (2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra namespace tag parameter is passed in the context if desired (3) kernfs_free_fs_context() is provided as a destructor for the kernfs_fs_context struct, but for the moment it does nothing except get called in the right places. (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to pass, but possibly this should be done anyway in case someone wants to add a parameter in future. (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and the cgroup v1 and v2 mount parameters are all moved there. (6) cgroup1 parameter parsing error messages are now handled by invalf(), which allows userspace to collect them directly. (7) cgroup1 parameter cleanup is now done in the context destructor rather than in the mount/get_tree and remount functions. Weirdies: (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held, but then uses the resulting pointer after dropping the locks. I'm told this is okay and needs commenting. (*) The cgroup refcount web. This really needs documenting. (*) cgroup2 only has one root? Add a suggestion from Thomas Gleixner in which the RDT enablement code is placed into its own function. [folded a leak fix from Andrey Vagin] Signed-off-by: David Howells <dhowells@redhat.com> cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> cc: Tejun Heo <tj@kernel.org> cc: Li Zefan <lizefan@huawei.com> cc: Johannes Weiner <hannes@cmpxchg.org> cc: cgroups@vger.kernel.org cc: fenghua.yu@intel.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: store a reference to cgroup_ns into cgroup_fs_contextAl Viro1-5/+12
... and trim cgroup_do_mount() arguments (renaming it to cgroup_do_get_tree()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup_do_mount(): massage calling conventionsAl Viro1-22/+23
pass it fs_context instead of fs_type/flags/root triple, have it return int instead of dentry and make it deal with setting fc->root. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: stash cgroup_root reference into cgroup_fs_contextAl Viro1-2/+5
Note that this reference is *NOT* contributing to refcount of cgroup_root in question and is valid only until cgroup_do_mount() returns. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup2: switch to option-by-option parsingAl Viro1-29/+33
[again, carved out of patch by dhowells] [NB: we probably want to handle "source" in parse_param here] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup1: switch to option-by-option parsingAl Viro1-14/+6
[dhowells should be the author - it's carved out of his patch] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: take options parsing into ->parse_monolithic()Al Viro1-28/+26
Store the results in cgroup_fs_context. There's a nasty twist caused by the enabling/disabling subsystems - we can't do the checks sensitive to that until cgroup_mutex gets grabbed. Frankly, these checks are complete bullshit (e.g. all,none combination is accepted if all subsystems are disabled; so's cpusets,none and all,cpusets when cpusets is disabled, etc.), but touching that would be a userland-visible behaviour change ;-/ So we do parsing in ->parse_monolithic() and have the consistency checks done in check_cgroupfs_options(), with the latter called (on already parsed options) from cgroup1_get_tree() and cgroup1_reconfigure(). Freeing the strdup'ed strings is done from fs_context destructor, which somewhat simplifies the life for cgroup1_{get_tree,reconfigure}(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: fold cgroup1_mount() into cgroup1_get_tree()Al Viro1-19/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28cgroup: start switching to fs_contextAl Viro1-37/+97
Unfortunately, cgroup is tangled into kernfs infrastructure. To avoid converting all kernfs-based filesystems at once, we need to untangle the remount part of things, instead of having it go through kernfs_sop_remount_fs(). Fortunately, it's not hard to do. This commit just gets cgroup/cgroup1 to use fs_context to deliver options on mount and remount paths. Parsing those is going to be done in the next commits; for now we do pretty much what legacy case does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-31cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix ↵Oleg Nesterov1-6/+9
the accounting The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which needs pids_free() to uncharge the pid. However, ->free() is called from __put_task_struct()->cgroup_free() and this is too late. Even the trivial program which does for (;;) { int pid = fork(); assert(pid >= 0); if (pid) wait(NULL); else exit(0); } can run out of limits because release_task()->call_rcu(delayed_put_task_struct) implies an RCU gp after the task/pid goes away and before the final put(). Test-case: mkdir -p /tmp/CG mount -t cgroup2 none /tmp/CG echo '+pids' > /tmp/CG/cgroup.subtree_control mkdir /tmp/CG/PID echo 2 > /tmp/CG/PID/pids.max perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' & echo $! > /tmp/CG/PID/cgroup.procs Without this patch the forking process fails soon after migration. Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite into the new helper, cgroup_release(), called by release_task() which actually frees the pid(s). Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2019-01-31bpf, cgroups: clean up kerneldoc warningsValdis Kletnieks1-1/+1
Building with W=1 reveals some bitrot: CC kernel/bpf/cgroup.o kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach' kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach' Add a kerneldoc line for 'flags'. Fixing the warning for 'unused_flags' is best approached by removing the unused parameter on the function call. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-17cgroup: saner refcounting for cgroup_rootAl Viro1-9/+7
* make the reference from superblock to cgroup_root counting - do cgroup_put() in cgroup_kill_sb() whether we'd done percpu_ref_kill() or not; matching grab is done when we allocate a new root. That gives the same refcounting rules for all callers of cgroup_do_mount() - a reference to cgroup_root has been grabbed by caller and it either is transferred to new superblock or dropped. * have cgroup_kill_sb() treat an already killed refcount as "just don't bother killing it, then". * after successful cgroup_do_mount() have cgroup1_mount() recheck if we'd raced with mount/umount from somebody else and cgroup_root got killed. In that case we drop the superblock and bugger off with -ERESTARTSYS, same as if we'd found it in the list already dying. * don't bother with delayed initialization of refcount - it's unreliable and not needed. No need to prevent attempts to bump the refcount if we find cgroup_root of another mount in progress - sget will reuse an existing superblock just fine and if the other sb manages to die before we get there, we'll catch that immediately after cgroup_do_mount(). * don't bother with kernfs_pin_sb() - no need for doing that either. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-17fix cgroup_do_mount() handling of failure exitsAl Viro1-3/+6
same story as with last May fixes in sysfs (7b745a4e4051 "unfuck sysfs_mount()"); new_sb is left uninitialized in case of early errors in kernfs_mount_ns() and papering over it by treating any error from kernfs_mount_ns() as equivalent to !new_ns ends up conflating the cases when objects had never been transferred to a superblock with ones when that has happened and resulting new superblock had been dropped. Easily fixed (same way as in sysfs case). Additionally, there's a superblock leak on kernfs_node_dentry() failure *and* a dentry leak inside kernfs_node_dentry() itself - the latter on probably impossible errors, but the former not impossible to trigger (as the matter of fact, injecting allocation failures at that point *does* trigger it). Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-29Merge branch 'for-4.21' of ↵Linus Torvalds1-21/+39
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Waiman's cgroup2 cpuset support has been finally merged closing one of the last remaining feature gaps. - cgroup.procs could show non-leader threads when cgroup2 threaded mode was used in certain ways. I forgot to push the fix during the last cycle. - A patch to fix mount option parsing when all mount options have been consumed by someone else (LSM). - cgroup_no_v1 boot param can now block named cgroup1 hierarchies too. * 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Add named hierarchy disabling to cgroup_no_v1 boot param cgroup: fix parsing empty mount option string cpuset: Remove set but not used variable 'cs' cgroup: fix CSS_TASK_ITER_PROCS cgroup: Add .__DEBUG__. prefix to debug file names cpuset: Minor cgroup2 interface updates cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug cpuset: Add documentation about the new "cpuset.sched.partition" flag cpuset: Use descriptive text when reading/writing cpuset.sched.partition cpuset: Expose cpus.effective and mems.effective on cgroup v2 root cpuset: Make generate_sched_domains() work with partition cpuset: Make CPU hotplug work with partition cpuset: Track cpusets that use parent's effective_cpus cpuset: Add an error state to cpuset.sched.partition cpuset: Add new v2 cpuset.sched.partition flag cpuset: Simply allocation and freeing of cpumasks cpuset: Define data structures to support scheduling partition cpuset: Enable cpuset controller in default hierarchy cgroup: remove unnecessary unlikely()
2018-12-28Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds1-9/+39
Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
2018-12-28cgroup: fix parsing empty mount option stringOndrej Mosnacek1-1/+1
This fixes the case where all mount options specified are consumed by an LSM and all that's left is an empty string. In this case cgroupfs should accept the string and not fail. How to reproduce (with SELinux enabled): # umount /sys/fs/cgroup/unified # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error. # dmesg | tail -n 1 [ 31.575952] cgroup: cgroup2: unknown option "" Fixes: 67e9c74b8a87 ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type") [NOTE: should apply on top of commit 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase] Suggested-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2018-12-27Merge branch 'for-4.20-fixes' into for-4.21Tejun Heo1-12/+17