path: root/block/blk-cgroup.c
2020-12-01  block: simplify bdev/disk lookup in blkdev_get  (Christoph Hellwig, 1 file, -21/+21)
To simplify block device lookup and a few other upcoming areas, make sure that we always have a struct block_device available for each disk and each partition, and only find existing block devices in bdget. The only downside of this is that each device and partition uses a little more memory. The upside will be that a lot of code can be simplified.

With that, all we need to look up the block device is to look up the inode and do a few sanity checks on the gendisk, instead of the separate lookup for the gendisk. For blk-cgroup, which wants to access a gendisk without opening it, a new blkdev_{get,put}_no_open low-level interface is added to replace the previous get_gendisk use; see the sketch below.

Note that the change to look up the block device directly instead of the two-step lookup using struct gendisk causes a subtle change in behavior: accessing a non-existing partition on an existing block device can now cause a call to request_module. That call is harmless, and in practice no recent system will access these nodes as they aren't created by udev and static /dev/ setups are unusual.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
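A minimal sketch of the new lookup from a blk-cgroup style caller, assuming only the interface named above (error handling simplified):

    /* resolve a dev_t to a block_device without calling ->open */
    struct block_device *bdev;

    bdev = blkdev_get_no_open(MKDEV(major, minor));
    if (!bdev)
        return -ENODEV;                /* device or partition doesn't exist */

    /* ... inspect bdev->bd_disk, e.g. from blkg_conf_prep() ... */

    blkdev_put_no_open(bdev);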
2020-10-26  blk-cgroup: Pre-allocate tree node on blkg_conf_prep  (Gabriel Krisman Bertazi, 1 file, -2/+12)
Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq outside request queue spinlock"), blkg_create can also trigger occasional -ENOMEM failures at the radix insertion because any allocation inside blkg_create has to be non-blocking, making it more likely to fail. This causes trouble for userspace tools trying to configure io weights, which need to deal with this condition.

This patch reduces the occurrence of -ENOMEMs on this path by preloading the radix tree element in a GFP_KERNEL context, such that we guarantee the later non-blocking insertion won't fail. A similar solution exists in blkcg_init_queue for the same situation.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
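The guarantee comes from the standard radix-tree preload pattern; a sketch using the identifiers blkg_create already works with (the exact diff may differ):

    /* preload in a blocking context so the atomic insert can't run out of nodes */
    ret = radix_tree_preload(GFP_KERNEL);     /* may block */
    if (ret)
        return ret;

    spin_lock_irq(&q->queue_lock);            /* non-blocking from here on */
    ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
    spin_unlock_irq(&q->queue_lock);

    radix_tree_preload_end();                 /* drop the per-cpu reservation */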
2020-10-26  blk-cgroup: Fix memleak on error path  (Gabriel Krisman Bertazi, 1 file, -0/+1)
If the new_blkg allocation races with a blk_policy change and blkg_lookup_check() fails, new_blkg is leaked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10  blkcg: add plugging support for punt bio  (Xianting Tian, 1 file, -0/+9)
The test and the explanation of the patch are as below.

Before the test, we added more debug code in blkg_async_bio_workfn():

    int count = 0;

    if (bios.head && bios.head->bi_next) {
        need_plug = true;
        blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios))) {
        /* io_punt is a sysctl user interface to control the print */
        if (io_punt) {
            printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                   current->comm, current->pid, bio->bi_iter.bi_sector,
                   (bio->bi_iter.bi_size)>>9, count++, need_plug);
        }
        submit_bio(bio);
    }
    if (need_plug)
        blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:

    mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
    mount -t cgroup2 nodev /cgroup2
    mkdir /cgroup2/cg3
    echo "+io" > /cgroup2/cgroup.subtree_control
    echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max  #1000M/s
    echo $$ > /cgroup2/cg3/cgroup.procs

Then use the dd command to test btrfs PUNT io in the current shell:

    dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:

    [root@localhost btrfs]# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              32
    On-line CPU(s) list: 0-31
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel

With the above debug code, test command and test environment, I did the tests under 3 different system loads, which are triggered by stress:

1, Run 64 threads by command "stress -c 64 &"

    [53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
    [53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
    [53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
    [53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
    [53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
    [53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
    ... ...
    [53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
    [53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
    [53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
    [53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
    [53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
    [53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
    [53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"

    [50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
    [50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
    [50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
    [50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
    [50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
    [50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
    ... ...
    [50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
    [50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
    [50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
    [50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
    [50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
    [50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
    [50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
    [50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run threads by stress

    [50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
    [50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
    [50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
    [50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
    [50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
    [50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
    [50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
    [50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
    [50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
    [50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
    [50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
    [50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
    [50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
    [50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
    [50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
    [50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
    [50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
    [50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
    [50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
    [50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
    [50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results with different system loads:

From the above test, we can see that more and more continuous bios can be plugged as the system load increases. When running "stress -c 64 &", 310 continuous bios are plugged; when running "stress -c 32 &", 260 continuous bios are plugged; when not running stress, at most only 2 continuous bios are plugged, and in most cases bio_list contains only one single bio.

How to explain the above phenomenon: we know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will queue a work item to the workqueue blkcg_punt_bio_wq, but when that work gets scheduled depends on the system load. When system load is low, the workqueue will be quickly scheduled, and the bio in bio_list will be quickly processed in blkg_async_bio_workfn(), so there is less chance that the same io submit thread can add multiple continuous bios to bio_list before the workqueue is scheduled to run. This analysis is aligned with the above test "3". When system load is high, there is some delay before the workqueue can be scheduled to run; the higher the system load, the greater the delay. So there is more chance that the same io submit thread can add multiple continuous bios to bio_list. Then, when the workqueue is scheduled to run, there are more continuous bios in bio_list, which will be processed in blkg_async_bio_workfn(). This analysis is aligned with the above tests "1" and "2".

According to the test, we get io performance improved with the patch, especially when system load is higher. Another optimization is to use the plug only when bio_list contains at least 2 bios; see the sketch below.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
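The resulting change to blkg_async_bio_workfn() is essentially the plugging half of the debug code above; a sketch:

    /* plug only when the punted list holds at least two bios */
    struct blk_plug plug;
    bool need_plug = false;

    if (bios.head && bios.head->bi_next) {
        need_plug = true;
        blk_start_plug(&plug);
    }
    while ((bio = bio_list_pop(&bios)))
        submit_bio(bio);
    if (need_plug)
        blk_finish_plug(&plug);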
2020-09-01  blk-iocost: implement delay adjustment hysteresis  (Tejun Heo, 1 file, -6/+17)
Currently, iocost syncs the delay duration to the outstanding debt amount, which seemed enough to protect the system from anon memory hogs. However, that was mostly because the delay calculation was using hweight_inuse, which quickly converges towards zero under debt, often punishing debtors overly harshly for longer than deserved.

The previous patch fixed the delay calculation and now the protection against anonymous memory hogs isn't enough because the effect of delay is indirect and non-linear and a huge amount of future debt can accumulate abruptly while unthrottled.

This patch implements delay hysteresis so that delay is decayed exponentially over time instead of getting cleared immediately as debt is paid off. While the overall behavior is similar to the blk-cgroup implementation used by blk-iolatency, a lot of the details are different and, due to the empirical nature of the mechanism, it's challenging to adapt the mechanism for one controller without negatively impacting the other.

As the delay is gradually decayed now, there's no point in running it from its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and the dedicated hrtimer is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21  blkcg: fix memleak for iolatency  (Yufen Yu, 1 file, -3/+5)
Normally, blkcg_iolatency_exit() will free related memory in iolatency when cleaning up the queue. But if blk_throtl_init() returns an error and queue init fails, blkcg_iolatency_exit() will not do that for us, causing a memory leak.

Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17  blk-cgroup: show global disk stats in root cgroup io.stat  (Boris Burkov, 1 file, -2/+55)
In order to improve consistency and usability in cgroup stat accounting, we would like to support the root cgroup's io.stat. Since the root cgroup has processes doing io even if the system has no explicitly created cgroups, we need to be careful to avoid overhead in that case. For that reason, the rstat algorithms don't handle the root cgroup, so just turning the file on wouldn't give correct statistics.

To get around this, we simulate flushing the iostat struct by filling it out directly from global disk stats. The result is a root cgroup io.stat file consistent with both /proc/diskstats and io.stat.

Note that in order to collect the disk stats, we needed to iterate over devices. To facilitate that, we had to change the linkage of disk_type to external so that it can be used from blk-cgroup.c to iterate over disks.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
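A sketch of that iteration via the driver core, which becomes possible once disk_type has external linkage (details may differ from the actual patch):

    struct class_dev_iter iter;
    struct device *dev;

    class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
    while ((dev = class_dev_iter_next(&iter))) {
        struct gendisk *disk = dev_to_disk(dev);

        /* ... fold this disk's global stats into the root io.stat output ... */
    }
    class_dev_iter_exit(&iter);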
2020-07-17  blk-cgroup: make iostat functions visible to stat printing  (Boris Burkov, 1 file, -71/+71)
Previously, the code which printed io.stat only needed access to the generic rstat flushing code, but since we plan to write some more specific code for preparing root cgroup stats, we need to manipulate iostat structs directly. Since declaring static functions ahead does not seem like common practice in this file, simply move the iostat functions up. We only plan to use blkg_iostat_set, but it seems better to keep them all together.

Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08  writeback: remove struct bdi_writeback_congested  (Christoph Hellwig, 1 file, -17/+2)
We never set any congested bits in the group writeback instances of it. And for the simpler bdi-wide case a simple scalar field is all that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01  block: move ->make_request_fn to struct block_device_operations  (Christoph Hellwig, 1 file, -1/+1)
The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
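The shape of the new method, and a hypothetical driver using it (my_submit_bio and my_fops are placeholders; the prototype returned blk_qc_t at the time):

    static blk_qc_t my_submit_bio(struct bio *bio)
    {
        struct request_queue *q = bio->bi_disk->queue;  /* derived from the bio */

        /* ... driver-specific bio handling ... */
        return BLK_QC_T_NONE;
    }

    static const struct block_device_operations my_fops = {
        .owner      = THIS_MODULE,
        .submit_bio = my_submit_bio,    /* replaces q->make_request_fn */
    };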
2020-06-30  blk-cgroup: clean up indentation  (Colin Ian King, 1 file, -1/+1)
There is a statement that is indented one level too deeply; fix it by removing a tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29  blk-cgroup: remove blkcg_bio_issue_check  (Christoph Hellwig, 1 file, -0/+34)
blkcg_bio_issue_check is a giant inline function that does three entirely different things. Factor out the blk-cgroup related bio initialization into a new helper, and open code the sequence in the only caller, relying on the fact that all the actual functionality is stubbed out for non-cgroup builds.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29  block: move the initial blkg lookup into blkg_tryget_closest  (Christoph Hellwig, 1 file, -19/+14)
By moving the initial blkg lookup into blkg_tryget_closest we get a nicely self-contained routine that does all the RCU locking.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29  block: bypass blkg_tryget_closest for the root_blkg  (Christoph Hellwig, 1 file, -5/+10)
The root_blkg is only torn down at the very end of removing a queue, so in the I/O submission path it always has a live reference and we can just grab another one using blkg_get instead of doing a tryget and parent walk that won't lead anywhere.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29  block: merge blkg_lookup_create and __blkg_lookup_create  (Christoph Hellwig, 1 file, -31/+18)
No good reason to keep these two functions split.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29  block: move the bio cgroup association helpers to blk-cgroup.c  (Christoph Hellwig, 1 file, -2/+101)
Keep the cgroup code together.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09  Merge branch 'block-5.7' into for-5.8/block  (Jens Axboe, 1 file, -1/+1)
Pull in block-5.7 fixes for 5.8. Mostly to resolve a conflict with the blk-iocost changes, but we also need the base of the bdi use-after-free fix, as we build on top of it.

* block-5.7:
  nvme: fix possible hang when ns scanning fails during error recovery
  nvme-pci: fix "slimmer CQ head update"
  bdi: add a ->dev_name field to struct backing_dev_info
  bdi: use bdi_dev_name() to get device name
  bdi: move bdi_dev_name out of line
  vboxsf: don't use the source name in the bdi name
  iocost: protect iocg->abs_vdebt with iocg->waitq.lock
  block: remove the bd_openers checks in blk_drop_partitions
  nvme: prevent double free in nvme_alloc_ns() error handling
  null_blk: Cleanup zoned device initialization
  null_blk: Fix zoned command handling
  block: remove unused header
  blk-iocost: Fix error on iocost_ioc_vrate_adj
  bdev: Reduce time holding bd_mutex in sync in blkdev_close()
  buffer: remove useless comment and WB_REASON_FREE_MORE_MEM, reason.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09  bdi: use bdi_dev_name() to get device name  (Yufen Yu, 1 file, -1/+1)
Use the common interface bdi_dev_name() to get the device name.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Add missing <linux/backing-dev.h> include to BFQ.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-30  blk-iocost: switch to fixed non-auto-decaying use_delay  (Tejun Heo, 1 file, -0/+6)
The use_delay mechanism was introduced by blk-iolatency to hold memory allocators accountable for the reclaim and other shared IOs they cause. The duration of the delay is dynamically balanced between iolatency increasing the value on each target miss and it auto-decaying as time passes and threads get delayed on it.

While this works well for iolatency, iocost's control model isn't compatible with it. There are no repeated "violation" events which can be balanced against auto-decaying. iocost instead knows how much a given cgroup is over budget and wants to prevent that cgroup from issuing IOs while over budget. Until now, iocost has been adding the cost of force-issued IOs. However, this doesn't reflect the amount which is already over budget and is simply not enough to counter the auto-decaying, allowing an anon-memory-leaking low-priority cgroup to go over its allotted share of IOs.

As auto-decaying doesn't make much sense for iocost, this patch introduces a different mode of operation for use_delay - when blkcg_set_delay() is used instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the delay duration synchronized to the budget overage amount.

With this change, iocost can effectively police cgroups which generate a significant amount of force-issued IOs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
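A usage sketch of the two modes from a controller's point of view (over_budget and delay_nsec are placeholders):

    /* auto-decaying mode, as used by blk-iolatency */
    blkcg_add_delay(blkg, now, delta);        /* decays on its own over time */

    /* fixed mode, as used by iocost after this patch */
    if (over_budget)
        blkcg_set_delay(blkg, delay_nsec);    /* pinned, no auto-decay */
    else
        blkcg_clear_delay(blkg);              /* must be cleared explicitly */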
2020-04-01  blkcg: don't offline parent blkcg first  (Tejun Heo, 1 file, -0/+16)
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered, sometimes causing parents to be offlined before children. Let's fix this by making child blkcgs pin the parents' online states.

Note that pin/unpin names are chosen over get/put intentionally because css uses get/put online for something different.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-01  blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it  (Tejun Heo, 1 file, -3/+3)
blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs don't get offlined while there are active cgwbs on them. However, it ends up making offlining unordered, sometimes causing parents to be offlined before children. To fix it, we want child blkcgs to pin the parents' online states, turning the refcnt into a more generic online pinning mechanism.

In preparation:

* blkcg->cgwb_refcnt -> blkcg->online_pin
* blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
* Take them out of CONFIG_CGROUP_WRITEBACK

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-27  block: simplify queue allocation  (Christoph Hellwig, 1 file, -1/+1)
Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
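Before/after at a call site, as a sketch (my_make_request is a placeholder driver function):

    /* before: allocate, then patch in the make_request_fn */
    q = blk_alloc_queue_node(GFP_KERNEL, node_id);
    blk_queue_make_request(q, my_make_request);

    /* after: one call; the _node variant and the gfp_mask are gone */
    q = blk_alloc_queue(my_make_request, node_id);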
2019-12-12  blk-cgroup: remove blkcg_drain_queue  (Guoqing Jiang, 1 file, -20/+0)
Since blk_drain_queue() has already been removed, this function is not needed anymore.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07  blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT  (Tejun Heo, 1 file, -97/+0)
blkg_rwstat is now only used by bfq-iosched and blk-throtl when on cgroup1. Let's move it into its own files and gate it behind a config option.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07  blk-cgroup: reimplement basic IO stats using cgroup rstat  (Tejun Heo, 1 file, -25/+99)
blk-cgroup has been using blkg_rwstat to track basic IO stats. Unfortunately, reading recursive stats scales badly as it involves walking all descendants. On systems with a huge number of cgroups (dead or alive), this can lead to substantial CPU cost when reading IO stats.

This patch reimplements basic IO stats using cgroup rstat, which uses more memory but makes recursive stat reading O(# descendants which have been active since last reading) instead of O(# descendants).

* blk-cgroup core no longer uses sync/async stats. Introduce new stat enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}.

* Add blkg_iostat[_set] which encapsulates byte and io stats, last values for propagation delta calculation and u64_stats_sync for correctness on 32bit archs.

* Update the new percpu stat counters directly and implement blkcg_rstat_flush() to implement propagation.

* blkg_print_stat() can now bring the stats up to date by calling cgroup_rstat_flush() and print them instead of directly summing up all descendants.

* It now allocates 96 bytes per cpu. It used to be 40 bytes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Cc: Daniel Xu <dlxu@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
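The new carriers, sketched from the description above (field layout as I understand the patch, hedged):

    enum {
        BLKG_IOSTAT_READ,
        BLKG_IOSTAT_WRITE,
        BLKG_IOSTAT_DISCARD,
        BLKG_IOSTAT_NR,
    };

    struct blkg_iostat {
        u64 bytes[BLKG_IOSTAT_NR];
        u64 ios[BLKG_IOSTAT_NR];
    };

    struct blkg_iostat_set {
        struct u64_stats_sync sync;    /* read consistency on 32bit archs */
        struct blkg_iostat    cur;     /* current counters */
        struct blkg_iostat    last;    /* snapshot for propagation deltas */
    };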
2019-11-07  blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()  (Tejun Heo, 1 file, -83/+0)
These don't have users anymore. Remove them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06  blkcg: make blkcg_print_stat() print stats only for online blkgs  (Tejun Heo, 1 file, -5/+8)
blkcg_print_stat() iterates blkgs under RCU and doesn't test whether the blkg is online. This can call into pd_stat_fn() on a pd which is still being initialized, leading to an oops.

The heaviest operation - recursively summing up rwstat counters - is already done while holding the queue_lock. Expand queue_lock to cover the other operations and skip the blkg if it isn't online yet. The online state is protected by both blkcg and queue locks, so this guarantees that only online blkgs are processed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Roman Gushchin <guro@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
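A sketch of the reworked loop in blkcg_print_stat() (simplified):

    hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
        spin_lock_irq(&blkg->q->queue_lock);

        /* online state is protected by both the blkcg and queue locks */
        if (blkg->online) {
            /* ... sum rwstat counters, call pd_stat_fn(), print ... */
        }

        spin_unlock_irq(&blkg->q->queue_lock);
    }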
2019-10-15  blkcg: Fix multiple bugs in blkcg_activate_policy()  (Tejun Heo, 1 file, -18/+51)
blkcg_activate_policy() has the following bugs.

* cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however, blkcg_activate_policy() ends up using pd's allocated for the root blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs can be passed in pd's which are allocated for the root blkcg. For blk-iocost, this means that ->pd_init_fn() can write beyond the end of the allocated object as it determines the length of the flex array at the end based on the blkcg's nesting level.

* Each pd is initialized as they get allocated. If alloc fails, the policy will get freed with pd's initialized on it.

* After the above partial failure, the partial pds are not freed.

This patch fixes all the above issues by

* Restructuring blkcg_activate_policy() so that alloc and init passes are separate. Init takes place only after all allocs succeeded and on failure all allocated pds are freed.

* Unifying and fixing the cleanup of the remaining pd_prealloc.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
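A sketch of the restructured alloc/init passes (simplified; the real function also keeps a pd_prealloc fallback for atomic-context failures):

    /* pass 1: allocate everything; nothing is initialized yet */
    list_for_each_entry(blkg, &q->blkg_list, q_node) {
        struct blkg_policy_data *pd;

        pd = pol->pd_alloc_fn(GFP_KERNEL, q, blkg->blkcg);
        if (!pd)
            goto enomem;               /* free all pds allocated so far */
        blkg->pd[pol->plid] = pd;
    }

    /* pass 2: init only after every allocation succeeded */
    list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
        if (pol->pd_init_fn)
            pol->pd_init_fn(blkg->pd[pol->plid]);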
2019-09-14  bfq: Fix bfq linkage error  (Pavel Begunkov, 1 file, -0/+2)
Since commit 795fe54c2a828099e ("bfq: Add per-device weight"), bfq uses blkg_conf_prep() and blkg_conf_finish(), which are not exported. So it causes a linkage error if bfq is compiled as a module.

Fixes: 795fe54c2a828099e ("bfq: Add per-device weight")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
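Given the +2 line count, the fix is presumably just exporting the two symbols:

    EXPORT_SYMBOL_GPL(blkg_conf_prep);
    EXPORT_SYMBOL_GPL(blkg_conf_finish);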
2019-08-28  blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep()  (Tejun Heo, 1 file, -19/+43)
Separate out blkcg_conf_get_disk() so that it can be used by blkcg policy interface file input parsers before the policy is actually enabled. This doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-28  blkcg: make ->cpd_init_fn() optional  (Tejun Heo, 1 file, -1/+2)
For policies which can do enough initialization from ->cpd_alloc_fn(), make ->cpd_init_fn() optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
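On the caller side this amounts to a NULL check before invoking the hook (sketch):

    if (pol->cpd_init_fn)                        /* may now be left unset */
        pol->cpd_init_fn(blkcg->cpd[pol->plid]);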
2019-08-28  blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()  (Tejun Heo, 1 file, -3/+3)
Instead of @node, pass in @q and @blkcg so that the alloc function has more context. This doesn't cause any behavior change and will be used by the io.weight implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-16  blkcg: allow blkcg_policy->pd_stat() to print non-debug info too  (Tejun Heo, 1 file, -6/+3)
Currently, ->pd_stat() is called only when moduleparam blkcg_debug_stats is set, which prevents it from printing non-debug policy-specific statistics. Let's move debug testing down so that ->pd_stat() can print non-debug stat too. This patch doesn't cause any visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-10  blkcg: implement REQ_CGROUP_PUNT  (Tejun Heo, 1 file, -0/+53)
When a shared kthread needs to issue a bio for a cgroup, doing so synchronously can lead to priority inversions as the kthread can be trapped waiting for that cgroup. This patch implements the REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing to a dedicated per-blkcg work item to avoid such priority inversions.

This will be used to fix priority inversions in btrfs compression and should be generally useful as we grow filesystem support for comprehensive IO control.

Cc: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
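A sketch of the submission-side hook (helper names as I recall them from the tree at the time, hedged):

    static inline bool blkcg_punt_bio_submit(struct bio *bio)
    {
        if (bio->bi_opf & REQ_CGROUP_PUNT)
            return __blkcg_punt_bio_submit(bio);  /* queue on the blkg's bio list */
        return false;
    }

    /* at the top of submit_bio(): */
    if (blkcg_punt_bio_submit(bio))
        return BLK_QC_T_NONE;     /* reissued later from blkcg_punt_bio_wq */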
2019-07-10  cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages  (Tejun Heo, 1 file, -0/+1)
btrfs is going to use css_put() and wbc helpers to improve cgroup writeback support. Add a dummy css_get() definition and export wbc helpers to prepare for module and !CONFIG_CGROUP builds.

Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-10  blk-cgroup: turn on psi memstall stuff  (Josef Bacik, 1 file, -5/+7)
With the psi stuff in place we can use the memstall flag to indicate pressure that happens from throttling.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20  blk-cgroup: move struct blkg_stat to bfq  (Christoph Hellwig, 1 file, -56/+0)
This structure and assorted infrastructure is only used by the bfq I/O scheduler. Move it there instead of bloating the common code.

Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20  blk-cgroup: introduce a new struct blkg_rwstat_sample  (Christoph Hellwig, 1 file, -20/+19)
When sampling the blkcg counts we don't need atomics or per-cpu variables. Introduce a new structure just containing plain u64 counters.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20  blk-cgroup: pass blkg_rwstat structures by reference  (Christoph Hellwig, 1 file, -15/+16)
Returning a structure generates rather bad code, so switch to passing by reference. Also don't require the structure to be zeroed and add to the 0-initialized counters, but actually set the counters to the calculated value.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20  blk-cgroup: factor out a helper to read rwstat counter  (Christoph Hellwig, 1 file, -3/+2)
Trying to break up the crazy statements into something readable. Also switch to an unsigned counter as it can't ever turn negative.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
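The helper, sketched from the rwstat layout (a percpu counter plus an auxiliary atomic; the exact diff may differ):

    static u64 blkg_rwstat_read_counter(struct blkg_rwstat *rwstat,
                                        unsigned int idx)
    {
        return atomic64_read(&rwstat->aux_cnt[idx]) +
            percpu_counter_sum_positive(&rwstat->cpu_cnt[idx]);
    }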
2019-06-15  blkcg: blkcg_activate_policy() should initialize ancestors first  (Tejun Heo, 1 file, -1/+2)
When blkcg_activate_policy() is creating blkg_policy_data for existing blkgs, it did so in the wrong order - descendants first. Fix it. None of the existing controllers seem affected by this.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15  blkcg: percpu_ref init/exit should be done from blkg_alloc/free()  (Tejun Heo, 1 file, -9/+4)
blkg alloc is performed as a separate step from the rest of blkg creation so that GFP_KERNEL allocations can be used when creating blkgs from configuration file writes, because otherwise user actions may fail due to failures of opportunistic GFP_NOWAIT allocations.

While making blkgs use percpu_ref, 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref") incorrectly added an unconditional opportunistic percpu_ref_init() to blkg_create(), breaking this guarantee.

This patch moves percpu_ref_init() to blkg_alloc() so that it uses the @gfp_mask that blkg_alloc() is called with. Also, percpu_ref_exit() is moved to blkg_free() for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref")
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15  blkcg: update blkcg_print_stat() to handle larger outputs  (Tejun Heo, 1 file, -2/+6)
Depending on the number of devices, blkcg stats can go over the default seqfile buf size. seqfile normally retries with a larger buffer but since the ->pd_stat() addition, blkcg_print_stat() doesn't tell seqfile that overflow has happened and the output gets printed truncated. Fix it by calling seq_commit() w/ -1 on possible overflows.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats")
Cc: stable@vger.kernel.org # v4.19+
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-05  block: Drop unlikely before IS_ERR(_OR_NULL)  (Kefeng Wang, 1 file, -1/+1)
IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint, so there is no need to do that again from its callers. Drop it.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30  block: add SPDX tags to block layer files missing licensing information  (Christoph Hellwig, 1 file, -0/+1)
Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-20  blkcg: Fix kernel-doc warnings  (Bart Van Assche, 1 file, -4/+5)
Avoid that the following warnings are reported when building with W=1:

    block/blk-cgroup.c:1755: warning: Function parameter or member 'q' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1755: warning: Function parameter or member 'use_memdelay' not described in 'blkcg_schedule_throttle'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'blkg' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'now' not described in 'blkcg_add_delay'
    block/blk-cgroup.c:1779: warning: Function parameter or member 'delta' not described in 'blkcg_add_delay'

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-10  blk-cgroup: Fix doc related to blkcg_exit_queue  (Marcos Paulo de Souza, 1 file, -1/+1)
Since 4cf6324b17e9, a portion of the function blk_cleanup_queue was moved to a newly created function called blk_exit_queue, including the call to blkcg_exit_queue. So, adjust the documentation accordingly.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-21  blkcg: remove unused __blkg_release_rcu()  (Dennis Zhou, 1 file, -23/+0)
An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref") moved around the release call from blkg_put() to be a part of the percpu_ref cleanup. Remove the additional unused code which should have been removed earlier.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-19  block: save irq state in blkg_lookup_create()  (Ming Lei, 1 file, -2/+4)
blkg_lookup_create() may be called from pool_map() in which irq state is saved, so we have to do that in blkg_lookup_create(). Otherwise, the following lockdep warning can be triggered:

    [ 104.258537] ================================
    [ 104.259129] WARNING: inconsistent lock state
    [ 104.259725] 4.20.0-rc6+ #545 Not tainted
    [ 104.260268] --------------------------------
    [ 104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    [ 104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
    [ 104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.263747] {SOFTIRQ-ON-W} state was registered at:
    [ 104.264417]   _raw_spin_unlock_irq+0x29/0x4c
    [ 104.265014]   blkg_lookup_create+0xdc/0xe6
    [ 104.265609]   bio_associate_blkg_from_css+0xd3/0x13f
    [ 104.266312]   bio_associate_blkg+0x15a/0x1bb
    [ 104.266913]   pool_map+0xe8/0x103 [dm_thin_pool]
    [ 104.267572]   __map_bio+0x98/0x29c [dm_mod]
    [ 104.268162]   __split_and_process_non_flush+0x29e/0x306 [dm_mod]
    [ 104.269003]   __split_and_process_bio+0x16a/0x25b [dm_mod]
    [ 104.269971]   __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
    [ 104.270973]   generic_make_request+0x3f5/0x68b
    [ 104.271676]   process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
    [ 104.272531]   schedule_zero+0x239/0x273 [dm_thin_pool]
    [ 104.273245]   process_cell+0x60c/0x6f1 [dm_thin_pool]
    [ 104.273967]   do_worker+0x60c/0xca8 [dm_thin_pool]
    [ 104.274635]   process_one_work+0x4eb/0x834
    [ 104.275203]   worker_thread+0x318/0x484
    [ 104.275740]   kthread+0x1d1/0x1e1
    [ 104.276203]   ret_from_fork+0x3a/0x50
    [ 104.276714] irq event stamp: 170003
    [ 104.277201] hardirqs last enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55
    [ 104.280273] softirqs last enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73
    [ 104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d
    [ 104.282744]
    [ 104.282744] other info that might help us debug this:
    [ 104.283640]  Possible unsafe locking scenario:
    [ 104.283640]
    [ 104.284452]        CPU0
    [ 104.284803]        ----
    [ 104.285150]   lock(&(&pool->lock)->rlock#3);
    [ 104.285762]   <Interrupt>
    [ 104.286130]     lock(&(&pool->lock)->rlock#3);
    [ 104.286750]
    [ 104.286750]  *** DEADLOCK ***
    [ 104.286750]
    [ 104.287564] no locks held by swapper/49/0.
    [ 104.288129]
    [ 104.288129] stack backtrace:
    [ 104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
    [ 104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
    [ 104.290858] Call Trace:
    [ 104.291204]  <IRQ>
    [ 104.291502]  dump_stack+0x9a/0xe6
    [ 104.291968]  mark_lock+0x56c/0x7a6
    [ 104.292442]  ? check_usage_backwards+0x209/0x209
    [ 104.293086]  __lock_acquire+0x400/0x15bf
    [ 104.293662]  ? check_chain_key+0x150/0x1aa
    [ 104.294236]  lock_acquire+0x1a6/0x1e3
    [ 104.294768]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.295444]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.296143]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.297031]  _raw_spin_lock_irqsave+0x46/0x55
    [ 104.297659]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298335]  thin_endio+0xcf/0x2a3 [dm_thin_pool]
    [ 104.298997]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.299886]  ? check_flags+0x20a/0x20a
    [ 104.300408]  ? lock_acquire+0x1a6/0x1e3
    [ 104.300954]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
    [ 104.301865]  clone_endio+0x1bb/0x22d [dm_mod]
    [ 104.302491]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.303200]  ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.303836]  ? bio_endio+0x2b2/0x2da
    [ 104.304349]  clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.304978]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.305709]  ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.306333]  ? bio_endio+0x2b2/0x2da
    [ 104.306853]  clone_endio+0x1f3/0x22d [dm_mod]
    [ 104.307476]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
    [ 104.308185]  ? bio_disassociate_blkg+0xc6/0x15f
    [ 104.308817]  ? bio_endio+0x2b2/0x2da
    [ 104.309319]  blk_update_request+0x2de/0x4cc
    [ 104.309927]  blk_mq_end_request+0x2a/0x183
    [ 104.310498]  blk_done_softirq+0x16a/0x1a6
    [ 104.311051]  ? blk_softirq_cpu_dead+0xe2/0xe2
    [ 104.311653]  ? __lock_is_held+0x2a/0x87
    [ 104.312186]  __do_softirq+0x250/0x4e8
    [ 104.312705]  irq_exit+0x7e/0x11d
    [ 104.313157]  call_function_single_interrupt+0xf/0x20
    [ 104.313860]  </IRQ>
    [ 104.314163] RIP: 0010:native_safe_halt+0x2/0x3
    [ 104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
    [ 104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
    [ 104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
    [ 104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
    [ 104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
    [ 104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
    [ 104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
    [ 104.323307]  ? lockdep_hardirqs_on+0x26b/0x278
    [ 104.323927]  default_idle+0xd9/0x1a8
    [ 104.324427]  do_idle+0x162/0x2b2
    [ 104.324891]  ? arch_cpu_idle_exit+0x28/0x28
    [ 104.325467]  ? mark_held_locks+0x28/0x7f
    [ 104.326031]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
    [ 104.326719]  cpu_startup_entry+0x1d/0x1f
    [ 104.327261]  start_secondary+0x2cb/0x308
    [ 104.327806]  ? set_cpu_sibling_map+0x8a3/0x8a3
    [ 104.328421]  secondary_startup_64+0xa4/0xb0

Fixes: b978962ad4f7f9 ("blkcg: update blkg_lookup_create() to do locking")
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
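The fix itself is small; a sketch of the irqsave variant (close to, but not necessarily identical to, the actual diff):

    struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
                                        struct request_queue *q)
    {
        struct blkcg_gq *blkg = blkg_lookup(blkcg, q);
        unsigned long flags;

        if (unlikely(!blkg)) {
            /* callers such as pool_map() may already have irqs disabled */
            spin_lock_irqsave(&q->queue_lock, flags);
            blkg = __blkg_lookup_create(blkcg, q);
            spin_unlock_irqrestore(&q->queue_lock, flags);
        }
        return blkg;
    }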
2018-12-12  blkcg: handle dying request_queue when associating a blkg  (Dennis Zhou, 1 file, -0/+6)
Between v3 [1] and v4 [2] of the blkg association series, the association point moved from generic_make_request_checks(), which is called after the request enters the queue, to bio_set_dev(), which is when the bio is formed before submit_bio(). When the request_queue goes away, the blkgs supporting the request_queue are destroyed and then the q->root_blkg is set to %NULL.

This patch adds a %NULL check to blkg_tryget_closest() to prevent the NPE caused by the above. It also adds a guard to see if the request_queue is dying when creating a blkg to prevent creating a blkg for a dead request_queue.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[2] https://lore.kernel.org/lkml/20181126211946.77067-1-dennis@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Reported-and-tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
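A sketch of the two guards described above (hedged; the exact placement in the tree may differ):

    /* blkg_tryget_closest(): q->root_blkg may already be %NULL */
    while (blkg && !percpu_ref_tryget(&blkg->refcnt))
        blkg = blkg->parent;
    return blkg;

    /* blkg creation: refuse to create a blkg on a dying request_queue */
    if (blk_queue_dying(q))
        return ERR_PTR(-ENODEV);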