path: root/block
Age  Commit message  Author  Files  Lines
2011-05-18block: don't delay blk_run_queue_asyncShaohua Li1-1/+3
Let's check a scenario: 1. blk_delay_queue(q, SCSI_QUEUE_DELAY); 2. blk_run_queue_async(); the second one will become a no-op, because q->delay_work already has WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after SCSI_QUEUE_DELAY. But blk_run_queue_async actually wants the delayed work to run immediately. Fix this by doing a cancel on potentially pending delayed work before queuing an immediate run of the workqueue. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
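The scenario above can be sketched as a small userspace model. This is not kernel code: `struct model_delayed_work` and the `model_*` helpers are hypothetical stand-ins for the kernel's delayed work item and its queue/cancel calls, and the delay value used with them is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of a delayed work item; not the kernel's struct delayed_work. */
struct model_delayed_work {
    bool pending;        /* stands in for WORK_STRUCT_PENDING_BIT */
    unsigned long delay; /* delay the pending work was queued with */
};

/* Queues the work unless it is already pending, in which case nothing happens. */
static bool model_queue_delayed_work(struct model_delayed_work *w, unsigned long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->delay = delay;
    return true;
}

/* Drops a pending timer so the work can be queued again. */
static void model_cancel_delayed_work(struct model_delayed_work *w)
{
    w->pending = false;
}

/* Before the fix: the immediate run is lost if a delayed run is pending. */
static void run_queue_async_old(struct model_delayed_work *w)
{
    model_queue_delayed_work(w, 0);
}

/* After the fix: cancel first, then queue an immediate run. */
static void run_queue_async_new(struct model_delayed_work *w)
{
    model_cancel_delayed_work(w);
    bool queued = model_queue_delayed_work(w, 0);
    assert(queued); /* cannot fail: nothing is pending after the cancel */
    (void)queued;
}
```

In this model, calling `run_queue_async_old()` after a delayed queueing leaves the original delay in place, while `run_queue_async_new()` always ends with a zero-delay run pending.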
2011-05-16blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroupVivek Goyal4-11/+19
Currently we first map the task to cgroup and then cgroup to blkio_cgroup. There is a more direct way to get to blkio_cgroup from task using task_subsys_state(). Use that. The real reason for the fix is that it also avoids a race in generic cgroup code. During remount/umount rebind_subsystems() is called and it can do the following without any rcu protection. cgrp->subsys[i] = NULL; That means if somebody got hold of cgroup under rcu and then tried to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL, which is wrong. I was running into this race condition with ltp running on an upstream derived kernel and that led to a crash. So ideally we should also fix cgroup generic code to wait for an rcu grace period before setting the pointer to NULL. Li Zefan is not very keen on introducing synchronize_wait() as he thinks it will slow down mount/remount/umount operations. So for the time being at least fix the kernel crash by taking a more direct route to blkio_cgroup. One tester had reported a crash while running LTP on a derived kernel, and with this fix the crash is no longer seen while the test has been running for over 6 days. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-21block: don't propagate unlisted DISK_EVENTs to userlandTejun Heo1-2/+6
DISK_EVENT_MEDIA_CHANGE is used both as a userland visible event and as an internal event for revalidation of removable devices. Some legacy drivers don't implement proper event detection and continuously generate events under certain circumstances. For example, ide-cd generates media changed continuously if there's no media in the drive, which can lead to an infinite loop of events jumping back and forth between the driver and the userland event handler. This patch updates the disk event infrastructure such that it never propagates events not listed in disk->events to userland. Those events are processed the same for internal purposes but uevent generation is suppressed. This also ensures that userland only gets events which are advertised in the @events sysfs node, lowering the risk of confusion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
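The filtering rule described above amounts to masking pending events with the advertised set before anything is reported to userland. A minimal userspace sketch follows; the `MODEL_EVENT_*` bit values, `struct model_disk` and `model_disk_event()` are hypothetical names for illustration and are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical event bits for this sketch; the kernel's DISK_EVENT_* values may differ. */
#define MODEL_EVENT_MEDIA_CHANGE   (1u << 0)
#define MODEL_EVENT_EJECT_REQUEST  (1u << 1)

struct model_disk {
    unsigned int events;    /* events advertised to userland */
    unsigned int handled;   /* events seen by internal processing */
    unsigned int reported;  /* events turned into userland notifications */
};

/* Every pending event is handled internally; only advertised ones are reported. */
static void model_disk_event(struct model_disk *disk, unsigned int pending)
{
    unsigned int visible = pending & disk->events;

    assert((visible & ~disk->events) == 0); /* never report an unlisted event */
    disk->handled |= pending;
    disk->reported |= visible;
}
```

With a disk that advertises only the eject-request bit, a media-change event is still recorded in `handled` but never reaches `reported`.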
2011-04-21elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case tooJens Axboe1-1/+2
The sort insert is the one that goes to the IO scheduler. With the SORT_MERGE addition, we could bypass IO scheduler setup but still ask the IO scheduler to insert the request. This would cause an oops on switching IO schedulers through the sysfs interface, unless the disk just happened to be idle while it occurred. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19block: Remove the extra check in queue_requests_storeTao Ma1-2/+2
In queue_requests_store, the code looks like this:

    if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
        blk_set_queue_full(q, BLK_RW_SYNC);
    } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
        blk_clear_queue_full(q, BLK_RW_SYNC);
        wake_up(&rl->wait[BLK_RW_SYNC]);
    }

If the "if" condition is not satisfied, then rl->count[BLK_RW_SYNC] < q->nr_requests, which is the same as rl->count[BLK_RW_SYNC]+1 <= q->nr_requests. Every case that reaches the "else" therefore also satisfies the "else if" check, so the check isn't actually needed. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
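The equivalence claimed above can be checked with a small userspace sketch. The two functions below are hypothetical models that return the resulting "queue full" state instead of calling the kernel helpers; they assume plain `int` counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Original shape: the else branch re-checks a condition that already holds. */
static bool queue_full_with_check(int count, int nr_requests, bool was_full)
{
    if (count >= nr_requests)
        return true;                 /* models blk_set_queue_full() */
    else if (count + 1 <= nr_requests)
        return false;                /* models blk_clear_queue_full() */

    /* For integers, count < nr_requests implies count + 1 <= nr_requests. */
    assert(0 && "unreachable");
    return was_full;
}

/* Simplified shape: a plain else is enough. */
static bool queue_full_simplified(int count, int nr_requests)
{
    return count >= nr_requests;
}
```

Both functions agree for every integer pair, which is why the extra check can be dropped without changing behaviour.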
2011-04-19block, blk-sysfs: Fix an err return path in blk_register_queue()Liu Yuan1-1/+3
We do not call blk_trace_remove_sysfs() in the err return path if kobject_add() fails. This patch fixes it. Cc: stable@kernel.org Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19block: remove stale kerneldoc member from __blk_run_queue()Jens Axboe1-1/+0
We don't pass in a 'force_kblockd' anymore, get rid of the stale comment. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19block: get rid of QUEUE_FLAG_REENTERJens Axboe2-10/+2
We are currently using this flag to check whether it's safe to call into ->request_fn(). If it is set, we punt to kblockd. But we get a lot of false positives and excessive punts to kblockd, which hurts performance. The only real abuser of this infrastructure is SCSI. So export the async queue run and convert SCSI over to use that. There's room for improvement in that SCSI need not always use the async call, but this fixes our performance issue and they can fix that up in due time. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-19cfq-iosched: read_lock() does not always imply rcu_read_lock()Jens Axboe1-14/+6
For some configurations of CONFIG_PREEMPT that is not true. So get rid of __call_for_each_cic() and always use the explicitly rcu_read_lock() protected call_for_each_cic() instead. This fixes a potential bug related to IO scheduler removal or online switching. Thanks to Paul McKenney for clarifying this. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18block: kill blk_flush_plug_list() exportJens Axboe1-1/+0
With all drivers and file systems converted, we only have in-core use of this function. So remove the export. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18block: add blk_run_queue_asyncChristoph Hellwig6-20/+33
Instead of overloading __blk_run_queue to force an offload to kblockd add a new blk_run_queue_async helper to do it explicitly. I've kept the blk_queue_stopped check for now, but I suspect it's not needed as the check we do when the workqueue items runs should be enough. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18block: blk_delay_queue() should use kblockd workqueueJens Axboe1-1/+2
Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18block: drop queue lock before calling __blk_run_queue() for kblockd puntJens Axboe1-8/+25
If we know we are going to punt to kblockd, we can drop the queue lock before calling into __blk_run_queue() since it only does a safe bit test and a workqueue call. Since kblockd needs to grab this very lock as one of the first things it does, it's a good optimization to drop the lock before waking kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18Revert "block: add callback function for unplug notification"Jens Axboe2-19/+0
MD can't use this since it really requires us to be able to keep more than a single piece of state for the unplug. Commit 048c9374 added the required support for MD, so get rid of this now unused code. This reverts commit f75664570d8b75469cc468f23c2b27220984983b. Conflicts: block/blk-core.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-18block: Enhance new plugging support to support general callbacksNeilBrown1-0/+20
md/raid requires an unplug callback, but as it does not use requests the current code cannot provide one. So allow arbitrary callbacks to be attached to the blk_plug. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-16block: make unplug timer trace event correspond to the schedule() unplugJens Axboe1-6/+12
It's a pretty close match to what we had before - the timer triggering would mean that nobody unplugged the plug in due time, in the new scheme this matches very closely what the schedule() unplug now is. It's essentially the difference between an explicit unplug (IO unplug) or an implicit unplug (timer unplug, we scheduled with pending IO queued). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-15block: only force kblockd unplugging from the schedule() pathJens Axboe1-6/+7
For the explicit unplugging, we'd prefer to kick things off immediately and not pay the penalty of the latency to switch to kblockd. So let blk_finish_plug() do the run inline, while the implicit-on-schedule-out unplug will punt to kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-15block: cleanup the block plug helper functionsChristoph Hellwig1-18/+6
It's a bit of a mess currently. task->plug is being cleared and reset in __blk_finish_plug(), and blk_finish_plug() is testing for a NULL plug which cannot happen even from schedule() anymore since it uses blk_needs_flush_plug() to determine whether to call into this function at all. So get rid of some of the cruft. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-13block, blk-sysfs: Use the variable directly instead of a function callLiu Yuan1-2/+1
In the function blk_register_queue(), var _dev_ is already assigned by disk_to_dev(). So use it directly instead of calling disk_to_dev() again. Signed-off-by: Liu Yuan <tailai.ly@taobao.com> Modified by me to delete an empty line in the same function while in there anyway. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: move queue run on unplug to kblockdJens Axboe1-1/+1
There are worries that we are now consuming a lot more stack in some cases, since we potentially call into IO dispatch from schedule() or io_schedule(). We can reduce this problem by moving the running of the queue to kblockd, like the old plugging scheme did as well. This may or may not be a good idea from a performance perspective, depending on how many tasks have queue plugs running at the same time. For even the slightly contended case, doing just a single queue run from kblockd instead of multiple runs directly from the unpluggers will be faster. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: kill queue_sync_plugs()Jens Axboe1-14/+0
The original use for this dates back to when we had to track write requests for serializing around barriers. That's not needed anymore, so kill it. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: readd plug trace eventJens Axboe1-1/+9
This was removed with the queue plug state. But we can easily readd by checking if this is the first request going to this queue. It's good information to have when tracing to see how effective the plugging is. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: add callback function for unplug notificationJens Axboe2-0/+19
MD would like to know when a queue is unplugged, so it can flush its bitmap writes. Add such a callback. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: add comment on why we save and disable interrupts in flush_plug_list()Jens Axboe1-0/+5
It's done at the top to avoid doing it for every queue we unplug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12block: fixup block IO unplug trace callJens Axboe1-2/+13
It was removed with the on-stack plugging, readd it and track the depth of requests added when flushing the plug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-11block: splice plug list to local contextNeilBrown1-5/+9
If the request_fn ends up blocking, we could be re-entering the plug flush. Since the list is protected by explicitly not allowing schedule events, this isn't a terribly good idea. Additionally, it can cause us to recurse. As request_fn called by __blk_run_queue is allowed to 'schedule()' (after dropping the queue lock of course), it is possible to get a recursive call: schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list -> __blk_run_queue -> request_fn -> schedule We must make sure that the second schedule does not call into blk_flush_plug again. So instead of leaving the list of requests on blk_plug->list, move them to a separate list leaving blk_plug->list empty. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
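The recursion described above, and the effect of moving the requests to a separate list, can be sketched in userspace. Everything here is a hypothetical model: `struct model_plug`, `model_request_fn()` and `model_flush()` stand in for the plug, the driver's request_fn, and the flush path, and the nested call in `model_request_fn()` models a schedule() that re-enters the flush.

```c
#include <assert.h>
#include <stddef.h>

struct model_req {
    struct model_req *next;
};

struct model_plug {
    struct model_req *head; /* pending requests, like blk_plug->list */
    int depth;              /* current nesting level of model_flush() */
    int dispatched;         /* number of requests handed to request_fn */
};

static void model_flush(struct model_plug *plug);

/* Stands in for request_fn: it may block, and blocking re-enters the flush. */
static void model_request_fn(struct model_plug *plug, struct model_req *rq)
{
    (void)rq;
    plug->dispatched++;
    model_flush(plug); /* models schedule() calling back into the plug flush */
}

static void model_flush(struct model_plug *plug)
{
    /* Move the requests to a local list so the plug list is left empty. */
    struct model_req *local = plug->head;
    plug->head = NULL;

    plug->depth++;
    while (local) {
        struct model_req *rq = local;
        local = rq->next;
        rq->next = NULL;
        /* Only the outermost flush ever finds work to dispatch. */
        assert(plug->depth == 1);
        model_request_fn(plug, rq);
    }
    plug->depth--;
}
```

Because the nested call sees an empty list, it returns immediately and each request is dispatched exactly once.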
2011-04-07Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6Linus Torvalds6-7/+7
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings
2011-04-05block: fix request sorting at unplugKonstantin Khlebnikov1-1/+1
The comparison function for list_sort() must be anticommutative, otherwise the result is not a sort in the ordinary sense. Fortunately list_sort() only ever checks ((*cmp)(priv, a, b) <= 0) and does not distinguish between negative and zero, so the comparison function can implement just less-or-equal instead of a full three-way comparison. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
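A less-or-equal comparator of the kind described above can be illustrated with a userspace sketch. The insertion sort here is only a stand-in that, like the list_sort() behaviour described in the message, tests nothing but `cmp(a, b) <= 0`; `struct model_rq`, its integer `queue_id`, and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct model_rq {
    int queue_id; /* stands in for the request's queue */
    int seq;      /* submission order, used to observe stability */
};

/* Returns 0 when a may stay ahead of b, 1 when b has to move ahead of a. */
static int model_plug_rq_cmp(const struct model_rq *a, const struct model_rq *b)
{
    return !(a->queue_id <= b->queue_id);
}

/* Stable insertion sort that only ever tests "cmp(a, b) <= 0". */
static void model_sort(struct model_rq *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        struct model_rq key = v[i];
        size_t j = i;

        while (j > 0 && !(model_plug_rq_cmp(&v[j - 1], &key) <= 0)) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}
```

Requests for the same queue end up adjacent and keep their submission order, even though the comparator never returns a negative value.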
2011-04-05dm: improve block integrity supportMike Snitzer1-1/+11
The current block integrity (DIF/DIX) support in DM is verifying that all devices' integrity profiles match during DM device resume (which is past the point of no return). To some degree that is unavoidable (stacked DM devices force this late checking). But for most DM devices (which aren't stacking on other DM devices) the ideal time to verify all integrity profiles match is during table load. Introduce the notion of an "initialized" integrity profile: a profile that was blk_integrity_register()'d with a non-NULL 'blk_integrity' template. Add blk_integrity_is_initialized() to allow checking if a profile was initialized. Update DM integrity support to: - check all devices with _initialized_ integrity profiles match during table load; uninitialized profiles (e.g. for underlying DM device(s) of a stacked DM device) are ignored. - disallow a table load that would result in an integrity profile that conflicts with a DM device's existing (in-use) integrity profile - avoid clearing an existing integrity profile - validate all integrity profiles match during resume; but if they don't all we can do is report the mismatch (during resume we're past the point of no return) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-05blk-throttle: don't call xchg on boolAndreas Schwab1-2/+2
xchg does not work portably with smaller than 32bit types. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-05block: make the flush insertion use the tail of the dispatch listJens Axboe1-2/+2
It's not a preempt type request, in fact we have to insert it behind requests that do specify INSERT_FRONT. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-05block: get rid of elv_insert() interfaceJens Axboe2-22/+17
Merge it with __elv_add_request(), it's pretty pointless to have a function with only two callers. The main interface is elv_add_request()/__elv_add_request(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-05block: dump request state on seeing a corrupted request completionJens Axboe1-1/+1
Currently we just dump a non-informative 'request botched' message. Let's actually try and print something sane to help debug issues around this. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-31Fix common misspellingsLucas De Marchi6-7/+7
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-25block: fix issue with calling blk_stop_queue() from the request_fn handlerJens Axboe1-1/+1
When the queue work handler was converted to delayed work, the stopping was inadvertently made sync as well. Change this back to being async stop, using __cancel_delayed_work() instead of cancel_delayed_work(). Reported-by: Jeremy Fitzhardinge <jeremy@goop.org> Reported-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-25block: fix bug with inserting flush requests as sort/mergeJens Axboe1-1/+4
With the introduction of the on-stack plugging, we would assume that any request being inserted was a normal file system request. As flush/fua requires a special insert mode, this caused problems. Fix this up by checking for this in flush_plug_list() and use the appropriate insert mechanism. Big thanks goes to Markus Tripplesdorf for tirelessly testing patches, and to Sergey Senozhatsky for helping find the real issue. Reported-by: Markus Tripplesdorf <markus@trippelsdorf.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds17-656/+955
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-23cfq-iosched: removing unnecessary think time checkingLi, Shaohua1-9/+4
Remove think time checking. A high thinktime queue might mean the queue dispatches several requests and then goes away. Limiting such a queue seems meaningless. This also simplifies the code. This was suggested by Vivek. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-23cfq-iosched: Don't clear queue stats when preempt.Justin TerAvest1-22/+17
For v2, I added back lines to cfq_preempt_queue() that were removed during updates for accounting unaccounted_time. Thanks for pointing out that I'd missed these, Vivek. Previous commit "cfq-iosched: Don't set active queue in preempt" wrongly cleared stats for preempting queues when it shouldn't have, because when we choose a queue to preempt, it still isn't necessarily scheduled next. Thanks to Vivek Goyal for figuring this out and understanding how the preemption code works. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-22blk-throttle: Reset group slice when limits are changedVivek Goyal1-1/+24
Lina reported that if throttle limits are initially very high and then dropped, no new bio might be dispatched for a long time. The reason is that after dropping the limits we don't reset the existing slice, so we do the rate calculation with the new low rate while still accounting the bios dispatched at the high rate. To fix it, reset the slice upon rate change. https://lkml.org/lkml/2011/3/10/298 Another problem with a very high limit is that we never queued the bio on the throtl service tree. That means we kept on extending the group slice but never trimmed it. Fix that also by regularly trimming the slice even if no bio is being queued up. Reported-by: Lina Lu <lulina_nuaa@foxmail.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-22blk-cgroup: Only give unaccounted_time under debugJustin TerAvest1-10/+10
This change moves unaccounted_time to only be reported when CONFIG_DEBUG_BLK_CGROUP is true. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-22cfq-iosched: Don't set active queue in preemptJustin TerAvest1-16/+23
Commit "Add unaccounted time to timeslice_used" changed the behavior of cfq_preempt_queue to set cfqq active. Vivek pointed out that other preemption rules might get involved, so we shouldn't manually set which queue is active. This cleans up the code to just clear the queue stats at preemption time. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-21block: attempt to merge with existing requests on plug flushJens Axboe4-4/+58
One of the disadvantages of on-stack plugging is that we potentially lose out on merging since all pending IO isn't always visible to everybody. When we flush the on-stack plugs, right now we don't do any checks to see if potential merge candidates could be utilized. Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE. It works just like ELEVATOR_INSERT_SORT, but first checks whether we can merge with an existing request before doing the insertion (if we fail merging). This fixes a regression with multiple processes issuing IO that can be merged. Thanks to Shaohua Li <shaohua.li@intel.com> for testing and fixing an accounting bug. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds1-3/+20
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits) [SCSI] scsi_dh_rdac: Add MD36xxf into device list [SCSI] scsi_debug: add consecutive medium errors [SCSI] libsas: fix ata list corruption issue [SCSI] hpsa: export resettable host attribute [SCSI] hpsa: move device attributes to avoid forward declarations [SCSI] scsi_debug: Logical Block Provisioning (SBC3r26) [SCSI] sd: Logical Block Provisioning update [SCSI] Include protection operation in SCSI command trace [SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try) [SCSI] target: Fix volume size misreporting for volumes > 2TB [SCSI] bnx2fc: Broadcom FCoE offload driver [SCSI] fcoe: fix broken fcoe interface reset [SCSI] fcoe: precedence bug in fcoe_filter_frames() [SCSI] libfcoe: Remove stale fcoe-netdev entries [SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h [SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument [SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs [SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out" [SCSI] libfc: Fixing a memory leak when destroying an interface [SCSI] megaraid_sas: Version and Changelog update ... Fix up trivial conflicts due to whitespace differences in drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}
2011-03-17cfq-iosched: Don't update group weights when on service treeJustin TerAvest1-12/+41
Version 3 is updated to apply to for-2.6.39/core. For version 2, I took Vivek's advice and made sure we update the group weight from cfq_group_service_tree_add(). If a weight was updated while a group is on the service tree, the calculation for the total weight of the service tree can be adjusted improperly, which either leads to bad service tree weights, or potentially crashes (if total_weight becomes 0). This patch defers updates to the weight until a group is off the service tree. Signed-off-by: Justin TerAvest <teravest@google.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
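The deferral described above can be sketched in userspace. The names below (`struct model_group`, `struct model_tree`, and the `model_*` functions) are hypothetical and only model the bookkeeping: a weight change is recorded but applied only when the group is next added to the service tree, so the tree's total always matches the weights it actually added.

```c
#include <assert.h>
#include <stdbool.h>

struct model_group {
    unsigned int weight;     /* weight currently counted in the tree total */
    unsigned int new_weight; /* requested weight, not yet applied */
    bool needs_update;
    bool on_tree;
};

struct model_tree {
    unsigned int total_weight;
};

/* Records the request only; the active weight is left untouched. */
static void model_set_weight(struct model_group *g, unsigned int w)
{
    g->new_weight = w;
    g->needs_update = true;
}

/* Applies any deferred weight, then accounts the group in the total. */
static void model_tree_add(struct model_tree *t, struct model_group *g)
{
    assert(!g->on_tree);
    if (g->needs_update) {
        g->weight = g->new_weight;
        g->needs_update = false;
    }
    t->total_weight += g->weight;
    g->on_tree = true;
}

/* Subtracts exactly what model_tree_add() added. */
static void model_tree_del(struct model_tree *t, struct model_group *g)
{
    assert(g->on_tree && t->total_weight >= g->weight);
    t->total_weight -= g->weight;
    g->on_tree = false;
}
```

If the weight were changed directly while the group sat on the tree, the later subtraction would no longer match the earlier addition, which is the kind of miscalculation the message describes.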
2011-03-12blk-cgroup: Add unaccounted time to timeslice_used.Justin TerAvest4-14/+41
There are two kinds of time that tasks are not charged for: the first seek and the extra time slice used over the allocated timeslice. Both of these are exported as a new unaccounted_time stat. I think it would be good to have this reported in 'time' as well, but that is probably a separate discussion. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-11block: remove obsolete comments for blkdev_issue_zeroout.Tao Ma1-2/+0
barrier is already removed, so remove the obsolete comments in blkdev_issue_zeroout. Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-11block: fix mis-synchronisation in blkdev_issue_zeroout()Lukas Czerner1-12/+7
BZ29402 https://bugzilla.kernel.org/show_bug.cgi?id=29402 We can hit serious mis-synchronization in the bio completion path of blkdev_issue_zeroout(), leading to a panic. The problem is that when we are about to call wait_for_completion() in blkdev_issue_zeroout(), we check whether bb.done equals issued (the number of submitted bios). If it does, we can skip the wait_for_completion() and just return from the function since there is nothing to wait for. However, there is an ordering problem because bio_batch_end_io() calls atomic_inc(&bb->done) before complete(), hence it might seem to blkdev_issue_zeroout() that all bios have been completed, and it exits. At this point, when bio_batch_end_io() is going to call complete(bb->wait), bb and wait no longer exist since they were allocated on the stack in blkdev_issue_zeroout() ==> panic!

    (thread 1)                      (thread 2)
    bio_batch_end_io()              blkdev_issue_zeroout()
      if(bb) {                      ...
        if (bb->end_io)             ...
          bb->end_io(bio, err);     ...
        atomic_inc(&bb->done);      ...
        ...                         while (issued != atomic_read(&bb.done))
        ...                         (let issued == bb.done)
        ...                         (do the rest of the function)
        ...                         return ret;
        complete(bb->wait);
        ^^^^^^^^
        panic

We can fix this easily by simplifying bio_batch and completion counting. Also remove bio_end_io_t *end_io since it is not used. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reported-by: Eric Whitney <eric.whitney@hp.com> Tested-by: Eric Whitney <eric.whitney@hp.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
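The log does not spell out the new counting scheme, so the following is only one possible scheme that avoids the ordering problem, sketched in userspace with C11 atomics. It is an assumption, not the patch itself: the submitter holds one reference of its own, each submitted bio takes another, and whoever drops the count to zero is the only party that signals completion. `struct model_batch` and the `model_batch_*` functions are hypothetical, and the `completed` flag stands in for complete().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_batch {
    atomic_int done; /* outstanding references: submitter + in-flight bios */
    bool completed;  /* stands in for complete() having been called */
};

static void model_batch_init(struct model_batch *bb)
{
    atomic_init(&bb->done, 1); /* the submitter's own reference */
    bb->completed = false;
}

/* Called once per submitted bio. */
static void model_batch_submit(struct model_batch *bb)
{
    atomic_fetch_add(&bb->done, 1);
}

/* Completion side: only the last reference holder signals the waiter. */
static void model_batch_end_io(struct model_batch *bb)
{
    int prev = atomic_fetch_sub(&bb->done, 1);

    assert(prev >= 1);
    if (prev == 1)
        bb->completed = true;
}

/* Submitter side: drop the own reference; wait only if bios are in flight. */
static bool model_batch_must_wait(struct model_batch *bb)
{
    return atomic_fetch_sub(&bb->done, 1) != 1;
}
```

In this scheme the submitter returns without waiting only when no bio still holds a reference, so no completion handler can touch the on-stack batch after the function has returned.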
2011-03-10Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/coreJens Axboe10-315/+323
Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10blk-throttle: Use blk_plug in throttle dispatchVivek Goyal1-0/+3
Use plug in throttle dispatch also as we are dispatching a bunch of bios in throttle context and some of them might merge. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>