summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2013-07-01Merge tag 'v3.10' into sched/coreIngo Molnar885-5164/+13534
Merge in a recent upstream commit: c2853c8df57f include/linux/math64.h: add div64_ul() because: 72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long relies on it. [ We don't rebase sched/core for this, because the handful of followup commits after the broken commit are not behavioral changes so are unlikely to be needed during bisection. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-30Linux 3.10v3.10Linus Torvalds1-1/+1
2013-06-30Merge branch 'merge' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull another powerpc fix from Benjamin Herrenschmidt: "I mentioned that while we had fixed the kernel crashes, EEH error recovery didn't always recover... It appears that I had a fix for that already in powerpc-next (with a stable CC). I cherry-picked it today and did a few tests and it seems that things now work quite well. The patch is also pretty simple, so I see no reason to wait before merging it." * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/eeh: Fix fetching bus for single-dev-PE
2013-06-30Merge tag 'scsi-fixes' of ↵Linus Torvalds11-93/+61
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is a set of seven bug fixes. Several fcoe fixes for locking problems, initiator issues and a VLAN API change, all of which could eventually lead to data corruption, one fix for a qla2xxx locking problem which could lead to multiple completions of the same request (and subsequent data corruption) and a use after free in the ipr driver. Plus one minor MAINTAINERS file update" (only six bugfixes in this pull, since I had already pulled the fcoe API fix directly from Robert Love) * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: [SCSI] ipr: Avoid target_destroy accessing memory after it was freed [SCSI] qla2xxx: Fix for locking issue between driver ISR and mailbox routines MAINTAINERS: Fix fcoe mailing list libfc: extend ex_lock to protect all of fc_seq_send libfc: Correct check for initiator role libfcoe: Fix Conflicting FCFs issue in the fabric
2013-06-30powerpc/eeh: Fix fetching bus for single-dev-PEGavin Shan1-1/+2
While running Linux as guest on top of phyp, we possiblly have PE that includes single PCI device. However, we didn't return its PCI bus correctly and it leads to failure on recovery from EEH errors for single-dev-PE. The patch fixes the issue. Cc: <stable@vger.kernel.org> # v3.7+ Cc: Steve Best <sbest@us.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-29Merge branch 'merge' of ↵Linus Torvalds2-7/+14
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc fixes from Ben Herrenschmidt: "We discovered some breakage in our "EEH" (PCI Error Handling) code while doing error injection, due to a couple of regressions. One of them is due to a patch (37f02195bee9 "powerpc/pci: fix PCI-e devices rescan issue on powerpc platform") that, in hindsight, I shouldn't have merged considering that it caused more problems than it solved. Please pull those two fixes. One for a simple EEH address cache initialization issue. The other one is a patch from Guenter that I had originally planned to put in 3.11 but which happens to also fix that other regression (a kernel oops during EEH error handling and possibly hotplug). With those two, the couple of test machines I've hammered with error injection are remaining up now. EEH appears to still fail to recover on some devices, so there is another problem that Gavin is looking into but at least it's no longer crashing the kernel." * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/pci: Improve device hotplug initialization powerpc/eeh: Add eeh_dev to the cache during boot
2013-06-29ARM: dt: Only print warning, not WARN() on bad cpu map in device treeOlof Johansson1-2/+3
Due to recent changes and expecations of proper cpu bindings, there are now cases for many of the in-tree devicetrees where a WARN() will hit on boot due to badly formatted /cpus nodes. Downgrade this to a pr_warn() to be less alarmist, since it's not a new problem. Tested on Arndale, Cubox, Seaboard and Panda ES. Panda hits the WARN without this, the others do not. Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-30powerpc/pci: Improve device hotplug initializationGuenter Roeck1-5/+12
Commit 37f02195b (powerpc/pci: fix PCI-e devices rescan issue on powerpc platform) fixes a problem with interrupt and DMA initialization on hot plugged devices. With this commit, interrupt and DMA initialization for hot plugged devices is handled in the pci device enable function. This approach has a couple of drawbacks. First, it creates two code paths for device initialization, one for hot plugged devices and another for devices known during the initial PCI scan. Second, the initialization code for hot plugged devices is only called when the device is enabled, ie typically in the probe function. Also, the platform specific setup code is called each time pci_enable_device() is called, not only once during device discovery, meaning it is actually called multiple times, once for devices discovered during the initial scan and again each time a driver is re-loaded. The visible result is that interrupt pins are only assigned to hot plugged devices when the device driver is loaded. Effectively this changes the PCI probe API, since pci_dev->irq and the device's dma configuration will now only be valid after pci_enable() was called at least once. A more subtle change is that platform specific PCI device setup is moved from device discovery into the driver's probe function, more specifically into the pci_enable_device() call. To fix the inconsistencies, add new function pcibios_add_device. Call pcibios_setup_device from pcibios_setup_bus_devices if device setup is not complete, and from pcibios_add_device if bus setup is complete. With this change, device setup code is moved back into device initialization, and called exactly once for both static and hot plugged devices. [ This also fixes a regression introduced by the above patch which causes dev->irq to be overwritten under some cirumstances after MSIs have been enabled for the device which leads to crashes due to the MSI core "hijacking" dev->irq to store the base MSI number and not the LSI. --BenH ] Cc: Yuanquan Chen <Yuanquan.Chen@freescale.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Hiroo Matsumoto <matsumoto.hiroo@jp.fujitsu.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds3-13/+14
Pull crypto fix from Herbert Xu: "This fixes a crash in the crypto layer exposed by an SCTP test tool" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: algboss - Hold ref count on larval
2013-06-29Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds1-0/+5
Pull drm/qxl fix from Dave Airlie: "Bad me forgot an access check, possible security issue, but since this is the first kernel with it, should be fine to just put it in now" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: drm/qxl: add missing access check for execbuffer ioctl
2013-06-29Fix: kernel/ptrace.c: ptrace_peek_siginfo() missing __put_user() validationMathieu Desnoyers1-9/+11
This __put_user() could be used by unprivileged processes to write into kernel memory. The issue here is that even if copy_siginfo_to_user() fails, the error code is not checked before __put_user() is executed. Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle, so it has not hit a stable release yet. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Roland McGrath <roland@redhat.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Pedro Alves <palves@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-29Merge branch 'for-linus' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fix from Sage Weil: "This is a recently spotted regression in the snapshot behavior... It turns out several tests weren't being run in the nightlies so this took a while to spot" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: send snapshot context with writes
2013-06-29Merge branch 'for-linus' of ↵Linus Torvalds1-15/+39
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ubifs fixes from Al Viro: "A couple of ubifs readdir/lseek race fixes. Stable fodder, really nasty..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: UBIFS: fix a horrid bug UBIFS: prepare to fix a horrid bug
2013-06-29Merge tag 'for-linus-20130628' of ↵Linus Torvalds2-34/+22
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-mn10300 Pull two MN10300 fixes from David Howells: "The first fixes a problem with passing arrays rather than pointers to get_user() where __typeof__ then wants to declare and initialise an array variable which gcc doesn't like. The second fixes a problem whereby putting mem=xxx into the kernel command line causes init=xxx to get an incorrect value." * tag 'for-linus-20130628' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-mn10300: mn10300: Use early_param() to parse "mem=" parameter mn10300: Allow to pass array name to get_user()
2013-06-29Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "Correct an ordering issue in the tick broadcast code. I really wish we'd get compensation for pain and suffering for each line of code we write to work around dysfunctional timer hardware." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick: Fix tick_broadcast_pending_mask not cleared
2013-06-29Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-7/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "One more fix for a recently discovered bug" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Disable monitoring on setuid processes for regular users
2013-06-29UBIFS: fix a horrid bugArtem Bityutskiy1-3/+27
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are in the middle of 'ubifs_readdir()'. This means that 'file->private_data' can be freed while 'ubifs_readdir()' uses it, and this is a very bad bug: not only 'ubifs_readdir()' can return garbage, but this may corrupt memory and lead to all kinds of problems like crashes an security holes. This patch fixes the problem by using the 'file->f_version' field, which '->llseek()' always unconditionally sets to zero. We set it to 1 in 'ubifs_readdir()' and whenever we detect that it became 0, we know there was a seek and it is time to clear the state saved in 'file->private_data'. I tested this patch by writing a user-space program which runds readdir and seek in parallell. I could easily crash the kernel without these patches, but could not crash it with these patches. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29UBIFS: prepare to fix a horrid bugArtem Bityutskiy1-12/+12
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are in the middle of 'ubifs_readdir()'. First of all, this means that 'file->private_data' can be freed while 'ubifs_readdir()' uses it. But this particular patch does not fix the problem. This patch is only a preparation, and the fix will follow next. In this patch we make 'ubifs_readdir()' stop using 'file->f_pos' directly, because 'file->f_pos' can be changed by '->llseek()' at any point. This may lead 'ubifs_readdir()' to returning inconsistent data: directory entry names may correspond to incorrect file positions. So here we introduce a local variable 'pos', read 'file->f_pose' once at very the beginning, and then stick to 'pos'. The result of this is that when 'ubifs_dir_llseek()' changes 'file->f_pos' while we are in the middle of 'ubifs_readdir()', the latter "wins". Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-28mn10300: Use early_param() to parse "mem=" parameterAkira Takeuchi1-33/+21
This fixes the problem that "init=" options may not be passed to kernel correctly. parse_mem_cmdline() of mn10300 arch gets rid of "mem=" string from redboot_command_line. Then init_setup() parses the "init=" options from static_command_line, which is a copy of redboot_command_line, and keeps the pointer to the init options in execute_command variable. Since the commit 026cee0 upstream (params: <level>_initcall-like kernel parameters), static_command_line becomes overwritten by saved_command_line at do_initcall_level(). Notice that saved_command_line is a command line which includes "mem=" string. As a result, execute_command may point to weird string by the length of "mem=" parameter. I noticed this problem when using the command line like this: mem=128M console=ttyS0,115200 init=/bin/sh Here is the processing flow of command line parameters. start_kernel() setup_arch(&command_line) parse_mem_cmdline(cmdline_p) * strcpy(boot_command_line, redboot_command_line); * Remove "mem=xxx" from redboot_command_line. * *cmdline_p = redboot_command_line; setup_command_line(command_line) <-- command_line is redboot_command_line * strcpy(saved_command_line, boot_command_line) * strcpy(static_command_line, command_line) parse_early_param() strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE); parse_early_options(tmp_cmdline); parse_args("early options", cmdline, NULL, 0, 0, 0, do_early_param); parse_args("Booting ..", static_command_line, ...); init_setup() <-- save the pointer in execute_command rest_init() kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND); At this point, execute_command points to "/bin/sh" string. kernel_init() kernel_init_freeable() do_basic_setup() do_initcalls() do_initcall_level() (*) strcpy(static_command_line, saved_command_line); Here, execute_command gets to point to "200" string !! Signed-off-by: David Howells <dhowells@redhat.com>
2013-06-28mn10300: Allow to pass array name to get_user()Akira Takeuchi1-1/+1
This fixes the following compile error: CC block/scsi_ioctl.o block/scsi_ioctl.c: In function 'sg_scsi_ioctl': block/scsi_ioctl.c:449: error: invalid initializer Signed-off-by: David Howells <dhowells@redhat.com>
2013-06-28sched/debug: Remove CONFIG_FAIR_GROUP_SCHED maskAlex Shi1-4/+6
Now that we are using runnable load avg in sched balance, we don't need to keep it under CONFIG_FAIR_GROUP_SCHED. Also align the code style to #ifdef instead of #if defined() and reorder the tg output info. Signed-off-by: Alex Shi <alex.shi@intel.com> Cc: pjt@google.com Cc: kamalesh@linux.vnet.ibm.com Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1372417835-4698-1-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28sched/debug: Fix formatting of /proc/<PID>/schedKamalesh Babulal1-8/+9
This patch alters format string's width, to align all statistics at par with the longest struct sched_statistic member name under /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/20130627165005.GA15583@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28drm/qxl: add missing access check for execbuffer ioctlDave Airlie1-0/+5
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2013-06-28powerpc/eeh: Add eeh_dev to the cache during bootThadeu Lima de Souza Cascardo1-2/+2
commit f8f7d63fd96ead101415a1302035137a866f8998 ("powerpc/eeh: Trace eeh device from I/O cache") broke EEH on pseries for devices that were present during boot and have not been hotplugged/DLPARed. eeh_check_failure will get the eeh_dev from the cache, and will get NULL. eeh_addr_cache_build adds the addresses to the cache, but eeh_dev for the giving pci_device is not set yet. Just reordering the call to eeh_addr_cache_insert_dev works fine. The ordering is similar to the one in eeh_add_device_late. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-27rbd: send snapshot context with writesJosh Durgin1-1/+5
Sending the right snapshot context with each write is required for snapshots to work. Due to the ordering of calls, the snapshot context is never set for any requests. This causes writes to the current version of the image to be reflected in all snapshots, which are supposed to be read-only. This happens because rbd_osd_req_format_write() sets the snapshot context based on obj_request->img_request. At this point, however, obj_request->img_request has not been set yet, to the snapshot context is set to NULL. Fix this by moving rbd_img_obj_request_add(), which sets obj_request->img_request, before the osd request formatting calls. This resolves: http://tracker.ceph.com/issues/5465 Reported-by: Karol Jurak <karol.jurak@gmail.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2013-06-27sched: Fix typo in struct sched_avg member descriptionKamalesh Babulal1-1/+1
Remove extra 'for' from the description about member of struct sched_avg. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: pjt@google.com Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/20130627060409.GB18582@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/fair: Fix typo describing flags in enqueue_entityKamalesh Babulal1-1/+1
Fix spelling of 'calling' in description of se flags in enqueue_entity(). Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/20130627055418.GA18582@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/debug: Add load-tracking statistics to taskKamalesh Babulal1-0/+6
At present we print per-entity load-tracking statistics for cfs_rq of cgroups/runqueues. Given that per task statistics is maintained, it can be used to know the contribution made by the task to its parenting cfs_rq level. This patch adds per-task load-tracking statistics to /proc/<PID>/sched. Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130625080336.GA20175@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Change get_rq_runnable_load() to static and inlineAlex Shi1-2/+2
Based-on-patch-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-14-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/tg: Remove tg.load_weightAlex Shi1-1/+0
Since no one use it. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-13-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/cfs_rq: Change atomic64_t removed_load to atomic_long_tAlex Shi2-5/+8
Similar to runnable_load_avg, blocked_load_avg variable, long type is enough for removed_load in 64 bit or 32 bit machine. Then we avoid the expensive atomic64 operations on 32 bit machine. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-12-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched/tg: Use 'unsigned long' for load variable in task groupAlex Shi3-11/+11
Since tg->load_avg is smaller than tg->load_weight, we don't need a atomic64_t variable for load_avg in 32 bit machine. The same reason for cfs_rq->tg_load_contrib. The atomic_long_t/unsigned long variable type are more efficient and convenience for them. Signed-off-by: Alex Shi <alex.shi@intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Change cfs_rq load avg to unsigned longAlex Shi3-8/+5
Since the 'u64 runnable_load_avg, blocked_load_avg' in cfs_rq struct are smaller than 'unsigned long' cfs_rq->load.weight. We don't need u64 vaiables to describe them. unsigned long is more efficient and convenience. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-10-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Consider runnable load average in move_tasks()Alex Shi1-9/+9
Aside from using runnable load average in background, move_tasks is also the key function in load balance. We need consider the runnable load average in it in order to make it an apple to apple load comparison. Morten had caught a div u64 bug on ARM, thanks! Thanks-to: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-8-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_taskAlex Shi2-4/+18
They are the base values in load balance, update them with rq runnable load average, then the load balance will consider runnable load avg naturally. We also try to include the blocked_load_avg as cpu load in balancing, but that cause kbuild performance drop 6% on every Intel machine, and aim7/oltp drop on some of 4 CPU sockets machines. Or only add blocked_load_avg into get_rq_runable_load, hackbench still drop a little on NHM EX. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-7-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Update cpu load after task_tickAlex Shi1-1/+1
To get the latest runnable info, we need do this cpuload update after task_tick. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Fix sleep time double accounting in enqueue entityAlex Shi1-1/+7
The woken migrated task will __synchronize_entity_decay(se); in migrate_task_rq_fair, then it needs to set `se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before update_entity_load_avg, in order to avoid sleep time is updated twice for se.avg.load_avg_contrib in both __syncchronize and update_entity_load_avg. However if the sleeping task is woken up from the same cpu, it miss the last_runnable_update before update_entity_load_avg(se, 0, 1), then the sleep time was used twice in both functions. So we need to remove the double sleep time accounting. Paul also contributed some code comments in this commit. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-5-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Set an initial value of runnable avg for new forked taskAlex Shi3-4/+28
We need to initialize the se.avg.{decay_count, load_avg_contrib} for a new forked task. Otherwise random values of above variables cause a mess when a new task is enqueued: enqueue_task_fair enqueue_entity enqueue_entity_load_avg and make fork balancing imbalance due to incorrect load_avg_contrib. Further more, Morten Rasmussen notice some tasks were not launched at once after created. So Paul and Peter suggest giving a start value for new task runnable avg time same as sched_slice(). PeterZ said: > So the 'problem' is that our running avg is a 'floating' average; ie. it > decays with time. Now we have to guess about the future of our newly > spawned task -- something that is nigh impossible seeing these CPU > vendors keep refusing to implement the crystal ball instruction. > > So there's two asymptotic cases we want to deal well with; 1) the case > where the newly spawned program will be 'nearly' idle for its lifetime; > and 2) the case where its cpu-bound. > > Since we have to guess, we'll go for worst case and assume its > cpu-bound; now we don't want to make the avg so heavy adjusting to the > near-idle case takes forever. We want to be able to quickly adjust and > lower our running avg. > > Now we also don't want to make our avg too light, such that it gets > decremented just for the new task not having had a chance to run yet -- > even if when it would run, it would be more cpu-bound than not. > > So what we do is we make the initial avg of the same duration as that we > guess it takes to run each task on the system at least once -- aka > sched_slice(). > > Of course we can defeat this with wakeup/fork bombs, but in the 'normal' > case it should be good enough. Paul also contributed most of the code comments in this commit. Signed-off-by: Alex Shi <alex.shi@intel.com> Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reviewed-by: Paul Turner <pjt@google.com> [peterz; added explanation of sched_slice() usage] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27sched: Move a few runnable tg variables into CONFIG_SMPAlex Shi1-0/+2
The following 2 variables are only used under CONFIG_SMP, so its better to move their definiation into CONFIG_SMP too. atomic64_t load_avg; atomic_t runnable_avg; Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-27Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for ↵Alex Shi4-42/+8
load-tracking" Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then we can use runnable load variables. Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted patch(introduced in 9ee474f), but also need to revert. Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26Merge tag 'fcoe1' into fixesJames Bottomley1575-11275/+19854
This patch fixes a critical bug that was introduced in 3.9 related to VLAN tagging FCoE frames.
2013-06-26Merge tag 'fcoe' into fixesJames Bottomley4-25/+31
3.10 fixes
2013-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds77-231/+5065
Pull networking fixes from David Miller: 1) Found via trinity: If you connect up an ipv6 socket to an ipv4 mapped address then an ipv6 one, sendmsg() can croak because ip6_sk_dst_check() assumes the route cached in the socket is an ipv6 one. In this case there is an ipv4 route attached, so it gets stomped on. Reported by Dave Jones and Hannes Frederic Sowa, fixed by Eric Dumazet. 2) AF_KEY notifications leak some kernel memory to userspace, fix from Mathias Krause. 3) DLCI calls __dev_get_by_name() without proper locking, and dlci_del doesn't validate that the device being deleted is actually a DLCI one. Fixes from Li Zefan. 4) Length check on bluetooth l2cap information responses is wrong, each response type has a different lenth, so we should make sure it's in a given range rather than enforce one single valid length. From Jaganath Kanakkassery. 5) Receive FIFO overflow is really easy to trigger in stress scenerios in the sh_eth driver, but the event isn't being handled properly at all. Specifically, the mask of error interrupts doesn't include the event so we never clear it, resulting in the driver becomming wedged processing an interrupt that never gets cleared. Fix from Sergei Shtylyov. 6) qlcnic sleeps while holding a spinlock, use mdelay() instead of msleep(). From Shahed Shaikh. 7) Missing curly braces causes SIP netfilter NAT module to always drop packets. Fix from Balazs Peter Odor. 8) ipt_ULOG in netfilter passes the wrong value to timer setup, causing the timer to dereference crap when it fires. Fix from Gao Feng. 9) Missing RCU protection around txq->axq_acq traversal in ath_txq_schedule(). Fix from Felix Fietkau. 10) Idle state transition test in ath9k_htc_config() is reversed, fix from Sujith Manoharan. 11) IPV6 forwarding handles unicast Router Alert packets incorrectly. It tests the wrong option state. Previously opt->ra being non-zero indicated a router alert marking in the SKB, but now it's indicated by a bit in opt->flags. Fix from YOSHIFUJI Hideaki. 12) SKB leak in GRE tunnel GSO handling, from Eric Dumazet. 13) get_user_pages_fast() error handling in TUN and MACVTAP use the same local variable for the base index and the loop iterator for page traversal, oops! Fix from Michael S Tsirkin. 14) ipv6_get_lladdr() can fail, and we must therefore check it's return value in inet6_set_iftoken(). For from Hannes Frederic Sowa. 15) If you change an interface name and meanwhile can sneak in something that looks up the name (like SO_BINDTODEVICE or SIOCGIFNAME) we can deadlock with CONFIG_PREEMPT=n. Fix this by providing a helper function that properly uses raw_seqcount_begin(). From Nicolas Schichan. 16) Chain noise calibration test is inverted in iwlwifi, fix from Nikolay Martynov. 17) Properly set TX iwlwifi descriptor flags for back requests. Fix from Emmanuel Grumbach. 18) We can't assume skb_transport_header() is set in xt_TCPOPTSTRAP module, fix from Pablo Neira Ayuso. 19) Some crummy APs don't provide the proper High Throughput info in association response frames. Add a workaround by assume we'll use whatever is in the beacon/probe. Fix from Johannes Berg. 20) mac80211 call to rate_idx_match_mask() swaps two arguments (mask and channel width). Fix from Simon Wunderlich. 21) xt_TCPMSS (like xt_TCPOPTSTRAP) must not try to handle fragmented frames. Fix from Phil Oester. 22) Fix rate control regression causing iwlwifi/iwlegacy chips to use 1Mbit/s on pre-11n networks. From Moshe Benji and Stanslaw Gruszka. 23) Disable brcmsmac power-save functions, they cause regressions. From Arend van Spriel. 24) Enforce a sane minimum MTU in l2cap_build_cmd() otherwise we can easily crash. Fix from Anderson Lizardo. 25) If a learning packet arrives during vxlan_stop() we crash, easily fixed by checking netif_running(). From Stephen Hemminger. 26) Static vxlan FDB entries should not be migrated, also from Stephen. 27) skb_clone() failures not handled in vxlan_xmit(), oops. Also from Stephen. 28) Add minimal driver for AR816x/AR817x ethernet chips, from Johannes Berg. 29) Fix regression in userspace VLAN acceleration control, added by the 802.1ad support changes. Fix from Fernando Luis Vazquez Cao. 30) Interval selection for MLD queries in the bridging code was reversed. Fix from Linus Lüssing. 31) ipv6's ndisc_send_redirect() erroneously writes to the packet we received not the packet we are building to send out. Fix from Matthias Schiffer. 32) Don't free netdev before unregistering it, in usb_8dev can driver. From Marc Kleine-Budde. 33) Fix nl80211 attribute buffer races, from Johannes Berg. 34) Although netlink_diag.h is under uapi/ it isn't present in Kbuild. From Stephen Hemminger. 35) Wrong address and family passed to MD5 key lookups in TCP, from Aydin Arik. 36) phy_type attribute created by SFC driver should not be writable. From Ben Hutchings. 37) Receive/Transmit queue allocations in pxa168_eth and mv643xx_eth should use kzalloc(). Otherwise if setup fails half-way, we'll dereference garbage when trying to teardown the rings. From Lubomir Rintel. 38) Fix double-allocation of dst (resulting in unfreeable net device) in ipv6's init_loopback(). From Gao Feng. 39) Fix fragmentation handling SKB leak in netfilter conntrack, we were freeing the wrong skb pointer. From Phil Oester. 40) Don't report "-1" (SPEED_UNKNOWN) in bond_miimon_commit(), from Nikolay Aleksandrov. 41) davinci_cpdma doesn't check for DMA mapping errors, letting the device scribble to random addresses. From Sebastian Siewior. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits) dlci: validate the net device in dlci_del() dlci: acquire rtnl_lock before calling __dev_get_by_name() af_key: fix info leaks in notify messages ipv6: ip6_sk_dst_check() must not assume ipv6 dst net: fix kernel deadlock with interface rename and netdev name retrieval. net/tg3: Avoid delay during MMIO access ipv6: check return value of ipv6_get_lladdr macvtap: fix recovery from gup errors tun: fix recovery from gup errors gre: fix a possible skb leak ipv6: Process unicast packet with Router Alert by checking flag in skb. ath9k_htc: Handle IDLE state transition properly ath9k: fix an RCU issue in calling ieee80211_get_tx_rates netfilter: ipt_ULOG: fix incorrect setting of ulog timer netfilter: ctnetlink: send event when conntrack label was modified netfilter: nf_nat_sip: fix mangling qlcnic: Do not sleep while holding spinlock drivers: net: cpsw: fix compilation error with cpsw driver tcp: doc : fix the syncookies default value sh_eth: fix misreporting of transmit abort ...
2013-06-26Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds3-20/+20
Pull i915 drm fixes from Dave Airlie: "These should be the last two fixes for i915, one is for a fence leak killing X on some older GPUs, and one is a late regression partial revert for an swiotlb/xen/i915 interaction, Konrad has promised to figure out the proper answer, and this patch is the best thing to do at this stage to avoid regressing" * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: drm/i915: make compact dma scatter lists creation work with SWIOTLB backend. drm/i915: Restore fences after resume and GPU resets
2013-06-26dlci: validate the net device in dlci_del()Zefan Li1-0/+12
We triggered an oops while running trinity with 3.4 kernel: BUG: unable to handle kernel paging request at 0000000100000d07 IP: [<ffffffffa0109738>] dlci_ioctl+0xd8/0x2d4 [dlci] PGD 640c0d067 PUD 0 Oops: 0000 [#1] PREEMPT SMP CPU 3 ... Pid: 7302, comm: trinity-child3 Not tainted 3.4.24.09+ 40 Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA RIP: 0010:[<ffffffffa0109738>] [<ffffffffa0109738>] dlci_ioctl+0xd8/0x2d4 [dlci] ... Call Trace: [<ffffffff8137c5c3>] sock_ioctl+0x153/0x280 [<ffffffff81195494>] do_vfs_ioctl+0xa4/0x5e0 [<ffffffff8118354a>] ? fget_light+0x3ea/0x490 [<ffffffff81195a1f>] sys_ioctl+0x4f/0x80 [<ffffffff81478b69>] system_call_fastpath+0x16/0x1b ... It's because the net device is not a dlci device. Reported-by: Li Jinyue <lijinyue@huawei.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26dlci: acquire rtnl_lock before calling __dev_get_by_name()Zefan Li1-5/+9
Otherwise the net device returned can be freed at anytime. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26af_key: fix info leaks in notify messagesMathias Krause1-0/+2
key_notify_sa_flush() and key_notify_policy_flush() miss to initialize the sadb_msg_reserved member of the broadcasted message and thereby leak 2 bytes of heap memory to listeners. Fix that. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26ipv6: ip6_sk_dst_check() must not assume ipv6 dstEric Dumazet1-1/+7
It's possible to use AF_INET6 sockets and to connect to an IPv4 destination. After this, socket dst cache is a pointer to a rtable, not rt6_info. ip6_sk_dst_check() should check the socket dst cache is IPv6, or else various corruptions/crashes can happen. Dave Jones can reproduce immediate crash with trinity -q -l off -n -c sendmsg -c connect With help from Hannes Frederic Sowa Reported-by: Dave Jones <davej@redhat.com> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26net: fix kernel deadlock with interface rename and netdev name retrieval.Nicolas Schichan4-30/+41
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the rename of a network interface, it can end up waiting for a workqueue to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to the fact that read_secklock_begin() will spin forever waiting for the writer process (the one doing the interface rename) to update the devnet_rename_seq sequence. This patch fixes the problem by adding a helper (netdev_get_name()) and using it in the code handling the SIOCGIFNAME ioctl and SO_BINDTODEVICE setsockopt. The netdev_get_name() helper uses raw_seqcount_begin() to avoid spinning forever, waiting for devnet_rename_seq->sequence to become even. cond_resched() is used in the contended case, before retrying the access to give the writer process a chance to finish. The use of raw_seqcount_begin() will incur some unneeded work in the reader process in the contended case, but this is better than deadlocking the system. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26Merge tag 'regulator-v3.10-rc7' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fix from Mark Brown: "Fix module loading for tps6586x. A simple one liner fix to make module loading work for distros (product specific kernels tend to have things built in)" * tag 'regulator-v3.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: mfd: tps6586x: correct device name of the regulator cell