linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2016-02-22	tools, bpf_asm: simplify parser rule for BPF extensions	Ray Bellis	2	-151/+79
	We can already use yylval in the lexer for encoding the BPF extension number, so that the parser rules can be further reduced to a single one for each B/H/W case. Signed-off-by: Ray Bellis <ray@isc.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-22	netcp: use pointer to fix build fail	Sudip Mukherjee	1	-1/+1
	While building keystone_defconfig of arm we are getting build failure with the error: drivers/net/ethernet/ti/netcp_core.c:1846:31: error: invalid type argument of '->' (have 'struct tc_to_netdev') if (handle != TC_H_ROOT \|\| tc->type != TC_SETUP_MQPRIO) ^ drivers/net/ethernet/ti/netcp_core.c:1851:35: error: invalid type argument of '->' (have 'struct tc_to_netdev') (dev->real_num_tx_queues < tc->tc)) ^ drivers/net/ethernet/ti/netcp_core.c:1855:8: error: invalid type argument of '->' (have 'struct tc_to_netdev') if (tc->tc) { ^ drivers/net/ethernet/ti/netcp_core.c:1856:28: error: invalid type argument of '->' (have 'struct tc_to_netdev') netdev_set_num_tc(dev, tc->tc); ^ drivers/net/ethernet/ti/netcp_core.c:1857:21: error: invalid type argument of '->' (have 'struct tc_to_netdev') for (i = 0; i < tc->tc; i++) ^ drivers/net/ethernet/ti/netcp_core.c: At top level: drivers/net/ethernet/ti/netcp_core.c:1879:2: warning: initialization from incompatible pointer type .ndo_setup_tc = netcp_setup_tc, ^ The callback of ndo_setup_tc should be: int (ndo_setup_tc)(struct net_device dev, u32 handle, __be16 protocol, struct tc_to_netdev *tc); But we missed marking the last argument as a pointer. Fixes: 16e5cc647173 ("net: rework setup_tc ndo op to consume general tc operand") CC: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge branch 'qed-next'	David S. Miller	12	-153/+135
	Yuval Mintz says: ==================== qed*: Driver updates This contains various minor changes to driver - changing memory allocation, fixing a small theoretical bug, as well as some mostly-semantic changes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	qed,qede: Bump driver versions to 8.7.0.0	Yuval Mintz	2	-2/+2
	Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	qed: Introduce DMA_REGPAIR_LE	Yuval Mintz	4	-20/+14
	FW hsi contains regpairs, mostly for 64-bit address representations. Since same paradigm is applied each time a regpair is filled, this introduces a new utility macro for setting such regpairs. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	qed: Change metadata needed for SPQ entries	Yuval Mintz	3	-107/+95
	Each configuration element send via ramrod requires a Slow Path Queue entry. This slightly changes the way such an entry is configured, but contains mostly semantic changes [where more parameters are gathered in a sub-struct instead of being directly passed]. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	qed: Handle possible race in SB config	Yuval Mintz	1	-9/+9
	Due to HW design, some of the memories are wide-bus and access to those needs to be sequentialized on a per-HW-block level; Read/write to a given HW-block might break other read/write to wide-bus memory done at ~same time. Status blocks initialization in CAU is done into such a wide-bus memory. This moves the initialization into using DMAE which is guaranteed to be safe to use on such memories. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	qed: Turn most GFP_ATOMIC into GFP_KERNEL	Yuval Mintz	6	-15/+15
	Initial driver submission used GFP_ATOMIC almost inclusively when allocating memory. We now remedy this point, using GFP_KERNEL where it's possible. Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge branch 'ipvlan-misc'	David S. Miller	3	-31/+36
	Mahesh Bandewar says: ==================== IPvlan misc patches This is a collection of unrelated patches for IPvlan driver. a. crub_skb() changes are added to ensure that the packets hit the NF_HOOKS in masters' ns in L3 mode. b. u16 change is bug fix while c. the third patch is to group tx/rx variables in single cacheline ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	ipvlan: misc changes	Mahesh Bandewar	3	-18/+20
	1. scope correction for few functions that are used in single file. 2. Adjust variables that are used in fast-path to fit into single cacheline 3. Update rcv_frame() to skip shared check for frames coming over wire Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	ipvlan: mode is u16	Mahesh Bandewar	2	-4/+6
	The mode argument was erronusly defined as u32 but it has always been u16. Also use ipvlan_set_mode() helper to set the mode instead of assigning directly. This should avoid future erronus assignments / updates. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	ipvlan: scrub skb before routing in L3 mode.	Mahesh Bandewar	1	-9/+10
	Scrub skb before hitting the iptable hooks to ensure packets hit these hooks. Set the xnet param only when the packet is crossing the ns boundry so if the IPvlan slave and master belong to the same ns, the param will be set to false. Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge tag 'linux-can-next-for-4.6-20160220' of ↵	David S. Miller	9	-53/+1064
	git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2016-02-20 this is a pull request of 9 patch for net-next/master. The first 3 patches are from Damien Riegel, they add support for Technologic Systems IP core to tje sja100 driver. The next patches 6 by Marek Vasut (including one my me) first clean sort the CAN driver's Kconfig and Makefiles and then add support for the IFI CANFD IP core. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge branch 'bpf-helper-improvements'	David S. Miller	7	-49/+142
	Daniel Borkmann says: ==================== BPF updates This set contains various updates for eBPF, i.e. the addition of a generic csum helper function and other misc bits that mostly improve existing helpers and ease programming with eBPF on cls_bpf. For more details, please see individual patches. Set is rebased on top of http://patchwork.ozlabs.org/patch/584465/. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: don't emit mov A,A on return	Daniel Borkmann	1	-4/+6
	While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax', and found that this comes from BPF_RET \| BPF_A translations from classic BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0 already, therefore only emit a mov when immediates are used as return value. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: fix csum update in bpf_l4_csum_replace helper for udp	Daniel Borkmann	2	-1/+8
	When using this helper for updating UDP checksums, we need to extend this in order to write CSUM_MANGLED_0 for csum computations that result into 0 as sum. Reason we need this is because packets with a checksum could otherwise become incorrectly marked as a packet without a checksum. Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should not turn packets without a checksum into ones with a checksum. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: try harder on clones when writing into skb	Daniel Borkmann	4	-28/+24
	When we're dealing with clones and the area is not writeable, try harder and get a copy via pskb_expand_head(). Replace also other occurences in tc actions with the new skb_try_make_writable(). Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: remove artificial bpf_skb_{load, store}_bytes buffer limitation	Daniel Borkmann	1	-13/+14
	We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes() helpers to only store or load a maximum buffer of 16 bytes. Thus, loading, rewriting and storing headers require several bpf_skb_load_bytes() and bpf_skb_store_bytes() calls. Also here we can use a per-cpu scratch buffer instead in order to not pressure stack space any further. I do suspect that this limit was mainly set in place for this particular reason. So, ease program development by removing this limitation and make the scratchpad generic, so it can be reused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: add generic bpf_csum_diff helper	Daniel Borkmann	2	-0/+64
	For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's currently limited to handle 2 and 4 byte changes in a header and feeds the from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When working with IPv6, for example, this makes it rather cumbersome to deal with, similarly when editing larger parts of a header. Instead, extend the API in a more generic way: For bpf_l4_csum_replace(), add a case for header field mask of 0 to change the checksum at a given offset through inet_proto_csum_replace_by_diff(), and provide a helper bpf_csum_diff() that can generically calculate a from/to diff for arbitrary amounts of data. This can be used in multiple ways: for the bpf_l4_csum_replace() only part, this even provides us with the option to insert precalculated diffs from user space f.e. from a map, or from bpf_csum_diff() during runtime. bpf_csum_diff() has a optional from/to stack buffer input, so we can calculate a diff by using a scratchbuffer for scenarios where we're inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers don't need to be of equal size) data. Also, bpf_csum_diff() allows to feed a previous csum into csum_partial(), so the function can also be cascaded. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	bpf: add new arg_type that allows for 0 sized stack buffer	Daniel Borkmann	2	-10/+33
	Currently, when we pass a buffer from the eBPF stack into a helper function, the function proto indicates argument types as ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1> must be of the latter type. Then, verifier checks whether the buffer points into eBPF stack, is initialized, etc. The verifier also guarantees that the constant value passed in R<X+1> is greater than 0, so helper functions don't need to test for it and can always assume a non-NULL initialized buffer as well as non-0 buffer size. This patch adds a new argument types ARG_CONST_STACK_SIZE_OR_ZERO that allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function. Such helper functions, of course, need to be able to handle these cases internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0 or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any other combinations are not possible to load. I went through various options of extending the verifier, and introducing the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes needed to the verifier. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge branch 'geneve-vxlan-outer-checksum'	David S. Miller	3	-19/+18
	Alexander Duyck says: ==================== GENEVE/VXLAN: Enable outer Tx checksum by default This patch series makes it so that we enable the outer Tx checksum for IPv4 tunnels by default. This makes the behavior consistent with how we were handling this for IPv6. In addition I have updated the internal flags for these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should match up will with the ZERO_CSUM6_TX flag which was already in use for IPv6. For most network devices this should be a net gain in terms of performance as having the outer header checksum present allows for devices to report CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order to determine if the inner header checksum is valid. Below is some data I collected with ixgbe with an X540 that demonstrates this. I located two PFs connected back to back in two different name spaces and then setup a pair of tunnels on each, one with checksum enabled and one without. Recv Send Send Utilization Socket Socket Message Elapsed Send Size Size Size Time Throughput local bytes bytes bytes secs. 10^6bits/s % S noudpcsum: 87380 16384 16384 30.00 8898.67 12.80 udpcsum: 87380 16384 16384 30.00 9088.47 5.69 The one spot where this may cause a performance regression is if the environment contains devices that can parse the inner headers and a device supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM. In the case of such a device we have to fall back to using GSO to segment the tunnel instead of TSO and as a result we may take a performance hit as seen below with i40e. Recv Send Send Utilization Socket Socket Message Elapsed Send Size Size Size Time Throughput local bytes bytes bytes secs. 10^6bits/s % S noudpcsum: 87380 16384 16384 30.00 9085.21 3.32 udpcsum: 87380 16384 16384 30.00 9089.23 5.54 In addition it will be necessary to update iproute2 so that we don't provide the checksum attribute unless specified. This way on older kernels which don't have local checksum offload we will default to disabling the outer checksum, and on newer kernels that have LCO we can default to enabling it. I also haven't investigated the effect this will have on OVS. However I suspect the impact should be minimal as the worst case scenario should be that Tx checksumming will become enabled by default which should be consistent with the existing behavior for IPv6. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	VXLAN: Support outer IPv4 Tx checksums by default	Alexander Duyck	2	-11/+10
	This change makes it so that if UDP CSUM is not specified we will default to enabling it. The main motivation behind this is the fact that with the use of outer checksum we can greatly improve the performance for VXLAN tunnels on devices that don't know how to parse tunnel headers. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	GENEVE: Support outer IPv4 Tx checksums by default	Alexander Duyck	1	-8/+8
	This change makes it so that if UDP CSUM is not specified we will default to enabling it. The main motivation behind this is the fact that with the use of outer checksum we can greatly improve the performance for GENEVE tunnels on hardware that doesn't know how to parse them. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	Merge branch 'lwt-autoload'	David S. Miller	4	-1/+42
	Robert Shearman says: ==================== lwtunnel: autoload of lwt modules Changes since v1: - remove "LWTUNNEL_ENCAP_" prefix for the string form of the encaps used when requesting the module to reduce duplication, and don't bother returning strings for lwt modules using netdevices, both suggested by Jiri. - update commit message of first patch to clarify security implications, in response to Eric's comments. The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, these patches add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	ila: autoload module	Robert Shearman	1	-0/+1
	Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	mpls: autoload lwt module	Robert Shearman	1	-0/+1
	Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	lwtunnel: autoload of lwt modules	Robert Shearman	2	-1/+40
	The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Only users with the CAP_NET_ADMIN capability can cause modules to be loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx messages for users without this capability, and by lwtunnel_build_state not being called in response to RTM_GETxxx messages. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21	vlan: turn on unicast filtering on vlan device	Zhang Shengju	2	-1/+1
	Currently vlan device inherits unicast filtering flag from underlying device. If underlying device doesn't support unicast filter, this will put vlan device into promiscuous mode when it's stacked. Tun on IFF_UNICAST_FLT on the vlan device in any case so that it does not go into promiscuous mode needlessly. If underlying device does not support unicast filtering, that device will enter promiscuous mode. Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20	can: ifi: Add IFI CANFD IP support	Marek Vasut	5	-0/+932
	The patch adds support for IFI CAN/FD controller [1]. This driver currently supports sending and receiving both standard CAN and new CAN/FD frames. Both ISO and BOSCH variant of CAN/FD is supported. [1] http://www.ifi-pld.de/IP/CANFD/canfd.html Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: ifi: Add DT bindings for ifi,canfd	Marek Vasut	1	-0/+15
	Add device tree bindings for the I/F/I CANFD controller IP core. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	of: Add vendor prefix for I/F/I	Marek Vasut	1	-0/+1
	Add vendor prefix for I/F/I, Ingenieurbüro Für IC-Technologie http://www.ifi-pld.de/ Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: Makefile: Sort the Makefile	Marek Vasut	1	-8/+8
	Just sort the drivers in the Makefile, no functional change. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: Kconfig: sort drivers alphabetically	Marc Kleine-Budde	1	-25/+24
	Sort the drivers that are directly listed in the Kconfig alphabetically, no functional change. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: Kconfig: Sort the Kconfig includes	Marek Vasut	1	-10/+4
	Sort the Kconfig includes, no functional change. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: sja1000: of: add compatibility with Technologic Systems version	Damien Riegel	1	-0/+47
	Technologic Systems provides an IP compatible with the SJA1000, instantiated in an FPGA. Because of some bus widths issue, access to registers is made through a "window" that works like this: base + 0x0: address to read/write base + 0x2: 8-bit register value This commit adds a new compatible device, "technologic,sja1000", with read and write functions using the window mechanism. Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: sja1000: add documentation for Technologic Systems version	Damien Riegel	1	-1/+2
	This commit adds documentation for the Technologic Systems version of SJA1000. The difference with the NXP version is in the way the registers are accessed. Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	can: sja1000: of: add per-compatible init hook	Damien Riegel	1	-9/+31
	This commit adds the capability to allocate and init private data embedded in the sja1000_priv structure on a per-compatible basis. The device node is passed as a parameter of the init callback to allow parsing of custom device tree properties. Signed-off-by: Damien Riegel <damien.riegel@savoirfairelinux.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-02-20	Merge branch 'bpf-get-stackid'	David S. Miller	18	-30/+642
	Alexei Starovoitov says: ==================== bpf_get_stackid() and stack_trace map This patch set introduces new map type to store stack traces and corresponding bpf_get_stackid() helper. BPF programs already can walk the stack via unrolled loop of bpf_probe_read()s which is ok for simple analysis, but it's not efficient and limited to <30 frames after that the programs don't fit into MAX_BPF_STACK. With bpf_get_stackid() helper the programs can collect up to PERF_MAX_STACK_DEPTH both user and kernel frames. Using stack traces as a key in a map turned out to be very useful for generating flame graphs, off-cpu graphs, waker and chain graphs. Patch 3 is a simplified version of 'offwaketime' tool which is described in detail here: http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html Earlier version of this patch were using save_stack_trace() helper, but 'unreliable' frames add to much noise and two equiavlent stack traces produce different 'stackid's. Using lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is great for lockdep, but not acceptable for bpf, since the stack_trace map needs to be freed when user Ctrl-C the tool. The ftrace style with per_cpu(struct ftrace_stack) is great, but it's tightly coupled with ftrace ring buffer and has the same 'unreliable' noise. perf_event's perf_callchain() mechanism is also very efficient and it only needed minor generalization which is done in patch 1 to be used by bpf stack_trace maps. Peter, please take a look at patch 1. If you're ok with it, I'd like to take the whole set via net-next. Patch 1 - generalization of perf_callchain() Patch 2 - stack_trace map done as lock-less hashtable without link list to avoid spinlock on insertion which is critical path when bpf_get_stackid() helper is called for every task switch event Patch 3 - offwaketime example After the patch the 'perf report' for artificial 'sched_bench' benchmark that doing pthread_cond_wait/signal and 'offwaketime' example is running in the background: 16.35% swapper [kernel.vmlinux] [k] intel_idle 2.18% sched_bench [kernel.vmlinux] [k] __switch_to 2.18% sched_bench libpthread-2.12.so [.] pthread_cond_signal@@GLIBC_2.3.2 1.72% sched_bench libpthread-2.12.so [.] pthread_mutex_unlock 1.53% sched_bench [kernel.vmlinux] [k] bpf_get_stackid 1.44% sched_bench [kernel.vmlinux] [k] entry_SYSCALL_64 1.39% sched_bench [kernel.vmlinux] [k] __call_rcu.constprop.73 1.13% sched_bench libpthread-2.12.so [.] pthread_mutex_lock 1.07% sched_bench libpthread-2.12.so [.] pthread_cond_wait@@GLIBC_2.3.2 1.07% sched_bench [kernel.vmlinux] [k] hash_futex 1.05% sched_bench [kernel.vmlinux] [k] do_futex 1.05% sched_bench [kernel.vmlinux] [k] get_futex_key_refs.isra.13 The hotest part of bpf_get_stackid() is inlined jhash2, so we may consider using some faster hash in the future, but it's good enough for now. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20	samples/bpf: offwaketime example	Alexei Starovoitov	4	-0/+322
	This is simplified version of Brendan Gregg's offwaketime: This program shows kernel stack traces and task names that were blocked and "off-CPU", along with the stack traces and task names for the threads that woke them, and the total elapsed time from when they blocked to when they were woken up. The combined stacks, task names, and total time is summarized in kernel context for efficiency. Example: $ sudo ./offwaketime \| flamegraph.pl > demo.svg Open demo.svg in the browser as FlameGraph visualization. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20	bpf: introduce BPF_MAP_TYPE_STACK_TRACE	Alexei Starovoitov	6	-1/+269
	add new map type to store stack traces and corresponding helper bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id @ctx: struct pt_regs* @map: pointer to stack_trace map @flags: bits 0-7 - numer of stack frames to skip bit 8 - collect user stack instead of kernel bit 9 - compare stacks by hash only bit 10 - if two different stacks hash into the same stackid discard old other bits - reserved Return: >= 0 stackid on success or negative error stackid is a 32-bit integer handle that can be further combined with other data (including other stackid) and used as a key into maps. Userspace will access stackmap using standard lookup/delete syscall commands to retrieve full stack trace for given stackid. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20	perf: generalize perf_callchain	Alexei Starovoitov	8	-29/+51
	. avoid walking the stack when there is no room left in the buffer . generalize get_perf_callchain() to be called from bpf helper Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	net: use skb_postpush_rcsum instead of own implementations	Daniel Borkmann	5	-20/+7
	Replace individual implementations with the recently introduced skb_postpush_rcsum() helper. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tom Herbert <tom@herbertland.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	phy: marvell/micrel: Fix Unpossible condition	Andrew Lunn	2	-9/+10
	commit 2b2427d06426 ("phy: micrel: Add ethtool statistics counters") from Dec 30, 2015, leads to the following static checker warning: drivers/net/phy/micrel.c:609 kszphy_get_stat() warn: unsigned 'val' is never less than zero. drivers/net/phy/micrel.c 602 static u64 kszphy_get_stat(struct phy_device phydev, int i) 603 { 604 struct kszphy_hw_stat stat = kszphy_hw_stats[i]; 605 struct kszphy_priv priv = phydev->priv; 606 u64 val; 607 608 val = phy_read(phydev, stat.reg); 609 if (val < 0) { ^^^^^^^ Unpossible! 610 val = UINT64_MAX; 611 } else { 612 val = val & ((1 << stat.bits) - 1); 613 priv->stats[i] += val; 614 val = priv->stats[i]; 615 } 616 617 return val; 618 } The same problem exists in the Marvell driver. Fix both. Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Julia.Lawall <julia.lawall@lip6.fr> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	Merge branch 'ethtool-perqueue-params'	David S. Miller	16	-73/+760
	Kan Liang says: ==================== ethtool per queue parameters support Modern network interface controllers usually support multiple receive and transmit queues. Each queue may have its own parameters. For example, Intel XL710/X710 hardware supports per queue interrupt moderation. However, current ethtool does not support per queue parameters option. User has to set parameters for the whole NIC. This series extends ethtool to support per queue parameters option. Since the support of per queue parameters vary with different cards, it is impossible to address all cards in one patch. This series only supports per queue coalesce options on i40e driver. The framework used in the patch can be easily extended to other cards and parameters. The lib bitmap needs to be extended to facilitate exchanging queue bitmaps between user space and kernel space. Two patches from David's latest V8 patch series are also cited in this series. You may refer to https://lkml.org/lkml/2016/2/9/919 for more details. Changes since V6: - Rebase on commit 76d13b568776. Did minor change in patch 6. Changes since V5: - Add test_bitmap.c and bitmap.sh in the series. They are forgot to be added previously. - Update the first two patches to David's latest V8 version. The changes include - bitmap u32 API returns number of bits copied, unit tests updated - module_exit in test_bitmap - Also change the mode of bitmap.sh to 755 according to Ben's suggestion Changes since V4: - Modify set/get_per_queue_coalesce function description - Change the queue number to be u32 - Correct an error of calculating coalesce backup buffer address - Rename queue_num to n_queues - Don't log error message in __i40e_get_coalesce Changes since V3: - Based on David's lib bitmap. - ETHTOOL_PERQUEUE should be handled before the containing switch - Make the rollback code unconditional - some minor changes according to Ben's feedback Changes since V2: - Add queue-specific settings for interrupt moderation in i40e Changes since V1: - Checking the sub-command number to determine whether the command requires CAP_NET_ADMIN - Refine the struct ethtool_per_queue_op and improve the comments - Use bitmap functions to parse queue mask - Improve comments - Use bitmap functions to parse queue mask - Improve comments - Add rollback support - Correct the way to find the vector for specific queue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	i40e/ethtool: support coalesce setting by queue	Kan Liang	1	-0/+7
	This patch implements set_per_queue_coalesce for i40e driver. Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	i40e/ethtool: support coalesce getting by queue	Kan Liang	1	-0/+7
	This patch implements get_per_queue_coalesce for i40e driver. Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	i40e: queue-specific settings for interrupt moderation	Kan Liang	6	-70/+120
	For i40e driver, each vector has its own ITR register. However, there are no concept of queue-specific settings in the driver proper. Only global variable is used to store ITR values. That will cause problems especially when resetting the vector. The specific ITR values could be lost. This patch move rx_itr_setting and tx_itr_setting to i40e_ring to store specific ITR register for each queue. i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to support queue-specific settings. To make it compatible with old ethtool, if user doesn't specify the queue number, i40e_get_coalesce will return queue 0's value. While i40e_set_coalesce will apply value to all queues. Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	net/ethtool: support set coalesce per queue	Kan Liang	2	-0/+68
	This patch implements sub command ETHTOOL_SCOALESCE for ioctl ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to set coalesce of each masked queue to device driver. The wanted coalesce information are stored in "data" for each masked queue, which can copy from userspace. If it fails to set coalesce to device driver, the value which already set to specific queue will be tried to rollback. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	net/ethtool: support get coalesce per queue	Kan Liang	2	-2/+41
	This patch implements sub command ETHTOOL_GCOALESCE for ioctl ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to get coalesce of each masked queue from device driver. Then the interrupt coalescing parameters will be copied back to user space one by one. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19	net/ethtool: introduce a new ioctl for per queue setting	Kan Liang	2	-2/+42
	Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting. The following patches will enable some SUB_COMMANDs for per queue setting. Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>