summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2018-11-20tcp: drop dst in tcp_add_backlog()Eric Dumazet1-0/+2
Under stress, softirq rx handler often hits a socket owned by the user, and has to queue the packet into socket backlog. When this happens, skb dst refcount is taken before we escape rcu protected region. This is done from __sk_add_backlog() calling skb_dst_force(). Consumer will have to perform the opposite costly operation. AFAIK nothing in tcp stack requests the dst after skb was stored in the backlog. If this was the case, we would have had failures already since skb_dst_force() can end up clearing skb dst anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20ipv4: Don't try to print ASCII of link level header in martian dumps.David S. Miller1-1/+1
This has no value whatsoever. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-20net_sched: sch_fq: avoid calling ktime_get_ns() if not neededEric Dumazet1-1/+6
There are two cases were we can avoid calling ktime_get_ns() : 1) Queue is empty. 2) Internal queue is not empty. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: cls_u32: add res to offload informationJakub Kicinski1-0/+2
In case of egress offloads the class/flowid assigned by the filter may be very important for offloaded Qdisc selection. Provide this info to drivers. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: gred: support reporting stats from offloadsJakub Kicinski1-0/+47
Allow drivers which offload GRED to report back statistics. Since A lot of GRED stats is fairly ad hoc in nature pass to drivers the standard struct gnet_stats_basic/gnet_stats_queue pairs, and untangle the values in the core. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: sched: gred: add basic Qdisc offloadJakub Kicinski1-0/+47
Add basic offload for the GRED Qdisc. Inform the drivers any time Qdisc or virtual queue configuration changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19net: skb_scrub_packet(): Scrub offload_fwd_markPetr Machata1-0/+5
When a packet is trapped and the corresponding SKB marked as already-forwarded, it retains this marking even after it is forwarded across veth links into another bridge. There, since it ingresses the bridge over veth, which doesn't have offload_fwd_mark, it triggers a warning in nbp_switchdev_frame_mark(). Then nbp_switchdev_allowed_egress() decides not to allow egress from this bridge through another veth, because the SKB is already marked, and the mark (of 0) of course matches. Thus the packet is incorrectly blocked. Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That function is called from tunnels and also from veth, and thus catches the cases where traffic is forwarded between bridges and transformed in a way that invalidates the marking. Signed-off-by: Petr Machata <petrm@mellanox.com> Suggested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: add sockopt SCTP_EVENTXin Long1-0/+88
This patch adds sockopt SCTP_EVENT described in rfc6525#section-6.2. With this sockopt users can subscribe to an event from a specified asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: rename enum sctp_event to sctp_event_typeXin Long3-8/+8
sctp_event is a structure name defined in RFC for sockopt SCTP_EVENT. To avoid the conflict, rename it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: add subscribe per asocXin Long5-10/+15
The member subscribe should be per asoc, so that sockopt SCTP_EVENT in the next patch can subscribe a event from one asoc only. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19sctp: define subscribe in sctp_sock as __u16Xin Long4-20/+38
The member subscribe in sctp_sock is used to indicate to which of the events it is subscribed, more like a group of flags. So it's better to be defined as __u16 (2 bytpes), instead of struct sctp_event_subscribe (13 bytes). Note that sctp_event_subscribe is an UAPI struct, used on sockopt calls, and thus it will not be removed. This patch only changes the internal storage of the flags. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller20-93/+189
2018-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds17-63/+143
Pull networking fixes from David Miller: 1) Fix some potentially uninitialized variables and use-after-free in kvaser_usb can drier, from Jimmy Assarsson. 2) Fix leaks in qed driver, from Denis Bolotin. 3) Socket leak in l2tp, from Xin Long. 4) RSS context allocation fix in bnxt_en from Michael Chan. 5) Fix cxgb4 build errors, from Ganesh Goudar. 6) Route leaks in ipv6 when removing exceptions, from Xin Long. 7) Memory leak in IDR allocation handling of act_pedit, from Davide Caratti. 8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov. 9) When MTU is locked, do not force DF bit on ipv4 tunnels. From Sabrina Dubroca. 10) When NAPI cached skb is reused, we must set it to the proper initial state which includes skb->pkt_type. From Eric Dumazet. 11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy. 12) Set RX queue properly in various tuntap receive paths, from Matthew Cover. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) tuntap: fix multiqueue rx ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF tipc: don't assume linear buffer when reading ancillary data tipc: fix lockdep warning when reinitilaizing sockets net-gro: reset skb->pkt_type in napi_reuse_skb() tc-testing: tdc.py: Guard against lack of returncode in executed command tc-testing: tdc.py: ignore errors when decoding stdout/stderr ip_tunnel: don't force DF when MTU is locked MAINTAINERS: Add entry for CAKE qdisc net: bridge: fix vlan stats use-after-free on destruction socket: do a generic_file_splice_read when proto_ops has no splice_read net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs" net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs net/sched: act_pedit: fix memory leak when IDR allocation fails net: lantiq: Fix returned value in case of error in 'xrx200_probe()' ipv6: fix a dst leak when removing its exception net: mvneta: Don't advertise 2.5G modes drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo net/mlx4: Fix UBSAN warning of signed integer overflow ...
2018-11-18ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRFDavid Ahern1-2/+5
Preethi reported that PMTU discovery for UDP/raw applications is not working in the presence of VRF when the socket is not bound to a device. The problem is that ip6_sk_update_pmtu does not consider the L3 domain of the skb device if the socket is not bound. Update the function to set oif to the L3 master device if relevant. Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack") Reported-by: Preethi Ramachandra <preethir@juniper.net> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17tipc: don't assume linear buffer when reading ancillary dataJon Maloy1-4/+11
The code for reading ancillary data from a received buffer is assuming the buffer is linear. To make this assumption true we have to linearize the buffer before message data is read. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17tipc: fix lockdep warning when reinitilaizing socketsJon Maloy3-18/+48
We get the following warning: [ 47.926140] 32-bit node address hash set to 2010a0a [ 47.927202] [ 47.927433] ================================ [ 47.928050] WARNING: inconsistent lock state [ 47.928661] 4.19.0+ #37 Tainted: G E [ 47.929346] -------------------------------- [ 47.929954] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 47.930116] swapper/3/0 [HC0[0]:SC1[3]:HE1:SE0] takes: [ 47.930116] 00000000af8bc31e (&(&ht->lock)->rlock){+.?.}, at: rhashtable_walk_enter+0x36/0xb0 [ 47.930116] {SOFTIRQ-ON-W} state was registered at: [ 47.930116] _raw_spin_lock+0x29/0x60 [ 47.930116] rht_deferred_worker+0x556/0x810 [ 47.930116] process_one_work+0x1f5/0x540 [ 47.930116] worker_thread+0x64/0x3e0 [ 47.930116] kthread+0x112/0x150 [ 47.930116] ret_from_fork+0x3a/0x50 [ 47.930116] irq event stamp: 14044 [ 47.930116] hardirqs last enabled at (14044): [<ffffffff9a07fbba>] __local_bh_enable_ip+0x7a/0xf0 [ 47.938117] hardirqs last disabled at (14043): [<ffffffff9a07fb81>] __local_bh_enable_ip+0x41/0xf0 [ 47.938117] softirqs last enabled at (14028): [<ffffffff9a0803ee>] irq_enter+0x5e/0x60 [ 47.938117] softirqs last disabled at (14029): [<ffffffff9a0804a5>] irq_exit+0xb5/0xc0 [ 47.938117] [ 47.938117] other info that might help us debug this: [ 47.938117] Possible unsafe locking scenario: [ 47.938117] [ 47.938117] CPU0 [ 47.938117] ---- [ 47.938117] lock(&(&ht->lock)->rlock); [ 47.938117] <Interrupt> [ 47.938117] lock(&(&ht->lock)->rlock); [ 47.938117] [ 47.938117] *** DEADLOCK *** [ 47.938117] [ 47.938117] 2 locks held by swapper/3/0: [ 47.938117] #0: 0000000062c64f90 ((&d->timer)){+.-.}, at: call_timer_fn+0x5/0x280 [ 47.938117] #1: 00000000ee39619c (&(&d->lock)->rlock){+.-.}, at: tipc_disc_timeout+0xc8/0x540 [tipc] [ 47.938117] [ 47.938117] stack backtrace: [ 47.938117] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 4.19.0+ #37 [ 47.938117] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 47.938117] Call Trace: [ 47.938117] <IRQ> [ 47.938117] dump_stack+0x5e/0x8b [ 47.938117] print_usage_bug+0x1ed/0x1ff [ 47.938117] mark_lock+0x5b5/0x630 [ 47.938117] __lock_acquire+0x4c0/0x18f0 [ 47.938117] ? lock_acquire+0xa6/0x180 [ 47.938117] lock_acquire+0xa6/0x180 [ 47.938117] ? rhashtable_walk_enter+0x36/0xb0 [ 47.938117] _raw_spin_lock+0x29/0x60 [ 47.938117] ? rhashtable_walk_enter+0x36/0xb0 [ 47.938117] rhashtable_walk_enter+0x36/0xb0 [ 47.938117] tipc_sk_reinit+0xb0/0x410 [tipc] [ 47.938117] ? mark_held_locks+0x6f/0x90 [ 47.938117] ? __local_bh_enable_ip+0x7a/0xf0 [ 47.938117] ? lockdep_hardirqs_on+0x20/0x1a0 [ 47.938117] tipc_net_finalize+0xbf/0x180 [tipc] [ 47.938117] tipc_disc_timeout+0x509/0x540 [tipc] [ 47.938117] ? call_timer_fn+0x5/0x280 [ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc] [ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc] [ 47.938117] call_timer_fn+0xa1/0x280 [ 47.938117] ? tipc_disc_msg_xmit.isra.19+0xa0/0xa0 [tipc] [ 47.938117] run_timer_softirq+0x1f2/0x4d0 [ 47.938117] __do_softirq+0xfc/0x413 [ 47.938117] irq_exit+0xb5/0xc0 [ 47.938117] smp_apic_timer_interrupt+0xac/0x210 [ 47.938117] apic_timer_interrupt+0xf/0x20 [ 47.938117] </IRQ> [ 47.938117] RIP: 0010:default_idle+0x1c/0x140 [ 47.938117] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 53 65 8b 2d d8 2b 74 65 0f 1f 44 00 00 e8 c6 2c 8b ff fb f4 <65> 8b 2d c5 2b 74 65 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 b4 2b [ 47.938117] RSP: 0018:ffffaf6ac0207ec8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 [ 47.938117] RAX: ffff8f5b3735e200 RBX: 0000000000000003 RCX: 0000000000000001 [ 47.938117] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8f5b3735e200 [ 47.938117] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000 [ 47.938117] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 47.938117] R13: 0000000000000000 R14: ffff8f5b3735e200 R15: ffff8f5b3735e200 [ 47.938117] ? default_idle+0x1a/0x140 [ 47.938117] do_idle+0x1bc/0x280 [ 47.938117] cpu_startup_entry+0x19/0x20 [ 47.938117] start_secondary+0x187/0x1c0 [ 47.938117] secondary_startup_64+0xa4/0xb0 The reason seems to be that tipc_net_finalize()->tipc_sk_reinit() is calling the function rhashtable_walk_enter() within a timer interrupt. We fix this by executing tipc_net_finalize() in work queue context. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net-gro: reset skb->pkt_type in napi_reuse_skb()Eric Dumazet1-0/+4
eth_type_trans() assumes initial value for skb->pkt_type is PACKET_HOST. This is indeed the value right after a fresh skb allocation. However, it is possible that GRO merged a packet with a different value (like PACKET_OTHERHOST in case macvlan is used), so we need to make sure napi->skb will have pkt_type set back to PACKET_HOST. Otherwise, valid packets might be dropped by the stack because their pkt_type is not PACKET_HOST. napi_reuse_skb() was added in commit 96e93eab2033 ("gro: Add internal interfaces for VLAN"), but this bug always has been there. Fixes: 96e93eab2033 ("gro: Add internal interfaces for VLAN") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17ip_tunnel: don't force DF when MTU is lockedSabrina Dubroca1-1/+1
The various types of tunnels running over IPv4 can ask to set the DF bit to do PMTU discovery. However, PMTU discovery is subject to the threshold set by the net.ipv4.route.min_pmtu sysctl, and is also disabled on routes with "mtu lock". In those cases, we shouldn't set the DF bit. This patch makes setting the DF bit conditional on the route's MTU locking state. This issue seems to be older than git history. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net: bridge: fix vlan stats use-after-free on destructionNikolay Aleksandrov2-1/+9
Syzbot reported a use-after-free of the global vlan context on port vlan destruction. When I added per-port vlan stats I missed the fact that the global vlan context can be freed before the per-port vlan rcu callback. There're a few different ways to deal with this, I've chosen to add a new private flag that is set only when per-port stats are allocated so we can directly check it on destruction without dereferencing the global context at all. The new field in net_bridge_vlan uses a hole. v2: cosmetic change, move the check to br_process_vlan_info where the other checks are done v3: add change log in the patch, add private (in-kernel only) flags in a hole in net_bridge_vlan struct and use that instead of mixing user-space flags with private flags Fixes: 9163a0fc1f0c ("net: bridge: add support for per-port vlan stats") Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17socket: do a generic_file_splice_read when proto_ops has no splice_readSlavomir Kaslev1-1/+1
splice(2) fails with -EINVAL when called reading on a socket with no splice_read set in its proto_ops (such as vsock sockets). Switch this to fallbacks to a generic_file_splice_read instead. Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Configure multi-package, multi-channel modes with failoverSamuel Mendoza-Jonas5-88/+478
This patch extends the ncsi-netlink interface with two new commands and three new attributes to configure multiple packages and/or channels at once, and configure specific failover modes. NCSI_CMD_SET_PACKAGE mask and NCSI_CMD_SET_CHANNEL_MASK set a whitelist of packages or channels allowed to be configured with the NCSI_ATTR_PACKAGE_MASK and NCSI_ATTR_CHANNEL_MASK attributes respectively. If one of these whitelists is set only packages or channels matching the whitelist are considered for the channel queue in ncsi_choose_active_channel(). These commands may also use the NCSI_ATTR_MULTI_FLAG to signal that multiple packages or channels may be configured simultaneously. NCSI hardware arbitration (HWA) must be available in order to enable multi-package mode. Multi-channel mode is always available. If the NCSI_ATTR_CHANNEL_ID attribute is present in the NCSI_CMD_SET_CHANNEL_MASK command the it sets the preferred channel as with the NCSI_CMD_SET_INTERFACE command. The combination of preferred channel and channel whitelist defines a primary channel and the allowed failover channels. If the NCSI_ATTR_MULTI_FLAG attribute is also present then the preferred channel is configured for Tx/Rx and the other channels are enabled only for Rx. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Reset channel state in ncsi_start_dev()Samuel Mendoza-Jonas3-12/+113
When the NCSI driver is stopped with ncsi_stop_dev() the channel monitors are stopped and the state set to "inactive". However the channels are still configured and active from the perspective of the network controller. We should suspend each active channel but in the context of ncsi_stop_dev() the transmit queue has been or is about to be stopped so we won't have time to do so. Instead when ncsi_start_dev() is called if the NCSI topology has already been probed then call ncsi_reset_dev() to suspend any channels that were previously active. This resets the network controller to a known state, provides an up to date view of channel link state, and makes sure that mode flags such as NCSI_MODE_TX_ENABLE are properly reset. In addition to ncsi_start_dev() use ncsi_reset_dev() in ncsi-netlink.c to update the channel configuration more cleanly. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Don't mark configured channels inactiveSamuel Mendoza-Jonas2-8/+12
The concepts of a channel being 'active' and it having link are slightly muddled in the NCSI driver. Tweak this slightly so that NCSI_CHANNEL_ACTIVE represents a channel that has been configured and enabled, and NCSI_CHANNEL_INACTIVE represents a de-configured channel. This distinction is important because a channel can be 'active' but have its link down; in this case the channel may still need to be configured so that it may receive AEN link-state-change packets. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Don't deselect package in suspend if activeSamuel Mendoza-Jonas1-2/+13
When a package is deselected all channels of that package cease communication. If there are other channels active on the package of the suspended channel this will disable them as well, so only send a deselect-package command if no other channels are active. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Probe single packages to avoid conflictSamuel Mendoza-Jonas2-55/+31
Currently the NCSI driver sends a select-package command to all possible packages simultaneously to discover what packages are available. However at this stage in the probe process the driver does not know if hardware arbitration is available: if it isn't then this process could cause collisions on the RMII bus when packages try to respond. Update the probe loop to probe each package one by one, and once complete check if HWA is universally supported. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17net/ncsi: Don't enable all channels when HWA availableSamuel Mendoza-Jonas2-48/+5
NCSI hardware arbitration allows multiple packages to be enabled at once and share the same wiring. If the NCSI driver recognises that HWA is available it unconditionally enables all packages and channels; but that is a configuration decision rather than something required by HWA. Additionally the current implementation will not failover on link events which can cause connectivity to be lost unless the interface is manually bounced. Retain basic HWA support but remove the separate configuration path to enable all channels, leaving this to be handled by a later implementation. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17tcp: add SRTT to SCM_TIMESTAMPING_OPT_STATSYousuk Seung1-0/+2
Add TCP_NLA_SRTT to SCM_TIMESTAMPING_OPT_STATS that reports the smoothed round trip time in microseconds (tcp_sock.srtt_us >> 3). Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: allow manipulating per-DP RED flagsJakub Kicinski1-1/+143
Allow users to set and dump RED flags (ECN enabled and harddrop) on per-virtual queue basis. Validation of attributes is split from changes to make sure we won't have to undo previous operations when we find out configuration is invalid. The objective is to allow changing per-Qdisc parameters without overwriting the per-vq configured flags. Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and per-Qdisc flags will always get propagated to the virtual queues. New user space which wants to make use of per-vq flags should set per-Qdisc flags to 0 and then configure per-vq flags as it sees fit. Once per-vq flags are set per-Qdisc flags can't be changed to non-zero. Vice versa - if the per-Qdisc flags are non-zero the TCA_GRED_VQ_FLAGS attribute has to either be omitted or set to the same value as per-Qdisc flags. Update per-Qdisc parameters: per-Qdisc | per-VQ | result 0 | 0 | all vq flags updated 0 | non-0 | error (vq flags in use) non-0 | 0 | -- impossible -- non-0 | non-0 | all vq flags updated Update per-VQ state (flags parameter not specified): no change to flags Update per-VQ state (flags parameter set): per-Qdisc | per-VQ | result 0 | any | per-vq flags updated non-0 | 0 | -- impossible -- non-0 | non-0 | error (per-Qdisc flags in use) Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: store red flags per virtual queueJakub Kicinski1-6/+18
Right now ECN marking and HARD drop (the common RED flags) can only be configured for the entire Qdisc. In preparation for per-vq flags store the values in the virtual queue structure. Setting per-vq flags will only be allowed when no flags are set for the entire Qdisc. For the new flags we will also make sure undefined bits are 0. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: provide a better structured dump and expose statsJakub Kicinski1-1/+52
Currently all GRED's virtual queue data is dumped in a single array in a single attribute. This makes it pretty much impossible to add new fields. In order to expose more detailed stats add a new set of attributes. We can now expose the 64 bit value of bytesin and all the mark stats which were not part of the original design. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: store bytesin as a 64 bit valueJakub Kicinski1-1/+1
32 bit counters for bytes are not really going to last long in modern world. Make sch_gred count bytes on a 64 bit counter. It will still get truncated during dump but follow up patch will add set of new stat dump attributes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: use extack to provide more details on configuration errorsJakub Kicinski1-11/+33
Add extack messages to -EINVAL errors, to help users identify their mistakes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: pass extack to nla_parse_nested()Jakub Kicinski1-2/+2
In case netlink wants to provide parsing error pass extack to nla_parse_nested(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: sched: gred: separate error and non-error path in gred_change()Jakub Kicinski1-6/+6
We will soon want to add more code to the non-error path, separate it from the error handling flow. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16udp: fix jump label misusePaolo Abeni2-4/+4
The commit 60fb9567bf30 ("udp: implement complete book-keeping for encap_needed") introduced a severe misuse of jump label APIs, which syzbot, as reported by Eric, was able to exploit. When multiple sockets/process can concurrently request (and than disable) the udp encap, we need to track the activation counter with *_inc()/*_dec() jump label variants, or we can experience bad things at disable time. Fixes: 60fb9567bf30 ("udp: implement complete book-keeping for encap_needed") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16etf: Drop all expired packetsJesus Sanchez-Palencia1-15/+21
Currently on dequeue() ETF only drops the first expired packet, which causes a problem if the next packet is already expired. When this happens, the watchdog will be configured with a time in the past, fire straight way and the packet will finally be dropped once the dequeue() function of the qdisc is called again. We can save quite a few cycles and improve the overall behavior of the qdisc if we drop all expired packets if the next packet is expired. This should allow ETF to recover faster from bad situations. But packet drops are still a very serious warning that the requirements imposed on the system aren't reasonable. This was inspired by how the implementation of hrtimers use the rb_tree inside the kernel. Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16etf: Split timersortedlist_erase()Jesus Sanchez-Palencia1-15/+29
This is just a refactor that will simplify the implementation of the next patch in this series which will drop all expired packets on the dequeue flow. Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16etf: Use cached rb_rootJesus Sanchez-Palencia1-9/+12
ETF's peek() operation is heavily used so use an rb_root_cached instead and leverage rb_first_cached() which will run in O(1) instead of O(log n). Even if on 'timesortedlist_clear()' we could be using rb_erase(), we choose to use rb_erase_cached(), because if in the future we allow runtime changes to ETF parameters, and need to do a '_clear()', this might cause some hard to debug issues. Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16etf: Cancel timer if there are no pending skbsJesus Sanchez-Palencia1-1/+3
There is no point in firing the qdisc watchdog if there are no future skbs pending in the queue and the watchdog had been set previously. Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16tcp: clean up STATE_TRACEYafang Shao1-4/+0
Currently we can use bpf or tcp tracepoint to conveniently trace the tcp state transition at the run time. So we don't need to do this stuff at the compile time anymore. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16Merge tag 'batadv-net-for-davem-20181114' of git://git.open-mesh.org/linux-mergeDavid S. Miller2-3/+5
Simon Wunderlich says: ==================== Here are two batman-adv bugfixes: - Explicitly pad short ELP packets with zeros, by Sven Eckelmann - Fix packet size calculation when merging fragments, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net/sched: act_pedit: fix memory leak when IDR allocation failsDavide Caratti1-1/+2
tcf_idr_check_alloc() can return a negative value, on allocation failures (-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases. Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: 8021q: move vlan offload registrations into vlan_coreJiri Pirko2-96/+99
Currently, the vlan packet offloads are registered only upon 8021q module load. However, even without this module loaded, the offloads could be utilized, for example by openvswitch datapath. As reported by Michael, that causes 2x to 5x performance improvement, depending on a testcase. So move the vlan offload registrations into vlan_core and make this available even without 8021q module loaded. Reported-by: Michael Shteinbok <michaelsh86@gmail.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Tested-by: Michael Shteinbok <michaelsh86@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16ipv6: fix a dst leak when removing its exceptionXin Long1-4/+3
These is no need to hold dst before calling rt6_remove_exception_rt(). The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(), which has been removed in Commit 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes"). Otherwise, it will cause a dst leak. This patch is to simply remove the dst_hold_safe() call before calling rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt(). It's safe, because the removal of the exception that holds its dst's refcnt is protected by rt6_exception_lock. Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes") Fixes: 23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net/decnet: add missing indentationColin Ian King1-1/+1
There is a missing indentation before the declaration of port. Add it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: remove unused skb_send_sock()Cong Wang1-13/+0
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net: remove VLAN_TAG_PRESENTMichał Mirosław1-6/+0
Replace VLAN_TAG_PRESENT with single bit flag and free up VLAN.CFI overload. Now VLAN.CFI is visible in networking stack and can be passed around intact. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16net/bpf: split VLAN_PRESENT bit handling from VLAN_TCIMichał Mirosław1-19/+21
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-15Merge tag 'batadv-next-for-davem-20181114' of ↵David S. Miller19-160/+236
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This feature/cleanup patchset includes the following patches: - Bump version strings, by Simon Wunderlich - Fixup includes, by Sven Eckelmann (3 patches) - Separate BATMAN_ADV_DEBUG from DEBUGFS, by Sven Eckelmann - Fixup tracing log documentation, by Sven Eckelmann - Use exclusive locks to secure netlink information dump transfers, by Sven Eckelmann (8 patches) - Move CRC16 dependency, by Sven Eckelmann - Enable MCAST by default, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-15net: slightly optimize eth_type_transLi RongQing1-8/+10
netperf udp stream shows that eth_type_trans takes certain cpu, so adjust the mac address check order, and firstly check if it is device address, and only check if it is multicast address only if not the device address. After this change: To unicast, and skb dst mac is device mac, this is most of time reduce a comparision To unicast, and skb dst mac is not device mac, nothing change To multicast, increase a comparision Before: 1.03% [kernel] [k] eth_type_trans After: 0.78% [kernel] [k] eth_type_trans Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>