summaryrefslogtreecommitdiffstats
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2018-05-22net/ipv6: Add helper to return path MTU based on fib resultDavid Ahern3-0/+11
Determine path MTU from a FIB lookup result. Logic is based on ip6_dst_mtu_forward plus lookup of nexthop exception. Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core bpf code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-22net/ipv4: Add helper to return path MTU based on fib resultDavid Ahern1-0/+2
Determine path MTU from a FIB lookup result. Logic is a distillation of ip_dst_mtu_maybe_forward. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-0/+9
S390 bpf_jit.S is removed in net-next and had changes in 'net', since that code isn't used any more take the removal. TLS data structures split the TX and RX components in 'net-next', put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-20erspan: set bso bit based on mirrored packet's lenWilliam Tu1-0/+28
Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not handled. BSO has 4 possible values: 00 --> Good frame with no error, or unknown integrity 11 --> Payload is a Bad Frame with CRC or Alignment Error 01 --> Payload is a Short Frame 10 --> Payload is an Oversized Frame Based the short/oversized definitions in RFC1757, the patch sets the bso bit based on the mirrored packet's size. Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-20Merge branch 'for-upstream' of ↵David S. Miller1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Johan Hedberg says: ==================== pull request: bluetooth-next 2018-05-18 Here's the first bluetooth-next pull request for the 4.18 kernel: - Refactoring of the btbcm driver - New USB IDs for QCA_ROME and LiteOn controllers - Buffer overflow fix if the controller sends invalid advertising data length - Various cleanups & fixes for Qualcomm controllers Please let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19devlink: introduce a helper to generate physical port namesJiri Pirko1-0/+9
Each driver implements physical port name generation by itself. However as devlink has all needed info, it can easily do the job for all its users. So implement this helper in devlink. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19devlink: extend attrs_set for setting port flavoursJiri Pirko1-0/+3
Devlink ports can have specific flavour according to the purpose of use. This patch extend attrs_set so the driver can say which flavour port has. Initial flavours are: physical, cpu, dsa User can query this to see right away what is the purpose of each port. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-19devlink: introduce devlink_port_attrs_setJiri Pirko1-6/+14
Change existing setter for split port information into more generic attrs setter. Alongside with that, allow to set port number and subport number for split ports. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18tcp: add tcp_comp_sack_nr sysctlEric Dumazet1-0/+1
This per netns sysctl allows for TCP SACK compression fine-tuning. This limits number of SACK that can be compressed. Using 0 disables SACK compression. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18tcp: add tcp_comp_sack_delay_ns sysctlEric Dumazet1-0/+1
This per netns sysctl allows for TCP SACK compression fine-tuning. Its default value is 1,000,000, or 1 ms to meet TSO autosizing period. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18tcp: add SACK compressionEric Dumazet1-0/+3
When TCP receives an out-of-order packet, it immediately sends a SACK packet, generating network load but also forcing the receiver to send 1-MSS pathological packets, increasing its RTX queue length/depth, and thus processing time. Wifi networks suffer from this aggressive behavior, but generally speaking, all these SACK packets add fuel to the fire when networks are under congestion. This patch adds a high resolution timer and tp->compressed_ack counter. Instead of sending a SACK, we program this timer with a small delay, based on RTT and capped to 1 ms : delay = min ( 5 % of RTT, 1 ms) If subsequent SACKs need to be sent while the timer has not yet expired, we simply increment tp->compressed_ack. When timer expires, a SACK is sent with the latest information. Whenever an ACK is sent (if data is sent, or if in-order data is received) timer is canceled. Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent if the sack blocks need to be shuffled, even if the timer has not expired. A new SNMP counter is added in the following patch. Two other patches add sysctls to allow changing the 1,000,000 and 44 values that this commit hard-coded. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()Eric Dumazet1-1/+1
Socket can not disappear under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-18xsk: clean up SPDX headersBjörn Töpel1-11/+2
Clean up SPDX-License-Identifier and removing licensing leftovers. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18cfg80211: release station info tidstats where neededJohannes Berg1-0/+13
This fixes memory leaks in cases where we got the station info but failed sending it out properly. Fixes: 8689c051a201 ("cfg80211: dynamically allocate per-tid stats for station info") Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-18cfg80211: dynamically allocate per-tid stats for station infoArend van Spriel1-1/+9
With the addition of TXQ stats in the per-tid statistics the struct station_info grew significantly. This resulted in stack size warnings due to the structure itself being above the limit for the warnings. Add an allocation function that those who want to provide per-tid stats should use to allocate the tid array, i.e. struct station_info::pertid. Cc: Toke Høiland-Jørgensen <toke@toke.dk> Fixes: 52539ca89f36 ("cfg80211: Expose TXQ stats and parameters to userspace") Signed-off-by: Arend van Spriel <aspriel@gmail.com> [johannes: fix missing BIT() and logic by removing] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-05-18Bluetooth: Add __hci_cmd_send functionLoic Poulain1-0/+2
This function allows to send a HCI command without expecting any controller event/response in return. This is allowed for vendor- specific commands only. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-05-17tcp: new helper tcp_rack_skb_timeoutYuchung Cheng1-0/+2
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack to prepare the final RTO change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: account lost retransmit after timeoutYuchung Cheng1-0/+1
The previous approach for the lost and retransmit bits was to wipe the slate clean: zero all the lost and retransmit bits, correspondingly zero the lost_out and retrans_out counters, and then add back the lost bits (and correspondingly increment lost_out). The new approach is to treat this very much like marking packets lost in fast recovery. We don’t wipe the slate clean. We just say that for all packets that were not yet marked sacked or lost, we now mark them as lost in exactly the same way we do for fast recovery. This fixes the lost retransmit accounting at RTO time and greatly simplifies the RTO code by sharing much of the logic with Fast Recovery. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: simpler NewReno implementationYuchung Cheng1-0/+1
This is a rewrite of NewReno loss recovery implementation that is simpler and standalone for readability and better performance by using less states. Note that NewReno refers to RFC6582 as a modification to the fast recovery algorithm. It is used only if the connection does not support SACK in Linux. It should not to be confused with the Reno (AIMD) congestion control. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: support DUPACK threshold in RACKYuchung Cheng1-0/+1
This patch adds support for the classic DUPACK threshold rule (#DupThresh) in RACK. When the number of packets SACKed is greater or equal to the threshold, RACK sets the reordering window to zero which would immediately mark all the unsacked packets below the highest SACKed sequence lost. Since this approach is known to not work well with reordering, RACK only uses it if no reordering has been observed. The DUPACK threshold rule is a particularly useful extension to the fast recoveries triggered by RACK reordering timer. For example data-center transfers where the RTT is much smaller than a timer tick, or high RTT path where the default RTT/4 may take too long. Note that this patch differs slightly from RFC6675. RFC6675 considers a packet lost when at least #DupThresh higher-sequence packets are SACKed. With RACK, for connections that have seen reordering, RACK continues to use a dynamically-adaptive time-based reordering window to detect losses. But for connections on which we have not yet seen reordering, this patch considers a packet lost when at least one higher sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection has not seen reordering, and sends 10 packets, and packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost. There is some small risk of spurious retransmits here due to reordering. However, this is mostly limited to the first flight of a connection on which the sender receives SACKs from reordering. And RFC 6675 and FACK loss detection have a similar risk on the first flight with reordering (it's just that the risk of spurious retransmits from reordering was slightly narrower for those older algorithms due to the margin of 3*MSS). Also the minimum reordering window is reduced from 1 msec to 0 to recover quicker on short RTT transfers. Therefore RACK is more aggressive in marking packets lost during recovery to reduce the reordering window timeouts. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tls: don't use stack memory in a scatterlistMatt Mullins1-0/+3
scatterlist code expects virt_to_page() to work, which fails with CONFIG_VMAP_STACK=y. Fixes: c46234ebb4d1e ("tls: RX path for ktls") Signed-off-by: Matt Mullins <mmullins@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17sched: replace __QDISC_STATE_RUNNING bit with a spin lockPaolo Abeni1-5/+5
So that we can use lockdep on it. The newly introduced sequence lock has the same scope of busylock, so it shares the same lockdep annotation, but it's only used for NOLOCK qdiscs. With this changeset we acquire such lock in the control path around flushing operation (qdisc reset), to allow more NOLOCK qdisc perf improvement in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-5/+33
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-05-17 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Provide a new BPF helper for doing a FIB and neighbor lookup in the kernel tables from an XDP or tc BPF program. The helper provides a fast-path for forwarding packets. The API supports IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are implemented in this initial work, from David (Ahern). 2) Just a tiny diff but huge feature enabled for nfp driver by extending the BPF offload beyond a pure host processing offload. Offloaded XDP programs are allowed to set the RX queue index and thus opening the door for defining a fully programmable RSS/n-tuple filter replacement. Once BPF decided on a queue already, the device data-path will skip the conventional RSS processing completely, from Jakub. 3) The original sockmap implementation was array based similar to devmap. However unlike devmap where an ifindex has a 1:1 mapping into the map there are use cases with sockets that need to be referenced using longer keys. Hence, sockhash map is added reusing as much of the sockmap code as possible, from John. 4) Introduce BTF ID. The ID is allocatd through an IDR similar as with BPF maps and progs. It also makes BTF accessible to user space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data through BPF_OBJ_GET_INFO_BY_FD, from Martin. 5) Enable BPF stackmap with build_id also in NMI context. Due to the up_read() of current->mm->mmap_sem build_id cannot be parsed. This work defers the up_read() via a per-cpu irq_work so that at least limited support can be enabled, from Song. 6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND JIT conversion as well as implementation of an optimized 32/64 bit immediate load in the arm64 JIT that allows to reduce the number of emitted instructions; in case of tested real-world programs they were shrinking by three percent, from Daniel. 7) Add ifindex parameter to the libbpf loader in order to enable BPF offload support. Right now only iproute2 can load offloaded BPF and this will also enable libbpf for direct integration into other applications, from David (Beckett). 8) Convert the plain text documentation under Documentation/bpf/ into RST format since this is the appropriate standard the kernel is moving to for all documentation. Also add an overview README.rst, from Jesper. 9) Add __printf verification attribute to the bpf_verifier_vlog() helper. Though it uses va_list we can still allow gcc to check the format string, from Mathieu. 10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...' is a bash 4.0+ feature which is not guaranteed to be available when calling out to shell, therefore use a more portable variant, from Joe. 11) Fix a 64 bit division in xdp_umem_reg() by using div_u64() instead of relying on the gcc built-in, from Björn. 12) Fix a sock hashmap kmalloc warning reported by syzbot when an overly large key size is used in hashmap then causing overflows in htab->elem_size. Reject bogus attr->key_size early in the sock_hash_alloc(), from Yonghong. 13) Ensure in BPF selftests when urandom_read is being linked that --build-id is always enabled so that test_stacktrace_build_id[_nmi] won't be failing, from Alexei. 14) Add bitsperlong.h as well as errno.h uapi headers into the tools header infrastructure which point to one of the arch specific uapi headers. This was needed in order to fix a build error on some systems for the BPF selftests, from Sirio. 15) Allow for short options to be used in the xdp_monitor BPF sample code. And also a bpf.h tools uapi header sync in order to fix a selftest build failure. Both from Prashant. 16) More formally clarify the meaning of ID in the direct packet access section of the BPF documentation, from Wang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpersPaolo Abeni1-1/+9
Currently NOLOCK qdiscs pay a measurable overhead to atomically manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per packet in the uncontended scenario with packet rate below the line rate: on packed dequeue and on the next, failing dequeue attempt. This changeset moves the bit manipulation into the qdisc_run_{begin,end} helpers, so that the bit is now flipped only once per packet, with measurable performance improvement in the uncontended scenario. This also allows simplifying the qdisc teardown code path - since qdisc_is_running() is now effective for each qdisc type - and avoid a possible race between qdisc_run() and dev_deactivate_many(), as now the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy dequeuing packets. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16bonding: allow use of tx hashing in balance-albDebabrata Banerjee1-2/+9
The rx load balancing provided by balance-alb is not mutually exclusive with using hashing for tx selection, and should provide a decent speed increase because this eliminates spinlocks and cache contention. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16proc: introduce proc_create_net{,_data}Christoph Hellwig3-5/+13
Variants of proc_create{,_data} that directly take a struct seq_operations and deal with network namespaces in ->open and ->release. All callers of proc_create + seq_open_net converted over, and seq_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16net: move seq_file_single_net to <linux/seq_file_net.h>Christoph Hellwig1-12/+0
This helper deals with single_{open,release}_net internals and thus belongs here. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/raw: simplify ѕeq_file codeChristoph Hellwig1-4/+0
Pass the hashtable to the proc private data instead of copying it into the per-file private data. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/ping: simplify proc file creationChristoph Hellwig1-11/+0
Remove the pointless ping_seq_afinfo indirection and make the code look like most other protocols. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/tcp: simplify procfs registrationChristoph Hellwig1-8/+3
Avoid most of the afinfo indirections and just call the proc helpers directly. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/udp{,lite}: simplify proc registrationChristoph Hellwig1-12/+8
Remove a couple indirections to make the code look like most other protocols. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16proc: introduce proc_create_seq{,_data}Christoph Hellwig3-7/+9
Variants of proc_create{,_data} that directly take a struct seq_operations argument and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-15bpf: sockmap, refactor sockmap routines to work with hashmapJohn Fastabend1-2/+1
This patch only refactors the existing sockmap code. This will allow much of the psock initialization code path and bpf helper codes to work for both sockmap bpf map types that are backed by an array, the currently supported type, and the new hash backed bpf map type sockhash. Most the fallout comes from three changes, - Pushing bpf programs into an independent structure so we can use it from the htab struct in the next patch. - Generalizing helpers to use void *key instead of the hardcoded u32. - Instead of passing map/key through the metadata we now do the lookup inline. This avoids storing the key in the metadata which will be useful when keys can be longer than 4 bytes. We rename the sk pointers to sk_redir at this point as well to avoid any confusion between the current sk pointer and the redirect pointer sk_redir. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-14audit: use inline function to get audit contextRichard Guy Briggs1-1/+1
Recognizing that the audit context is an internal audit value, use an access function to retrieve the audit context pointer for the task rather than reaching directly into the task struct to get it. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in auditsc.c and selinuxfs.c, checkpatch.pl fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14sched: cls: enable verbose loggingMarcelo Ricardo Leitner1-2/+4
Currently, when the rule is not to be exclusively executed by the hardware, extack is not passed along and offloading failures don't get logged. The idea was that hardware failures are okay because the rule will get executed in software then and this way it doesn't confuse unware users. But this is not helpful in case one needs to understand why a certain rule failed to get offloaded. Considering it may have been a temporary failure, like resources exceeded or so, reproducing it later and knowing that it is triggering the same reason may be challenging. The ultimate goal is to improve Open vSwitch debuggability when using flower offloading. This patch adds a new flag to enable verbose logging. With the flag set, extack will be passed to the driver, which will be able to log the error. As the operation itself probably won't fail (not because of this, at least), current iproute will already log it as a Warning. The flag is generic, so it can be reused later. No need to restrict it just for HW offloading. The command line will follow the syntax that tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning. For example: # ./tc qdisc add dev p7p1 ingress # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \ flower verbose \ src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \ src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop Warning: TC offload is disabled on net device. # echo $? 0 # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \ flower \ src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \ src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop # echo $? 0 Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-14audit: convert sessionid unset to a macroRichard Guy Briggs1-1/+1
Use a macro, "AUDIT_SID_UNSET", to replace each instance of initialization and comparison to an audit session ID. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-0/+5
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Fix handling of simultaneous open TCP connection in conntrack, from Jozsef Kadlecsik. 2) Insufficient sanitify check of xtables extension names, from Florian Westphal. 3) Skip unnecessary synchronize_rcu() call when transaction log is already empty, from Florian Westphal. 4) Incorrect destination mac validation in ebt_stp, from Stephen Hemminger. 5) xtables module reference counter leak in nft_compat, from Florian Westphal. 6) Incorrect connection reference counting logic in IPVS one-packet scheduler, from Julian Anastasov. 7) Wrong stats for 32-bits CPU in IPVS, also from Julian. 8) Calm down sparse error in netfilter core, also from Florian. 9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct and nfnetlink_cthelper, again from Florian. 10) Missing module alias in icmp and icmp6 xtables extensions, from Florian Westphal. 11) Base chain statistics in nf_tables may be unset/null, from Florian. 12) Fix handling of large matchinfo size in nft_compat, this includes one preparation for before this fix. From Florian. 13) Fix bogus EBUSY error when deleting chains due to incorrect reference counting from the preparation phase of the two-phase commit protocol. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-2/+4
The bpf syscall and selftests conflicts were trivial overlapping changes. The r8169 change involved moving the added mdelay from 'net' into a different function. A TLS close bug fix overlapped with the splitting of the TLS state into separate TX and RX parts. I just expanded the tests in the bug fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf == X". Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11tcp: switch pacing timer to softirq based hrtimerEric Dumazet1-1/+3
linux-4.16 got support for softirq based hrtimers. TCP can switch its pacing hrtimer to this variant, since this avoids going through a tasklet and some atomic operations. pacing timer logic looks like other (jiffies based) tcp timers. v2: use hrtimer_try_to_cancel() in tcp_clear_xmit_timers() to correctly release reference on socket if needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11net: dsa: Plug in PHYLINK supportFlorian Fainelli1-0/+1
Add support for PHYLINK within the DSA subsystem in order to support more complex devices such as pluggable (SFP) and non-pluggable (SFF) modules, 10G PHYs, and traditional PHYs. Using PHYLINK allows us to drop some amount of complexity we had while probing fixed and non-fixed PHYs using Device Tree. Because PHYLINK separates the Ethernet MAC/port configuration into different stages, we let switch drivers implement those, and for now, we maintain functionality by calling dsa_slave_adjust_link() during phylink_mac_link_{up,down} which provides semantically equivalent steps. Drivers willing to take advantage of PHYLINK should implement the phylink_mac_* operations that DSA wraps. We cannot quite remove the adjust_link() callback just yet, because a number of drivers rely on that for configuring their "CPU" and "DSA" ports, this is done dsa_port_setup_phy_of() and dsa_port_fixed_link_register_of() still. Drivers that utilize fixed links for user-facing ports (e.g: bcm_sf2) will need to implement phylink_mac_ops from now on to preserve functionality, since PHYLINK *does not* create a phy_device instance for fixed links. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11net: dsa: Add PHYLINK switch operationsFlorian Fainelli1-0/+24
In preparation for adding support for PHYLINK within DSA, define a number of operations that we will need and that switch drivers can start implementing. Proper integration with PHYLINK will follow in subsequent patches. We start selecting PHYLINK (which implies PHYLIB) in net/dsa/Kconfig such that drivers can be guaranteed that this dependency is properly taken care of and can start referencing PHYLINK helper functions without requiring stubs or anything. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11bonding: send learning packets for vlans on slaveDebabrata Banerjee1-0/+1
There was a regression at some point from the intended functionality of commit f60c3704e87d ("bonding: Fix alb mode to only use first level vlans.") Given the return value vlan_get_encap_level() we need to store the nest level of the bond device, and then compare the vlan's encap level to this. Without this, this check always fails and learning packets are never sent. In addition, this same commit caused a regression in the behavior of balance_alb, which requires learning packets be sent for all interfaces using the slave's mac in order to load balance properly. For vlan's that have not set a user mac, we can send after checking one bit. Otherwise we need send the set mac, albeit defeating rx load balancing for that vlan. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11net/ipv6: Add fib lookup stubs for use in bpf helperDavid Ahern1-0/+14
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table, a stub to do a lookup in a specific table, fib6_table_lookup, and a stub for a full route lookup. The stubs are needed for core bpf code to handle the case when the IPv6 module is not builtin. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Add fib6_lookupDavid Ahern1-0/+6
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules, but returns a FIB entry, fib6_info, rather than a dst based rt6_info. fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled) to 60% faster than any of the dst based lookup methods (without custom rules) and 25% faster with custom rules (e.g., l3mdev rule). Since the lookup function has a completely different signature, fib6_rule_action is split into 2 paths: the existing one is renamed __fib6_rule_action and a new one for the fib6_info path is added. fib6_rule_action decides which to call based on the lookup_ptr. If it is fib6_table_lookup then the new path is taken. Caller must hold rcu lock as no reference is taken on the returned fib entry. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Extract table lookup from ip6_pol_routeDavid Ahern1-0/+4
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it moving the table lookup into a separate fib6_table_lookup that can be invoked separately and export the new function. ip6_pol_route now calls fib6_table_lookup and uses the result to generate a dst based rt6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Rename rt6_multipath_selectDavid Ahern1-0/+5
Rename rt6_multipath_select to fib6_multipath_select and export it. A later patch wants access to it similar to IPv4's fib_select_path. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Rename fib6_lookup to fib6_node_lookupDavid Ahern1-3/+3
Rename fib6_lookup to fib6_node_lookup to better reflect what it returns. The fib6_lookup name will be used in a later patch for an IPv6 equivalent to IPv4's fib_lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-10tcp: Add mark for TIMEWAIT socketsJon Maxwell1-0/+1
This version has some suggestions by Eric Dumazet: - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP races. - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark" statement. - Factorize code as sk_fullsock() check is not necessary. Aidan McGurn from Openwave Mobility systems reported the following bug: "Marked routing is broken on customer deployment. Its effects are large increase in Uplink retransmissions caused by the client never receiving the final ACK to their FINACK - this ACK misses the mark and routes out of the incorrect route." Currently marks are added to sk_buffs for replies when the "fwmark_reflect" sysctl is enabled. But not for TW sockets that had sk->sk_mark set via setsockopt(SO_MARK..). Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. Then progate this so that the skb gets sent with the correct mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that netfilter rules are still honored. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOLJoe Perches1-18/+4
INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls. EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2 pr_debug calls and is also unnecessary as the exported string can be used directly by these calls. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10Merge tag 'mac80211-for-davem-2018-05-09' of ↵David S. Miller1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== We only have a few fixes this time: * WMM element validation * SAE timeout * add-BA timeout * docbook parsing * a few memory leaks in error paths ==================== Signed-off-by: David S. Miller <davem@davemloft.net>