summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-07-17netfilter: conntrack: remove l3proto abstractionFlorian Westphal19-1095/+645
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto abstraction. This gets rid of all l3proto indirect calls and the need to do a lookup on the function to call for l3 demux. It increases module size by only a small amount (12kbyte), so this reduces size because nf_conntrack.ko is useless without either nf_conntrack_ipv4 or nf_conntrack_ipv6 module. before: text data bss dec hex filename 7357 1088 0 8445 20fd nf_conntrack_ipv4.ko 7405 1084 4 8493 212d nf_conntrack_ipv6.ko 72614 13689 236 86539 1520b nf_conntrack.ko 19K nf_conntrack_ipv4.ko 19K nf_conntrack_ipv6.ko 179K nf_conntrack.ko after: text data bss dec hex filename 79277 13937 236 93450 16d0a nf_conntrack.ko 191K nf_conntrack.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_timeout() indirectionFlorian Westphal12-104/+94
Not needed, we can have the l4trackers fetch it themselvs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: avoid l4proto pkt_to_tuple callsFlorian Westphal5-78/+15
Handle common protocols (udp, tcp, ..), in the core and only do the call if needed by the l4proto tracker. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: avoid calls to l4proto invert_tupleFlorian Westphal8-64/+8
Handle the common cases (tcp, udp, etc). in the core and only do the indirect call for the protocols that need it (GRE for instance). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove get_l4proto indirection from l3 protocol trackersFlorian Westphal7-134/+94
Handle it in the core instead. ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this doesn't create an ipv6 dependency. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove invert_tuple indirection from l3 protocol trackersFlorian Westphal8-52/+18
Its simpler to just handle it directly in nf_ct_invert_tuple(). Also gets rid of need to pass l3proto pointer to resolve_conntrack(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove pkt_to_tuple indirection from l3 protocol trackersFlorian Westphal5-58/+33
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: conntrack: remove ctnetlink callbacks from l3 protocol trackersFlorian Westphal11-149/+79
handle everything from ctnetlink directly. After all these years we still only support ipv4 and ipv6, so it seems reasonable to remove l3 protocol tracker support and instead handle ipv4/ipv6 from a common, always builtin inet tracker. Step 1: Get rid of all the l3proto->func() calls. Start with ctnetlink, then move on to packet-path ones. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: Kconfig: Make NETFILTER_XT_MATCH_SOCKET select NF_SOCKET_IPV4/6Máté Eckl1-2/+2
Instead of depending on it. Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16openvswitch: use nf_ct_get_tuplepr, invert_tupleprFlorian Westphal3-23/+4
These versions deal with the l3proto/l4proto details internally. It removes only caller of nf_ct_get_tuple, so make it static. After this, l3proto->get_l4proto() can be removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: utils: move nf_ip6_checksum* from ipv6 to utilsFlorian Westphal3-78/+65
similar to previous change, this also allows to remove it from nf_ipv6_ops and avoid the indirection. It also removes the bogus dependency of nf_conntrack_ipv6 on ipv6 module: ipv6 checksum functions are built into kernel even if CONFIG_IPV6=m, but ipv6/netfilter.o isn't. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: utils: move nf_ip_checksum* from ipv4 to utilsFlorian Westphal3-64/+55
allows to make nf_ip_checksum_partial static, it no longer has an external caller. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: nft_tproxy: Move nf_tproxy_assign_sock() to nf_tproxy.hMáté Eckl2-9/+8
This function is also necessary to implement nft tproxy support Fixes: 45ca4e0cf273 ("netfilter: Libify xt_TPROXY") Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: flowtables: use fixed renew timeout on teardownFlorian Westphal1-8/+5
This is one of the very few external callers of ->get_timeouts(), We can use a fixed timeout instead, conntrack core will refresh this in case a new packet comes within this period. Use of ESTABLISHED timeout seems way too huge anyway. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16netfilter: nft_reject_bridge: remove unnecessary ttl setTaehee Yoo1-2/+1
In the nft_reject_br_send_v4_tcp_reset(), a ttl is set by the nf_reject_iphdr_put(). so, below code is unnecessary. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-16Merge branch 'TLS-offload-rx-netdev-and-mlx5'David S. Miller25-191/+846
Boris Pismenny says: ==================== TLS offload rx, netdev & mlx5 The following series provides TLS RX inline crypto offload. v5->v4: - Remove the Kconfig to mutually exclude both IPsec and TLS v4->v3: - Remove the iov revert for zero copy send flow v2->v3: - Fix typo - Adjust cover letter - Fix bug in zero copy flows - Use network byte order for the record number in resync - Adjust the sequence provided in resync v1->v2: - Fix bisectability problems due to variable name changes - Fix potential uninitialized return value This series completes the generic infrastructure to offload TLS crypto to a network devices. It enables the kernel TLS socket to skip decryption and authentication operations for SKBs marked as decrypted on the receive side of the data path. Leaving those computationally expensive operations to the NIC. This infrastructure doesn't require a TCP offload engine. Instead, the NIC decrypts a packet's payload if the packet contains the expected TCP sequence number. The TLS record authentication tag remains unmodified regardless of decryption. If the packet is decrypted successfully and it contains an authentication tag, then the authentication check has passed. Otherwise, if the authentication fails, then the packet is provided unmodified and the KTLS layer is responsible for handling it. Out-Of-Order TCP packets are provided unmodified. As a result, in the slow path some of the SKBs are decrypted while others remain as ciphertext. The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs. At the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be reencrypted, only to be decrypted. The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows: 1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering. 2. Resynchronization - tls_read_size calls the device driver to resynchronize HW whenever it lost track of the TLS record framing in the TCP stream. The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC identifies packets that should be offloaded according to the 5-tuple and the TCP sequence number. If these match and the packet is decrypted and authenticated successfully, then a syndrome is provided to software. Otherwise, the packet is unmodified. Decrypted and non-decrypted packets aren't coalesced by the network stack, and the KTLS layer decrypts and authenticates partially decrypted records. The NIC provides an indication whenever a resync is required. The resync operation is triggered by the KTLS layer while parsing TLS record headers. Finally, we measure the performance obtained by running single stream iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound) and KTLS-Offload running both in Tx and Rx. The results show that the performance of offload is comparable to TCP. | Bandwidth (Gbps) | CPU Tx (%) | CPU rx (%) TCP | 28.8 | 5 | 12 KTLS-Offload-Tx-Rx | 28.6 | 7 | 14 Paper: https://netdevconf.org/2.2/papers/pismenny-tlscrypto-talk.pdf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: IPsec, fix byte count in CQEBoris Pismenny3-2/+3
This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5: Accel, add common metadata functionsBoris Pismenny3-29/+45
This patch adds common functions to handle mellanox metadata headers. These functions are used by IPsec and TLS to process FPGA metadata. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: TLS, build TLS netdev from capabilitiesBoris Pismenny1-2/+16
This patch enables TLS Rx based on available HW capabilities. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: TLS, add software statisticsBoris Pismenny3-1/+17
This patch adds software statistics for TLS to count important events. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: TLS, add Innova TLS rx data pathBoris Pismenny3-3/+118
Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. Special metadata ethertype is used to pass information to the hardware. When hardware loses synchronization a special resync request metadata message is used to request resync. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: TLS, add innova rx supportBoris Pismenny2-15/+46
Add the mlx5 implementation of the TLS Rx routines to add/del TLS contexts, also add the tls_dev_resync_rx routine to work with the TLS inline Rx crypto offload infrastructure. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5: Accel, add TLS rx offload routinesBoris Pismenny5-46/+135
In Innova TLS, TLS contexts are added or deleted via a command message over the SBU connection. The HW then sends a response message over the same connection. Complete the implementation for Innova TLS (FPGA-based) hardware by adding support for rx inline crypto offload. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net/mlx5e: TLS, refactor variable namesBoris Pismenny3-8/+8
For symmetry, we rename mlx5e_tls_offload_context to mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Fix zerocopy_from_iter iov handlingBoris Pismenny1-3/+5
zerocopy_from_iter iterates over the message, but it doesn't revert the updates made by the iov iteration. This patch fixes it. Now, the iov can be used after calling zerocopy_from_iter. Fixes: 3c4d75591 ("tls: kernel TLS support") Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Add rx inline crypto offloadBoris Pismenny5-43/+355
This patch completes the generic infrastructure to offload TLS crypto to a network device. It enables the kernel to skip decryption and authentication of some skbs marked as decrypted by the NIC. In the fast path, all packets received are decrypted by the NIC and the performance is comparable to plain TCP. This infrastructure doesn't require a TCP offload engine. Instead, the NIC only decrypts packets that contain the expected TCP sequence number. Out-Of-Order TCP packets are provided unmodified. As a result, at the worst case a received TLS record consists of both plaintext and ciphertext packets. These partially decrypted records must be reencrypted, only to be decrypted. The notable differences between SW KTLS Rx and this offload are as follows: 1. Partial decryption - Software must handle the case of a TLS record that was only partially decrypted by HW. This can happen due to packet reordering. 2. Resynchronization - tls_read_size calls the device driver to resynchronize HW after HW lost track of TLS record framing in the TCP stream. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Fill software context without allocationBoris Pismenny1-12/+22
This patch allows tls_set_sw_offload to fill the context in case it was already allocated previously. We will use it in TLS_DEVICE to fill the RX software context. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Split tls_sw_release_resources_rxBoris Pismenny2-1/+10
This patch splits tls_sw_release_resources_rx into two functions one which releases all inner software tls structures and another that also frees the containing structure. In TLS_DEVICE we will need to release the software structures without freeeing the containing structure, which contains other information. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Split decrypt_skb to two functionsBoris Pismenny2-18/+28
Previously, decrypt_skb also updated the TLS context. Now, decrypt_skb only decrypts the payload using the current context, while decrypt_skb_update also updates the state. Later, in the tls_device Rx flow, we will use decrypt_skb directly. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tls: Refactor tls_offload variable namesBoris Pismenny4-28/+27
For symmetry, we rename tls_offload_context to tls_offload_context_tx before we add tls_offload_context_rx. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16tcp: Don't coalesce decrypted and encrypted SKBsBoris Pismenny2-0/+15
Prevent coalescing of decrypted and encrypted SKBs in GRO and TCP layer. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add TLS rx resync NDOBoris Pismenny1-0/+2
Add new netdev tls op for resynchronizing HW tls context Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add TLS RX offload featureIlya Lesokhin2-0/+3
This patch adds a netdev feature to configure TLS RX inline crypto offload. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: Add decrypted field to skbBoris Pismenny2-1/+12
The decrypted bit is propogated to cloned/copied skbs. This will be used later by the inline crypto receive side offload of tls. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16Merge branch 'mvpp2-add-debugfs-interface'David S. Miller8-36/+830
Maxime Chevallier says: ==================== net: mvpp2: add debugfs interface The PPv2 Header Parser and Classifier are not straightforward to debug, having easy access to some of the many lookup tables configuration is helpful during development and debug. This series adds a basic debugfs interface, allowing to read data from the Header Parser and some of the Classifier tables. For now, the interface is read-only, and contains only some basic info. This was actually used during RSS development, and might be useful to troubleshoot some issues we might find. The first patch of the series converts the mvpp2 files to SPDX, which eases adding the new debugfs dedicated file. The second patch adds the interface, and exposes basic Header Parser data. The 3rd patch adds a hit counter for the Header Parser TCAM. The 4th patch exposes classifier info. The 5th patch adds some hit counters for some of the classifier engines. Changes since V1: - Rebased on the lastest net-next - Made cls_flow_get non static so that it can be used in mvpp2_debugfs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: mvpp2: debugfs: add classifier hit countersMaxime Chevallier4-0/+84
The classification operations that are used for RSS make use of several lookup tables. Having hit counters for these tables is really helpful to determine what flows were matched by ingress traffic, and see the path of packets among all the classifier tables. This commit adds hit counters for the 3 tables used at the moment : - The decoding table (also called lookup_id table), that links flows identified by the Header Parser to the flow table. There's one entry per flow, located at : .../mvpp2/<controller>/flows/XX/dec_hits Note that there are 21 flows in the decoding table, whereas there are 52 flows in the Header Parser. That's because there are several kind of traffic that will match a given flow. Reading the hit counter from one sub-flow will clear all hit counter that have the same flow_id. This also applies to the flow_hits. - The flow table, that contains all the different lookups to be performed by the classifier for each packet of a given flow. The match is done on the first entry of the flow sequence. - The C2 engine entries, that are used to assign the default rx queue, and enable or disable RSS for a given port. There's one entry per flow, located at: .../mvpp2/<controller>/flows/XX/flow_hits There is one C2 entry per port, so the c2 hit counter is located at : .../mvpp2/<controller>/ethX/c2_hits All hit counter values are 16-bits clear-on-read values. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: mvpp2: debugfs: add entries for classifier flowsMaxime Chevallier4-5/+329
The classifier configuration for RSS is quite complex, with several lookup tables being used. This commit adds useful info in debugfs to see how the different tables are configured : Added 2 new entries in the per-port directory : - .../eth0/default_rxq : The default rx queue on that port - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry Added the 'flows' directory : It contains one entry per sub-flow. a 'sub-flow' is a unique path from Header Parser to the flow table. Multiple sub-flows can point to the same 'flow' (each flow has an id from 8 to 29, which is its index in the Lookup Id table) : - .../flows/00/... /01/... ... /51/id : The flow id. There are 21 unique flows. There's one flow per combination of the following parameters : - L4 protocol (TCP, UDP, none) - L3 protocol (IPv4, IPv6) - L3 parameters (Fragmented or not) - L2 parameters (Vlan tag presence or not) .../type : The flow type. This is an even higher level flow, that we manipulate with ethtool. It can be : "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other". .../eth0/... .../eth1/engine : The hash generation engine used for this flow on the given port .../hash_opts : The hash generation options indicating on what data we base the hash (vlan tag, src IP, src port, etc.) Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: mvpp2: debugfs: add hit counter stats for Header Parser entriesMaxime Chevallier4-0/+40
One helpful feature to help debug the Header Parser TCAM filter in PPv2 is to be able to see if the entries did match something when a packet comes in. This can be done by using the built-in hit counter for TCAM entries. This commit implements reading the counter, and exposing its value on debugfs for each filter entry. The counter is a 16-bits clear-on-read value, located at: .../mvpp2/<controller>/parser/XXX/hits Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: mvpp2: add a debugfs interface for the Header ParserMaxime Chevallier6-7/+371
Marvell PPv2 Packer Header Parser has a TCAM based filter, that is not trivial to configure and debug. Being able to dump TCAM entries from userspace can be really helpful to help development of new features and debug existing ones. This commit adds a basic debugfs interface for the PPv2 driver, focusing on TCAM related features. <mnt>/mvpp2/ --- f2000000.ethernet \- f4000000.ethernet --- parser --- 000 ... | \- 001 | \- ... | \- 255 --- ai | \- header_data | \- lookup_id | \- sram | \- valid \- eth1 ... \- eth2 --- mac_filter \- parser_entries \- vid_filter There's one directory per PPv2 instance, named after pdev->name to make sure names are uniques. In each of these directories, there's : - one directory per interface on the controller, each containing : - "mac_filter", which lists all filtered addresses for this port (based on TCAM, not on the kernel's uc / mc lists) - "parser_entries", which lists the indices of all valid TCAM entries that have this port in their port map - "vid_filter", which lists the vids allowed on this port, based on TCAM - one "parser" directory (the parser is common to all ports), containing : - one directory per TCAM entry (256 of them, from 0 to 255), each containing : - "ai" : Contains the 1 byte Additional Info field from TCAM, and - "header_data" : Contains the 8 bytes Header Data extracted from the packet - "lookup_id" : Contains the 4 bits LU_ID - "sram" : contains the raw SRAM data, which is the result of the TCAM lookup. This readonly at the moment. - "valid" : Indicates if the entry is valid of not. All entries are read-only, and everything is output in hex form. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16net: mvpp2: switch to SPDX identifiersAntoine Tenart6-24/+6
Use the appropriate SPDX license identifiers and drop the license text. This patch is only cosmetic. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller67-929/+3119
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-15 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Various different arm32 JIT improvements in order to optimize code emission and make the JIT code itself more robust, from Russell. 2) Support simultaneous driver and offloaded XDP in order to allow for advanced use-cases where some work is offloaded to the NIC and some to the host. Also add ability for bpftool to load programs and maps beyond just the cgroup case, from Jakub. 3) Add BPF JIT support in nfp for multiplication as well as division. For the latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong. 4) Add BTF pretty print functionality to bpftool in plain and JSON output format, from Okash. 5) Add build and installation to the BPF helper man page into bpftool, from Quentin. 6) Add a TCP BPF callback for listening sockets which is triggered right after the socket transitions to TCP_LISTEN state, from Andrey. 7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup tree and prints all attached programs, from Roman. 8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged packets, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-15Merge branch 'bpf-tcp-listen-cb'Daniel Borkmann9-69/+88
Andrey Ignatov says: ==================== This patchset adds TCP-BPF callback for listening sockets. Patch 0001 provides more details and is the main patch in the set. Patch 0006 adds selftest for the new callback. Other patches are bug fixes and improvements in TCP-BPF selftest to make it easier to extend in 0006. ==================== Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CBAndrey Ignatov3-6/+16
Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB callback can be called on future state transition, and when such a transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and verify it in user space later. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15selftests/bpf: Better verification in test_tcpbpfAndrey Ignatov1-25/+39
Reduce amount of copy/paste for debug info when result is verified in the test and keep that info together with values being checked so that they won't get out of sync. It also improves debug experience: instead of checking manually what doesn't match in debug output for all fields, only unexpected field is printed. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15selftests/bpf: Switch test_tcpbpf_user to cgroup_helpersAndrey Ignatov2-34/+22
Switch to cgroup_helpers to simplify the code and fix cgroup cleanup: before cgroup was not cleaned up after the test. It also removes SYSTEM macro, that only printed error, but didn't terminate the test. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15selftests/bpf: Fix const'ness in cgroup_helpersAndrey Ignatov2-6/+6
Lack of const in cgroup helpers signatures forces to write ugly client code. Fix it. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15bpf: Sync bpf.h to tools/Andrey Ignatov1-0/+3
Sync BPF_SOCK_OPS_TCP_LISTEN_CB related UAPI changes to tools/. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-15bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CBAndrey Ignatov2-0/+4
Add new TCP-BPF callback that is called on listen(2) right after socket transition to TCP_LISTEN state. It fills the gap for listening sockets in TCP-BPF. For example BPF program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening and track later transition from TCP_LISTEN to TCP_CLOSE with BPF_SOCK_OPS_STATE_CB callback. Before there was no way to do it with TCP-BPF and other options were much harder to work with. E.g. socket state tracking can be done with tracepoints (either raw or regular) but they can't be attached to cgroup and their lifetime has to be managed separately. Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-14Merge branch 'mlxsw-VRRP'David S. Miller5-4/+292
Ido Schimmel says: ==================== mlxsw: Add VRRP support When a router that is acting as the default gateway of a host stops functioning, the host will encounter packet loss until the router starts functioning again. To increase the reliability of the default gateway without performing reconfiguration on the host, a host can use a Virtual Router Redundancy Protocol (VRRP) Router. This virtual router is composed from several routers where only one is actually forwarding packets from the host (the master router) while the other routers act as backup routers. The election of the master router is determined by the VRRP protocol [1]. Packets addressed to the virtual router are always sent to the virtual router MAC address (IPv4: 00-00-5E-00-01-XX, IPv6: 00-00-5E-00-02-XX). Such packets can only be accepted by the master router and must be discarded by the backup routers. In Linux, VRRP is usually implemented by configuring a macvlan with the virtual router MAC on top of the router interface that is connected to the host / LAN. The macvlan on the master router is assigned the virtual IP (VIP) that the host uses as its gateway. In order to support VRRP in mlxsw, we first need to enable macvlan upper devices on top of mlxsw netdevs and their uppers. This is done by the first patch, which also takes care of sanitizing macvlan configurations that are not currently supported by the driver. The second patch directs packets with destination MAC addresses as the macvlans to the router so that they will undergo an L3 lookup. This is consistent with the kernel's behavior where the macvlan's Rx handler will re-inject such packets to the Rx path so that they will be picked up by the IPvX protocol handlers and undergo an L3 lookup. Note that the driver prevents the macvlans from being enslaved to other devices, to ensure the packets will be picked up by the protocol handler and not by another Rx handler. The third patch adds packet traps for VRRP control packets for both IPv4 and IPv6. Finally, the last patch optimizes the reception of VRRP MACs by potentially skipping one L2 lookup for them. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-14mlxsw: spectrum_router: Optimize processing of VRRP MACsIdo Schimmel2-0/+66
Hosts using a VRRP router send their packets with a destination MAC of the VRRP router which is of the following form [1]: IPv4 - 00-00-5E-00-01-{VRID} IPv6 - 00-00-5E-00-02-{VRID} Where VRID is the ID of the virtual router. Such packets are directed to the router block in the ASIC by an FDB entry that was added in the previous patch. However, in certain cases it is possible to skip this FDB lookup and send such packets directly to the router. This is accomplished by adding these special MAC addresses to the RIF cache. If the cache is hit, the packet will skip the L2 lookup and ingress the router with the RIF specified in the cache entry. 1. https://tools.ietf.org/html/rfc5798#section-7.3 Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>