From 350601b4f7ab45a3ef39575acc21d6b7a69f724b Mon Sep 17 00:00:00 2001 From: Murali Karicheri Date: Tue, 17 Apr 2018 17:30:30 -0400 Subject: soc: ti: K2G: enhancement to support QMSS in K2G NAVSS The Navigator Subsystem (NAVSS) available on the K2G SoC has a cut-down version of QMSS, with fewer queues, an internal linking RAM with fewer buffers, and so on. It doesn't have the status and explicit push register spaces present in the QMSS on other K2 SoCs. So define reg indices specific to QMSS on K2G. This patch introduces the "ti,66ak2g-navss-qm" compatible string to identify the QMSS on K2G NAVSS and to customize the dts handling code. Per the device manual, descriptors with an index less than or equal to regions0_size are in region 0 in the case of K2 QMSS, whereas for QMSS on K2G, descriptors with an index less than regions0_size are in region 0. So update the size accordingly in the regions0_size bits of the linking ram size 0 register. Signed-off-by: Murali Karicheri Signed-off-by: WingMan Kwok Reviewed-by: Rob Herring Signed-off-by: David S. Miller --- .../devicetree/bindings/soc/ti/keystone-navigator-qmss.txt | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt b/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt index 77cd42cc5f54..b025770eeb92 100644 --- a/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt +++ b/Documentation/devicetree/bindings/soc/ti/keystone-navigator-qmss.txt @@ -17,7 +17,8 @@ pool management. Required properties: -- compatible : Must be "ti,keystone-navigator-qmss"; +- compatible : Must be "ti,keystone-navigator-qmss". + : Must be "ti,66ak2g-navss-qm" for QMSS on K2G SoC. - clocks : phandle to the reference clock for this device. - queue-range : total range of queue numbers for the device. - linkram0 :
for internal link ram, where size is the total @@ -39,6 +40,12 @@ Required properties: - Descriptor memory setup region. - Queue Management/Queue Proxy region for queue Push. - Queue Management/Queue Proxy region for queue Pop. + +For QMSS on K2G SoC, following QM reg indexes are used in that order + - Queue Peek region. + - Queue configuration region. + - Queue Management/Queue Proxy region for queue Push/Pop. + - queue-pools : child node classifying the queue ranges into pools. Queue ranges are grouped into 3 type of pools: - qpend : pool of qpend(interruptible) queues -- cgit v1.2.3 From ae316c4cbba2ee8f92bc3c5b040b275371ea052c Mon Sep 17 00:00:00 2001 From: Govind Singh Date: Tue, 10 Apr 2018 18:01:35 +0300 Subject: dt: bindings: add bindings for wcn3990 wifi block Add device tree binding documentation details for wcn3990 wifi block present in Qualcomm SDM845/APQ8098 SoC into "qcom,ath10k.txt". Signed-off-by: Govind Singh Reviewed-by: Rob Herring Signed-off-by: Kalle Valo --- .../bindings/net/wireless/qcom,ath10k.txt | 31 ++++++++++++++++++++++ 1 file changed, 31 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt index 3d2a031217da..7fd4e8ce4149 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of the following: * "qcom,ath10k" * "qcom,ipq4019-wifi" + * "qcom,wcn3990-wifi" PCI based devices uses compatible string "qcom,ath10k" and takes calibration data along with board specific data via "qcom,ath10k-calibration-data". @@ -18,8 +19,12 @@ In general, entry "qcom,ath10k-pre-calibration-data" and "qcom,ath10k-calibration-data" conflict with each other and only one can be provided per device. +SNOC based devices (i.e. wcn3990) uses compatible string "qcom,wcn3990-wifi". + Optional properties: - reg: Address and length of the register set for the device. +- reg-names: Must include the list of following reg names, + "membase" - resets: Must contain an entry for each entry in reset-names. See ../reset/reseti.txt for details. - reset-names: Must include the list of following reset names, @@ -49,6 +54,8 @@ Optional properties: hw versions. - qcom,ath10k-pre-calibration-data : pre calibration data as an array, the length can vary between hw versions. +- <supply-name>-supply: handle to the regulator device tree node + optional "supply-name" is "vdd-0.8-cx-mx". Example (to supply the calibration data alone): @@ -119,3 +126,27 @@ wifi0: wifi@a000000 { qcom,msi_base = <0x40>; qcom,ath10k-pre-calibration-data = [ 01 02 03 ...
]; }; + +Example (to supply wcn3990 SoC wifi block details): + +wifi@18000000 { + compatible = "qcom,wcn3990-wifi"; + reg = <0x18800000 0x800000>; + reg-names = "membase"; + clocks = <&clock_gcc clk_aggre2_noc_clk>; + clock-names = "smmu_aggre2_noc_clk" + interrupts = + <0 130 0 /* CE0 */ >, + <0 131 0 /* CE1 */ >, + <0 132 0 /* CE2 */ >, + <0 133 0 /* CE3 */ >, + <0 134 0 /* CE4 */ >, + <0 135 0 /* CE5 */ >, + <0 136 0 /* CE6 */ >, + <0 137 0 /* CE7 */ >, + <0 138 0 /* CE8 */ >, + <0 139 0 /* CE9 */ >, + <0 140 0 /* CE10 */ >, + <0 141 0 /* CE11 */ >; + vdd-0.8-cx-mx-supply = <&pm8998_l5>; +}; -- cgit v1.2.3 From 6b9227d666f2efe0f8ed234827bb1abdf63f9501 Mon Sep 17 00:00:00 2001 From: Kunihiko Hayashi Date: Thu, 19 Apr 2018 16:24:53 +0900 Subject: net: ethernet: ave: add multiple clocks and resets support as required property When the link is becoming up for Pro4 SoC, the kernel is stalled due to some missing clocks and resets. The AVE block for Pro4 is connected to the GIO bus in the SoC. Without its clock/reset, the access to the AVE register makes the system stall. In the same way, another MAC clock for Giga-bit Connection and the PHY clock are also required for Pro4 to activate the Giga-bit feature and to recognize the PHY. To satisfy these requirements, this patch adds support for multiple clocks and resets, and adds the clock-names and reset-names to the binding because we need to distinguish clock/reset for the AVE main block and the others. Also, make the resets a required property. Currently, "reset is optional" relies on that the bootloader or firmware has deasserted the reset before booting the kernel. Drivers should work without such expectation. Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver") Suggested-by: Masahiro Yamada Signed-off-by: Kunihiko Hayashi Reviewed-by: Rob Herring Signed-off-by: David S. Miller --- .../bindings/net/socionext,uniphier-ave4.txt | 13 ++- drivers/net/ethernet/socionext/sni_ave.c | 108 ++++++++++++++++----- 2 files changed, 96 insertions(+), 25 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt index 96398cc2982f..85e0c49548ed 100644 --- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt +++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt @@ -17,9 +17,18 @@ Required properties: - phy-handle: Should point to the external phy device. See ethernet.txt file in the same directory. - clocks: A phandle to the clock for the MAC. + For Pro4 SoC, that is "socionext,uniphier-pro4-ave4", + another MAC clock, GIO bus clock and PHY clock are also required. + - clock-names: Should contain + - "ether", "ether-gb", "gio", "ether-phy" for Pro4 SoC + - "ether" for others + - resets: A phandle to the reset control for the MAC. For Pro4 SoC, + GIO bus reset is also required. + - reset-names: Should contain + - "ether", "gio" for Pro4 SoC + - "ether" for others Optional properties: - - resets: A phandle to the reset control for the MAC. - local-mac-address: See ethernet.txt in the same directory. 
Required subnode: @@ -34,7 +43,9 @@ Example: interrupts = <0 66 4>; phy-mode = "rgmii"; phy-handle = <ðphy>; + clock-names = "ether"; clocks = <&sys_clk 6>; + reset-names = "ether"; resets = <&sys_rst 6>; local-mac-address = [00 00 00 00 00 00]; diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c index 0b3b7a460641..52940bdd4ad3 100644 --- a/drivers/net/ethernet/socionext/sni_ave.c +++ b/drivers/net/ethernet/socionext/sni_ave.c @@ -199,6 +199,9 @@ #define IS_DESC_64BIT(p) ((p)->data->is_desc_64bit) +#define AVE_MAX_CLKS 4 +#define AVE_MAX_RSTS 2 + enum desc_id { AVE_DESCID_RX, AVE_DESCID_TX, @@ -227,6 +230,8 @@ struct ave_desc_info { struct ave_soc_data { bool is_desc_64bit; + const char *clock_names[AVE_MAX_CLKS]; + const char *reset_names[AVE_MAX_RSTS]; }; struct ave_stats { @@ -245,8 +250,10 @@ struct ave_private { int phy_id; unsigned int desc_size; u32 msg_enable; - struct clk *clk; - struct reset_control *rst; + int nclks; + struct clk *clk[AVE_MAX_CLKS]; + int nrsts; + struct reset_control *rst[AVE_MAX_RSTS]; phy_interface_t phy_mode; struct phy_device *phydev; struct mii_bus *mdio; @@ -1153,18 +1160,23 @@ static int ave_init(struct net_device *ndev) struct device_node *np = dev->of_node; struct device_node *mdio_np; struct phy_device *phydev; - int ret; + int nc, nr, ret; /* enable clk because of hw access until ndo_open */ - ret = clk_prepare_enable(priv->clk); - if (ret) { - dev_err(dev, "can't enable clock\n"); - return ret; + for (nc = 0; nc < priv->nclks; nc++) { + ret = clk_prepare_enable(priv->clk[nc]); + if (ret) { + dev_err(dev, "can't enable clock\n"); + goto out_clk_disable; + } } - ret = reset_control_deassert(priv->rst); - if (ret) { - dev_err(dev, "can't deassert reset\n"); - goto out_clk_disable; + + for (nr = 0; nr < priv->nrsts; nr++) { + ret = reset_control_deassert(priv->rst[nr]); + if (ret) { + dev_err(dev, "can't deassert reset\n"); + goto out_reset_assert; + } } ave_global_reset(ndev); @@ -1207,9 +1219,11 @@ static int ave_init(struct net_device *ndev) out_mdio_unregister: mdiobus_unregister(priv->mdio); out_reset_assert: - reset_control_assert(priv->rst); + while (--nr >= 0) + reset_control_assert(priv->rst[nr]); out_clk_disable: - clk_disable_unprepare(priv->clk); + while (--nc >= 0) + clk_disable_unprepare(priv->clk[nc]); return ret; } @@ -1217,13 +1231,16 @@ out_clk_disable: static void ave_uninit(struct net_device *ndev) { struct ave_private *priv = netdev_priv(ndev); + int i; phy_disconnect(priv->phydev); mdiobus_unregister(priv->mdio); /* disable clk because of hw access after ndo_stop */ - reset_control_assert(priv->rst); - clk_disable_unprepare(priv->clk); + for (i = 0; i < priv->nrsts; i++) + reset_control_assert(priv->rst[i]); + for (i = 0; i < priv->nclks; i++) + clk_disable_unprepare(priv->clk[i]); } static int ave_open(struct net_device *ndev) @@ -1527,8 +1544,9 @@ static int ave_probe(struct platform_device *pdev) struct resource *res; const void *mac_addr; void __iomem *base; + const char *name; + int i, irq, ret; u64 dma_mask; - int irq, ret; u32 ave_id; data = of_device_get_match_data(dev); @@ -1614,16 +1632,28 @@ static int ave_probe(struct platform_device *pdev) u64_stats_init(&priv->stats_tx.syncp); u64_stats_init(&priv->stats_rx.syncp); - priv->clk = devm_clk_get(dev, NULL); - if (IS_ERR(priv->clk)) { - ret = PTR_ERR(priv->clk); - goto out_free_netdev; + for (i = 0; i < AVE_MAX_CLKS; i++) { + name = priv->data->clock_names[i]; + if (!name) + break; + priv->clk[i] = devm_clk_get(dev, name); + if 
(IS_ERR(priv->clk[i])) { + ret = PTR_ERR(priv->clk[i]); + goto out_free_netdev; + } + priv->nclks++; } - priv->rst = devm_reset_control_get_optional_shared(dev, NULL); - if (IS_ERR(priv->rst)) { - ret = PTR_ERR(priv->rst); - goto out_free_netdev; + for (i = 0; i < AVE_MAX_RSTS; i++) { + name = priv->data->reset_names[i]; + if (!name) + break; + priv->rst[i] = devm_reset_control_get_shared(dev, name); + if (IS_ERR(priv->rst[i])) { + ret = PTR_ERR(priv->rst[i]); + goto out_free_netdev; + } + priv->nrsts++; } priv->mdio = devm_mdiobus_alloc(dev); @@ -1687,22 +1717,52 @@ static int ave_remove(struct platform_device *pdev) static const struct ave_soc_data ave_pro4_data = { .is_desc_64bit = false, + .clock_names = { + "gio", "ether", "ether-gb", "ether-phy", + }, + .reset_names = { + "gio", "ether", + }, }; static const struct ave_soc_data ave_pxs2_data = { .is_desc_64bit = false, + .clock_names = { + "ether", + }, + .reset_names = { + "ether", + }, }; static const struct ave_soc_data ave_ld11_data = { .is_desc_64bit = false, + .clock_names = { + "ether", + }, + .reset_names = { + "ether", + }, }; static const struct ave_soc_data ave_ld20_data = { .is_desc_64bit = true, + .clock_names = { + "ether", + }, + .reset_names = { + "ether", + }, }; static const struct ave_soc_data ave_pxs3_data = { .is_desc_64bit = false, + .clock_names = { + "ether", + }, + .reset_names = { + "ether", + }, }; static const struct of_device_id of_ave_match[] = { -- cgit v1.2.3 From 74734306c20393910eae3379322c622e57351d98 Mon Sep 17 00:00:00 2001 From: Kunihiko Hayashi Date: Thu, 19 Apr 2018 16:24:54 +0900 Subject: dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting Add "socionext,syscon-phy-mode" property to specify system controller that configures the settings about phy-mode. Signed-off-by: Kunihiko Hayashi Reviewed-by: Rob Herring Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt index 85e0c49548ed..fc8f01718690 100644 --- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt +++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt @@ -13,7 +13,8 @@ Required properties: - reg: Address where registers are mapped and size of region. - interrupts: Should contain the MAC interrupt. - phy-mode: See ethernet.txt in the same directory. Allow to choose - "rgmii", "rmii", or "mii" according to the PHY. + "rgmii", "rmii", "mii", or "internal" according to the PHY. + The acceptable mode is SoC-dependent. - phy-handle: Should point to the external phy device. See ethernet.txt file in the same directory. - clocks: A phandle to the clock for the MAC. @@ -27,6 +28,8 @@ Required properties: - reset-names: Should contain - "ether", "gio" for Pro4 SoC - "ether" for others + - socionext,syscon-phy-mode: A phandle to syscon with one argument + that configures phy mode. The argument is the ID of MAC instance. Optional properties: - local-mac-address: See ethernet.txt in the same directory. 
@@ -47,6 +50,7 @@ Example: clocks = <&sys_clk 6>; reset-names = "ether"; resets = <&sys_rst 6>; + socionext,syscon-phy-mode = <&soc_glue 0>; local-mac-address = [00 00 00 00 00 00]; mdio { -- cgit v1.2.3 From 01d26589dee4b23376642fba333539605c52d324 Mon Sep 17 00:00:00 2001 From: Phil Elwell Date: Thu, 19 Apr 2018 17:59:40 +0100 Subject: dt-bindings: Document the DT bindings for lan78xx The Microchip LAN78XX family of devices are Ethernet controllers with a USB interface. Despite being discoverable devices it can be useful to be able to configure them from Device Tree, particularly in low-cost applications without an EEPROM or programmed OTP. Document the supported properties in a bindings file. Signed-off-by: Phil Elwell Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- .../devicetree/bindings/net/microchip,lan78xx.txt | 54 ++++++++++++++++++++++ MAINTAINERS | 1 + 2 files changed, 55 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt new file mode 100644 index 000000000000..76786a0f6d3d --- /dev/null +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt @@ -0,0 +1,54 @@ +Microchip LAN78xx Gigabit Ethernet controller + +The LAN78XX devices are usually configured by programming their OTP or with +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither. +The Device Tree properties, if present, override the OTP and EEPROM. + +Required properties: +- compatible: Should be one of "usb424,7800", "usb424,7801" or "usb424,7850". + +Optional properties: +- local-mac-address: see ethernet.txt +- mac-address: see ethernet.txt + +Optional properties of the embedded PHY: +- microchip,led-modes: a 0..4 element vector, with each element configuring + the operating mode of an LED. Omitted LEDs are turned off. Allowed values + are defined in "include/dt-bindings/net/microchip-lan78xx.h". + +Example: + +/* Based on the configuration for a Raspberry Pi 3 B+ */ +&usb { + usb-port@1 { + compatible = "usb424,2514"; + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + usb-port@1 { + compatible = "usb424,2514"; + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + ethernet: ethernet@1 { + compatible = "usb424,7800"; + reg = <1>; + local-mac-address = [ 00 11 22 33 44 55 ]; + + mdio { + #address-cells = <0x1>; + #size-cells = <0x0>; + eth_phy: ethernet-phy@1 { + reg = <1>; + microchip,led-modes = < + LAN78XX_LINK_1000_ACTIVITY + LAN78XX_LINK_10_100_ACTIVITY + >; + }; + }; + }; + }; + }; +}; diff --git a/MAINTAINERS b/MAINTAINERS index c952d3076a65..81465707d8a8 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14571,6 +14571,7 @@ M: Woojung Huh M: Microchip Linux Driver Support L: netdev@vger.kernel.org S: Maintained +F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt F: drivers/net/usb/lan78xx.* F: include/dt-bindings/net/microchip-lan78xx.h -- cgit v1.2.3 From 660de409e25ba5bf34e8fdf2e13b69f1a7bd7213 Mon Sep 17 00:00:00 2001 From: Chris Novakovic Date: Tue, 24 Apr 2018 03:56:32 +0100 Subject: ipconfig: Document setting of NIS domain name ic_do_bootp_ext() is responsible for parsing the "ip=" and "nfsaddrs=" kernel parameters. If a "." character is found in parameter 4 (the client's hostname), everything before the first "." 
is used as the hostname, and everything after it is used as the NIS domain name (but not necessarily the DNS domain name). Document this behaviour in Documentation/filesystems/nfs/nfsroot.txt, as it is not made explicit. Signed-off-by: Chris Novakovic Signed-off-by: David S. Miller --- Documentation/filesystems/nfs/nfsroot.txt | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 5efae00f6c7f..1513e5d663fd 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -123,10 +123,13 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: Default: Determined using autoconfiguration. -<hostname> Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - If specified and DHCP is used, the user provided hostname will - be carried in the DHCP request to hopefully update DNS record. +<hostname> Name of the client. If a '.' character is present, anything + before the first '.' is used as the client's hostname, and anything + after it is used as its NIS domain name. May be supplied by + autoconfiguration, but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user-provided hostname (and NIS + domain name, if present) will be carried in the DHCP request; this + may cause a DNS record to be created or updated for the client. Default: Client IP address is used in ASCII notation. -- cgit v1.2.3 From 8b0b37c5644e1c0e0feac5bbf673337cefa3efb2 Mon Sep 17 00:00:00 2001 From: Chris Novakovic Date: Tue, 24 Apr 2018 03:56:36 +0100 Subject: ipconfig: Document /proc/net/pnp Fully document the format used by the /proc/net/pnp file written by ipconfig, explain where its values originate from, and clarify that the tertiary name server IP and DNS domain name are only written to the file when autoconfiguration is used. Signed-off-by: Chris Novakovic Signed-off-by: David S. Miller --- Documentation/filesystems/nfs/nfsroot.txt | 34 ++++++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 1513e5d663fd..a1030bea60d3 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -110,6 +110,9 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: will not be triggered if it is missing and NFS root is not in operation. + Value is exported to /proc/net/pnp with the prefix "bootserver " + (see below). + Default: Determined using autoconfiguration. The address of the autoconfiguration server is used. @@ -165,12 +168,33 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: Default: any -<dns0-ip> IP address of first nameserver. - Value gets exported by /proc/net/pnp which is often linked - on embedded systems by /etc/resolv.conf. +<dns0-ip> IP address of primary nameserver. + Value is exported to /proc/net/pnp with the prefix "nameserver " + (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + +<dns1-ip> IP address of secondary nameserver. + See <dns0-ip>. + + After configuration (whether manual or automatic) is complete, a file is + created at /proc/net/pnp in the following format; lines are omitted if + their respective value is empty following configuration.
+ + #PROTO: <DHCP|BOOTP|RARP> (depending on configuration method) + domain <dns-domain> (if autoconfigured, the DNS domain) + nameserver <dns0-ip> (primary name server IP) + nameserver <dns1-ip> (secondary name server IP) + nameserver <dns2-ip> (tertiary name server IP) + bootserver <server-ip> (NFS server IP) + + <dns-domain> and <dns2-ip> are requested during autoconfiguration; they + cannot be specified as part of the "ip=" kernel command line parameter. -<dns1-ip> IP address of second nameserver. - Same as above. + Because the "domain" and "nameserver" options are recognised by DNS + resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems + that use an NFS root filesystem. nfsrootdebug -- cgit v1.2.3 From c04d2cb2009f87cba7431c4ed3d85a602f71658e Mon Sep 17 00:00:00 2001 From: Chris Novakovic Date: Tue, 24 Apr 2018 03:56:39 +0100 Subject: ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers Distributed filesystems are most effective when the server and client clocks are synchronised. Embedded devices often use NFS for their root filesystem but typically do not contain an RTC, so the clocks of the NFS server and the embedded device will be out-of-sync when the root filesystem is mounted (and may not be synchronised until late in the boot process). Extend ipconfig with the ability to export IP addresses of NTP servers it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as follows: - If ipconfig is configured manually via the "ip=" or "nfsaddrs=" kernel command line parameters, one NTP server can be specified in the new "<ntp0-ip>" parameter. - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in the DHCPDISCOVER message, and record the IP addresses of up to three NTP servers sent by the responding DHCP server in the subsequent DHCPOFFER message. ipconfig will only write the NTP server IP addresses it discovers to /proc/net/ipconfig/ntp_servers, one per line (in the order received from the DHCP server, if DHCP autoconfiguration is used); making use of these NTP servers is the responsibility of a user space process (e.g. an initrd/initramfs script that invokes an NTP client before mounting an NFS root filesystem). Signed-off-by: Chris Novakovic Signed-off-by: David S. Miller --- Documentation/filesystems/nfs/nfsroot.txt | 35 +++++++-- net/ipv4/ipconfig.c | 118 +++++++++++++++++++++++++++--- 2 files changed, 136 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index a1030bea60d3..d2963123eb1c 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -5,6 +5,7 @@ Written 1996 by Gero Kuhlmann Updated 1997 by Martin Mares Updated 2006 by Nico Schottelius Updated 2006 by Horms +Updated 2018 by Chris Novakovic @@ -79,7 +80,7 @@ nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: - <dns0-ip>:<dns1-ip> + <dns0-ip>:<dns1-ip>:<ntp0-ip> This parameter tells the kernel how to configure IP addresses of devices and also how to set up the IP routing table. It was originally called @@ -178,9 +179,18 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: <dns1-ip> IP address of secondary nameserver. See <dns0-ip>. +<ntp0-ip> IP address of a Network Time Protocol (NTP) server. + Value is exported to /proc/net/ipconfig/ntp_servers, but is + otherwise unused (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration.
+ + After configuration (whether manual or automatic) is complete, two files + are created in the following format; lines are omitted if their respective + value is empty following configuration: + + - /proc/net/pnp: #PROTO: <DHCP|BOOTP|RARP> (depending on configuration method) domain <dns-domain> (if autoconfigured, the DNS domain) @@ -189,13 +199,26 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: nameserver <dns2-ip> (tertiary name server IP) bootserver <server-ip> (NFS server IP) - <dns-domain> and <dns2-ip> are requested during autoconfiguration; they - cannot be specified as part of the "ip=" kernel command line parameter. + - /proc/net/ipconfig/ntp_servers: + + <ntp0-ip> (NTP server IP) + <ntp1-ip> (NTP server IP) + <ntp2-ip> (NTP server IP) + + <dns-domain> and <dns2-ip> (in /proc/net/pnp) and <ntp1-ip> and <ntp2-ip> + (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration; + they cannot be specified as part of the "ip=" kernel command line parameter. Because the "domain" and "nameserver" options are recognised by DNS resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems that use an NFS root filesystem. + Note that the kernel will not synchronise the system time with any NTP + servers it discovers; this is the responsibility of a user space process + (e.g. an initrd/initramfs script that passes the IP addresses listed in + /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real + root filesystem if it is on NFS). + nfsrootdebug diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index 9abf833f3a99..d839d74853fc 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -28,6 +28,9 @@ * * Multiple Nameservers in /proc/net/pnp * -- Josef Siemes , Aug 2002 + * + * NTP servers in /proc/net/ipconfig/ntp_servers + * -- Chris Novakovic , April 2018 */ #include @@ -93,6 +96,7 @@ #define CONF_TIMEOUT_MAX (HZ*30) /* Maximum allowed timeout */ #define CONF_NAMESERVERS_MAX 3 /* Maximum number of nameservers - '3' from resolv.h */ +#define CONF_NTP_SERVERS_MAX 3 /* Maximum number of NTP servers */ #define NONE cpu_to_be32(INADDR_NONE) #define ANY cpu_to_be32(INADDR_ANY) @@ -152,6 +156,7 @@ static int ic_proto_used; /* Protocol used, if any */ #define ic_proto_used 0 #endif static __be32 ic_nameservers[CONF_NAMESERVERS_MAX]; /* DNS Server IP addresses */ +static __be32 ic_ntp_servers[CONF_NTP_SERVERS_MAX]; /* NTP server IP addresses */ static u8 ic_domain[64]; /* DNS (not NIS) domain name */ /* @@ -579,6 +584,15 @@ static inline void __init ic_nameservers_predef(void) ic_nameservers[i] = NONE; } +/* Predefine NTP servers */ +static inline void __init ic_ntp_servers_predef(void) +{ + int i; + + for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) + ic_ntp_servers[i] = NONE; +} + /* * DHCP/BOOTP support.
*/ @@ -674,6 +688,7 @@ ic_dhcp_init_options(u8 *options, struct ic_device *d) 17, /* Boot path */ 26, /* MTU */ 40, /* NIS domain name */ + 42, /* NTP servers */ }; *e++ = 55; /* Parameter request list */ @@ -753,12 +768,13 @@ static void __init ic_bootp_init_ext(u8 *e) */ static inline void __init ic_bootp_init(void) { - /* Re-initialise all name servers to NONE, in case any were set via the - * "ip=" or "nfsaddrs=" kernel command line parameters: any IP addresses - * specified there will already have been decoded but are no longer - * needed + /* Re-initialise all name servers and NTP servers to NONE, in case any + * were set via the "ip=" or "nfsaddrs=" kernel command line parameters: + * any IP addresses specified there will already have been decoded but + * are no longer needed */ ic_nameservers_predef(); + ic_ntp_servers_predef(); dev_add_pack(&bootp_packet_type); } @@ -922,6 +938,15 @@ static void __init ic_do_bootp_ext(u8 *ext) ic_bootp_string(utsname()->domainname, ext+1, *ext, __NEW_UTS_LEN); break; + case 42: /* NTP servers */ + servers = *ext / 4; + if (servers > CONF_NTP_SERVERS_MAX) + servers = CONF_NTP_SERVERS_MAX; + for (i = 0; i < servers; i++) { + if (ic_ntp_servers[i] == NONE) + memcpy(&ic_ntp_servers[i], ext+1+4*i, 4); + } + break; } } @@ -1268,6 +1293,7 @@ static int __init ic_dynamic(void) #ifdef CONFIG_PROC_FS +/* Name servers: */ static int pnp_seq_show(struct seq_file *seq, void *v) { int i; @@ -1306,7 +1332,7 @@ static const struct file_operations pnp_seq_fops = { }; /* Create the /proc/net/ipconfig directory */ -static int ipconfig_proc_net_init(void) +static int __init ipconfig_proc_net_init(void) { ipconfig_dir = proc_net_mkdir(&init_net, "ipconfig", init_net.proc_net); if (!ipconfig_dir) @@ -1314,6 +1340,52 @@ static int ipconfig_proc_net_init(void) return 0; } + +/* Create a new file under /proc/net/ipconfig */ +static int ipconfig_proc_net_create(const char *name, + const struct file_operations *fops) +{ + char *pname; + struct proc_dir_entry *p; + + if (!ipconfig_dir) + return -ENOMEM; + + pname = kasprintf(GFP_KERNEL, "%s%s", "ipconfig/", name); + if (!pname) + return -ENOMEM; + + p = proc_create(pname, 0444, init_net.proc_net, fops); + kfree(pname); + if (!p) + return -ENOMEM; + + return 0; +} + +/* Write NTP server IP addresses to /proc/net/ipconfig/ntp_servers */ +static int ntp_servers_seq_show(struct seq_file *seq, void *v) +{ + int i; + + for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) { + if (ic_ntp_servers[i] != NONE) + seq_printf(seq, "%pI4\n", &ic_ntp_servers[i]); + } + return 0; +} + +static int ntp_servers_seq_open(struct inode *inode, struct file *file) +{ + return single_open(file, ntp_servers_seq_show, NULL); +} + +static const struct file_operations ntp_servers_seq_fops = { + .open = ntp_servers_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; #endif /* CONFIG_PROC_FS */ /* @@ -1388,17 +1460,20 @@ static int __init ip_auto_config(void) int err; unsigned int i; - /* Initialise all name servers to NONE (but only if the "ip=" or - * "nfsaddrs=" kernel command line parameters weren't decoded, otherwise - * we'll overwrite the IP addresses specified there) + /* Initialise all name servers and NTP servers to NONE (but only if the + * "ip=" or "nfsaddrs=" kernel command line parameters weren't decoded, + * otherwise we'll overwrite the IP addresses specified there) */ - if (ic_set_manually == 0) + if (ic_set_manually == 0) { ic_nameservers_predef(); + ic_ntp_servers_predef(); + } #ifdef CONFIG_PROC_FS 
proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops); - ipconfig_proc_net_init(); + if (ipconfig_proc_net_init() == 0) + ipconfig_proc_net_create("ntp_servers", &ntp_servers_seq_fops); #endif /* CONFIG_PROC_FS */ if (!ic_enable) @@ -1523,6 +1598,19 @@ static int __init ip_auto_config(void) if (i + 1 == CONF_NAMESERVERS_MAX) pr_cont("\n"); } + /* NTP servers (if any): */ + for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) { + if (ic_ntp_servers[i] != NONE) { + if (i == 0) + pr_info(" ntpserver%u=%pI4", + i, &ic_ntp_servers[i]); + else + pr_cont(", ntpserver%u=%pI4", + i, &ic_ntp_servers[i]); + } + if (i + 1 == CONF_NTP_SERVERS_MAX) + pr_cont("\n"); + } #endif /* !SILENT */ /* @@ -1620,8 +1708,9 @@ static int __init ip_auto_config_setup(char *addrs) return 1; } - /* Initialise all name servers to NONE */ + /* Initialise all name servers and NTP servers to NONE */ ic_nameservers_predef(); + ic_ntp_servers_predef(); /* Parse string for static IP assignment. */ ip = addrs; @@ -1680,6 +1769,13 @@ static int __init ip_auto_config_setup(char *addrs) ic_nameservers[1] = NONE; } break; + case 9: + if (CONF_NTP_SERVERS_MAX >= 1) { + ic_ntp_servers[0] = in_aton(ip); + if (ic_ntp_servers[0] == ANY) + ic_ntp_servers[0] = NONE; + } + break; } } ip = cp; -- cgit v1.2.3 From 83aa025f535f76733e334e3d2a4d8577c8441a7e Mon Sep 17 00:00:00 2001 From: Willem de Bruijn Date: Thu, 26 Apr 2018 13:42:21 -0400 Subject: udp: add gso support to virtual devices Virtual devices such as tunnels and bonding can handle large packets. Only segment packets when reaching a physical or loopback device. Signed-off-by: Willem de Bruijn Signed-off-by: David S. Miller --- Documentation/networking/netdev-features.txt | 7 +++++++ include/linux/netdev_features.h | 5 ++++- include/linux/netdevice.h | 1 + net/core/ethtool.c | 1 + 4 files changed, 13 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index c77f9d57eb91..c4a54c162547 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -113,6 +113,13 @@ whatever headers there might be. NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + * Transmit UDP segmentation offload + +NETIF_F_GSO_UDP_GSO_L4 accepts a single UDP header with a payload that exceeds +gso_size. On segmentation, it segments the payload on gso_size boundaries and +replicates the network and UDP headers (fixing up the last one if less than +gso_size). + * Transmit DMA from high memory On platforms where this is relevant, NETIF_F_HIGHDMA signals that diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index 35b79f47a13d..fe2f3b30960e 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -55,8 +55,9 @@ enum { NETIF_F_GSO_SCTP_BIT, /* ... SCTP fragmentation */ NETIF_F_GSO_ESP_BIT, /* ... ESP with TSO */ NETIF_F_GSO_UDP_BIT, /* ... UFO, deprecated except tuntap */ + NETIF_F_GSO_UDP_L4_BIT, /* ... 
UDP payload GSO (not UFO) */ /**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */ - NETIF_F_GSO_UDP_BIT, + NETIF_F_GSO_UDP_L4_BIT, NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */ NETIF_F_SCTP_CRC_BIT, /* SCTP checksum offload */ @@ -147,6 +148,7 @@ enum { #define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM) #define NETIF_F_RX_UDP_TUNNEL_PORT __NETIF_F(RX_UDP_TUNNEL_PORT) #define NETIF_F_HW_TLS_RECORD __NETIF_F(HW_TLS_RECORD) +#define NETIF_F_GSO_UDP_L4 __NETIF_F(GSO_UDP_L4) #define for_each_netdev_feature(mask_addr, bit) \ for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT) @@ -216,6 +218,7 @@ enum { NETIF_F_GSO_GRE_CSUM | \ NETIF_F_GSO_IPXIP4 | \ NETIF_F_GSO_IPXIP6 | \ + NETIF_F_GSO_UDP_L4 | \ NETIF_F_GSO_UDP_TUNNEL | \ NETIF_F_GSO_UDP_TUNNEL_CSUM) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 14e0777ffcfb..366c32891158 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4186,6 +4186,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type) BUILD_BUG_ON(SKB_GSO_SCTP != (NETIF_F_GSO_SCTP >> NETIF_F_GSO_SHIFT)); BUILD_BUG_ON(SKB_GSO_ESP != (NETIF_F_GSO_ESP >> NETIF_F_GSO_SHIFT)); BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_GSO_UDP >> NETIF_F_GSO_SHIFT)); + BUILD_BUG_ON(SKB_GSO_UDP_L4 != (NETIF_F_GSO_UDP_L4 >> NETIF_F_GSO_SHIFT)); return (features & feature) == feature; } diff --git a/net/core/ethtool.c b/net/core/ethtool.c index 03416e6dd5d7..4650fd6d678c 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -92,6 +92,7 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] [NETIF_F_GSO_PARTIAL_BIT] = "tx-gso-partial", [NETIF_F_GSO_SCTP_BIT] = "tx-sctp-segmentation", [NETIF_F_GSO_ESP_BIT] = "tx-esp-segmentation", + [NETIF_F_GSO_UDP_L4_BIT] = "tx-udp-segmentation", [NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc", [NETIF_F_SCTP_CRC_BIT] = "tx-checksum-sctp", -- cgit v1.2.3 From 2c25fc9a503adef4279951382fc9d47b59977f59 Mon Sep 17 00:00:00 2001 From: Leo Yan Date: Fri, 27 Apr 2018 18:02:54 +0800 Subject: bpf, doc: Update bpf_jit_enable limitation for CONFIG_BPF_JIT_ALWAYS_ON When CONFIG_BPF_JIT_ALWAYS_ON is enabled, kernel has limitation for bpf_jit_enable, so it has fixed value 1 and we cannot set it to 2 for JIT opcode dumping; this patch is to update the doc for it. Suggested-by: Daniel Borkmann Signed-off-by: Leo Yan Signed-off-by: Daniel Borkmann --- Documentation/networking/filter.txt | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index fd55c7de9991..5032e1263bc9 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -483,6 +483,12 @@ Example output from dmesg: [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. 
+ In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for generating disassembly out of the kernel log's hexdump: -- cgit v1.2.3 From a6dc6670cd7e0bba580026443cbf77fdd370c791 Mon Sep 17 00:00:00 2001 From: Ahmed Abdelsalam Date: Fri, 27 Apr 2018 17:51:48 +0200 Subject: ipv6: sr: Add documentation for seg6_flowlabel sysctl This patch adds documentation for the seg6_flowlabel sysctl to Documentation/networking/ip-sysctl.txt Signed-off-by: Ahmed Abdelsalam Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b583a73cf95f..b2f463e0cb33 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1428,6 +1428,19 @@ ip6frag_low_thresh - INTEGER ip6frag_time - INTEGER Time in seconds to keep an IPv6 fragment in memory. +IPv6 Segment Routing: + +seg6_flowlabel - INTEGER + Controls the behaviour of computing the flowlabel of outer + IPv6 header in case of SR T.encaps + + -1 set flowlabel to zero. + 0 copy flowlabel from Inner packet in case of Inner IPv6 + (Set flowlabel to 0 in case IPv4/L2) + 1 Compute the flowlabel using seg6_make_flowlabel() + + Default is 0. + conf/default/*: Change the interface-specific default settings. -- cgit v1.2.3 From 7e5d05e18ba1ed491c6f836edee7f0b90f3167bc Mon Sep 17 00:00:00 2001 From: Yixun Lan Date: Sat, 28 Apr 2018 10:21:10 +0000 Subject: dt-bindings: net: meson-dwmac: new compatible name for AXG SoC We need to introduce a new compatible name for the Meson-AXG SoC in order to support the RMII 100M ethernet PHY, since the PRG_ETH0 register of the dwmac glue layer has changed from previous SoCs. Signed-off-by: Yixun Lan Reviewed-by: Rob Herring Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/meson-dwmac.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt index 61cada22ae6c..1321bb194ed9 100644 --- a/Documentation/devicetree/bindings/net/meson-dwmac.txt +++ b/Documentation/devicetree/bindings/net/meson-dwmac.txt @@ -11,6 +11,7 @@ Required properties on all platforms: - "amlogic,meson8b-dwmac" - "amlogic,meson8m2-dwmac" - "amlogic,meson-gxbb-dwmac" + - "amlogic,meson-axg-dwmac" Additionally "snps,dwmac" and any applicable more detailed version number described in net/stmmac.txt should be used. -- cgit v1.2.3 From 03f5781be2c7b7e728d724ac70ba10799cc710d7 Mon Sep 17 00:00:00 2001 From: Wang YanQing Date: Thu, 3 May 2018 14:10:43 +0800 Subject: bpf, x86_32: add eBPF JIT compiler for ia32 The JIT compiler emits IA32 (32-bit x86) instructions. Currently, it supports eBPF only. Classic BPF is supported through conversion by the BPF core. Almost all instructions from the eBPF ISA are supported, except the following: BPF_ALU64 | BPF_DIV | BPF_K BPF_ALU64 | BPF_DIV | BPF_X BPF_ALU64 | BPF_MOD | BPF_K BPF_ALU64 | BPF_MOD | BPF_X BPF_STX | BPF_XADD | BPF_W BPF_STX | BPF_XADD | BPF_DW It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment. IA32 has only a few general purpose registers: EAX|EDX|ECX|EBX|ESI|EDI. I use EAX|EDX|ECX|EBX as temporary registers to simulate instructions in the eBPF ISA, allocate ESI|EDI to BPF_REG_AX for constant blinding, and all other eBPF registers, R0-R10, are simulated through scratch space on the stack.
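As an illustration of the 64-bit-on-32-bit simulation just described, here is a minimal C model (not part of this patch; struct reg64 and alu64_add are invented names for illustration): each eBPF register is held as two 32-bit words, and a 64-bit BPF_ADD becomes an add on the low word followed by an add-with-carry on the high word, matching the x86 add/adc pair the patch emits in emit_ia32_alu_r():

#include <stdint.h>
#include <stdio.h>

struct reg64 { uint32_t lo, hi; };	/* one 64-bit eBPF register as two IA32 words */

static void alu64_add(struct reg64 *dst, const struct reg64 *src)
{
	uint32_t old_lo = dst->lo;

	dst->lo += src->lo;				/* JIT emits: add dreg_lo,sreg_lo */
	dst->hi += src->hi + (dst->lo < old_lo);	/* JIT emits: adc dreg_hi,sreg_hi */
}

int main(void)
{
	struct reg64 a = { 0xFFFFFFFFu, 0x0 }, b = { 0x1, 0x0 };

	alu64_add(&a, &b);			/* carry propagates into the high word */
	printf("0x%08x%08x\n", a.hi, a.lo);	/* prints 0x0000000100000000 */
	return 0;
}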
The reasons behind the hardware register allocation policy are: 1: MUL needs EAX:EDX and shift operations need ECX, so they aren't suitable for general eBPF 64-bit register simulation. 2: We need at least 4 registers to simulate most eBPF ISA operations on register operands instead of on register&memory operands. 3: We need to put BPF_REG_AX in hardware registers, or constant blinding will degrade JIT performance heavily. Tested on PC (Intel(R) Core(TM) i5-5200U CPU). Testing results on i5-5200U: 1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed] 2) test_progs: Summary: 83 PASSED, 0 FAILED. 3) test_lpm: OK 4) test_lru_map: OK 5) test_verifier: Summary: 828 PASSED, 0 FAILED. The above tests were all run under the following two conditions separately: 1: bpf_jit_enable=1 and bpf_jit_harden=0 2: bpf_jit_enable=1 and bpf_jit_harden=2 Below are some numbers for this JIT implementation: Note: I ran test_progs in kselftest 100 times continuously for every condition; the numbers are in the format total/times=avg. The numbers that test_bpf reports show almost the same relation. a: jit_enable=0 and jit_harden=0 b: jit_enable=1 and jit_harden=0 test_pkt_access:PASS:ipv4:15622/100=156 test_pkt_access:PASS:ipv4:10674/100=106 test_pkt_access:PASS:ipv6:9130/100=91 test_pkt_access:PASS:ipv6:4855/100=48 test_xdp:PASS:ipv4:240198/100=2401 test_xdp:PASS:ipv4:138912/100=1389 test_xdp:PASS:ipv6:137326/100=1373 test_xdp:PASS:ipv6:68542/100=685 test_l4lb:PASS:ipv4:61100/100=611 test_l4lb:PASS:ipv4:37302/100=373 test_l4lb:PASS:ipv6:101000/100=1010 test_l4lb:PASS:ipv6:55030/100=550 c: jit_enable=1 and jit_harden=2 test_pkt_access:PASS:ipv4:10558/100=105 test_pkt_access:PASS:ipv6:5092/100=50 test_xdp:PASS:ipv4:131902/100=1319 test_xdp:PASS:ipv6:77932/100=779 test_l4lb:PASS:ipv4:38924/100=389 test_l4lb:PASS:ipv6:57520/100=575 The numbers show we get a 30%~50% improvement. See Documentation/networking/filter.txt for more information. Changelog: Changes v5-v6: 1: Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for consistency reasons. 2: Clean up non-standard comments, reported by Daniel Borkmann. 3: Fix a memory leak issue, reported by Daniel Borkmann. Changes v4-v5: 1: Delete is_on_stack; BPF_REG_AX is the only register kept in real hardware registers, so just check against it. 2: Apply commit 1612a981b766 ("bpf, x64: fix JIT emission for dead code"), suggested by Daniel Borkmann. Changes v3-v4: 1: Fix changelog in commit. After installing llvm-6.0, test_progs won't report errors. I submitted another patch: "bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform" to fix another problem; after that patch, test_verifier won't report errors either. 2: Fix clearing r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation. Changes v2-v3: 1: Move BPF_REG_AX to real hardware registers for performance reasons. 3: Use bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann. 4: Delete partial code from 1c2a088a6626, suggested by Daniel Borkmann. 5: Some bug fixes and comment improvements. Changes v1-v2: 1: Fix bug in emit_ia32_neg64. 2: Fix bug in emit_ia32_arsh_r64. 3: Delete filename in top level comment, suggested by Thomas Gleixner. 4: Delete unnecessary boilerplate text, suggested by Thomas Gleixner. 5: Rewrite some words in changelog. 6: CodingStyle improvements and a few more comments.
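To visualise the stack scratch-space layout used for R0-R10, here is a small illustrative C sketch (not patch code): it prints the pair of 4-byte slots that the patch's bpf2ia32[] table assigns to each 64-bit eBPF register. The offsets correspond to the STACK_OFFSET() values in the patch; the real JIT addresses these slots relative to EBP via STACK_VAR():

#include <stdio.h>

#define BPF_REG_CNT 11	/* R0..R10; R10 is the read-only frame pointer BPF_REG_FP */

int main(void)
{
	int r;

	/* Mirrors bpf2ia32[]: register r -> slots at offsets r*8 (lo) and r*8+4 (hi) */
	for (r = 0; r < BPF_REG_CNT; r++)
		printf("BPF R%-2d -> scratch offset %2d (lo), %2d (hi)\n",
		       r, r * 8, r * 8 + 4);
	return 0;
}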
Signed-off-by: Wang YanQing Signed-off-by: Daniel Borkmann --- Documentation/sysctl/net.txt | 1 + arch/x86/Kconfig | 2 +- arch/x86/include/asm/nospec-branch.h | 30 +- arch/x86/net/Makefile | 8 +- arch/x86/net/bpf_jit_comp32.c | 2553 ++++++++++++++++++++++++++++++++++ 5 files changed, 2588 insertions(+), 6 deletions(-) create mode 100644 arch/x86/net/bpf_jit_comp32.c (limited to 'Documentation') diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt index 5992602469d8..9ecde517728c 100644 --- a/Documentation/sysctl/net.txt +++ b/Documentation/sysctl/net.txt @@ -45,6 +45,7 @@ through bpf(2) and passing a verifier in the kernel, a JIT will then translate these BPF proglets into native CPU instructions. There are two flavors of JITs, the newer eBPF JIT currently supported on: - x86_64 + - x86_32 - arm64 - arm32 - ppc64 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 00fcf81f2c56..1f5fa2f2c168 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -137,7 +137,7 @@ config X86 select HAVE_DMA_CONTIGUOUS select HAVE_DYNAMIC_FTRACE select HAVE_DYNAMIC_FTRACE_WITH_REGS - select HAVE_EBPF_JIT if X86_64 + select HAVE_EBPF_JIT select HAVE_EFFICIENT_UNALIGNED_ACCESS select HAVE_EXIT_THREAD select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index f928ad9b143f..2cd344d1a6e5 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -291,16 +291,20 @@ do { \ * lfence * jmp spec_trap * do_rop: - * mov %rax,(%rsp) + * mov %rax,(%rsp) for x86_64 + * mov %edx,(%esp) for x86_32 * retq * * Without retpolines configured: * - * jmp *%rax + * jmp *%rax for x86_64 + * jmp *%edx for x86_32 */ #ifdef CONFIG_RETPOLINE +#ifdef CONFIG_X86_64 # define RETPOLINE_RAX_BPF_JIT_SIZE 17 # define RETPOLINE_RAX_BPF_JIT() \ +do { \ EMIT1_off32(0xE8, 7); /* callq do_rop */ \ /* spec_trap: */ \ EMIT2(0xF3, 0x90); /* pause */ \ @@ -308,11 +312,31 @@ do { \ EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \ /* do_rop: */ \ EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \ - EMIT1(0xC3); /* retq */ + EMIT1(0xC3); /* retq */ \ +} while (0) #else +# define RETPOLINE_EDX_BPF_JIT() \ +do { \ + EMIT1_off32(0xE8, 7); /* call do_rop */ \ + /* spec_trap: */ \ + EMIT2(0xF3, 0x90); /* pause */ \ + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \ + EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \ + /* do_rop: */ \ + EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */ \ + EMIT1(0xC3); /* ret */ \ +} while (0) +#endif +#else /* !CONFIG_RETPOLINE */ + +#ifdef CONFIG_X86_64 # define RETPOLINE_RAX_BPF_JIT_SIZE 2 # define RETPOLINE_RAX_BPF_JIT() \ EMIT2(0xFF, 0xE0); /* jmp *%rax */ +#else +# define RETPOLINE_EDX_BPF_JIT() \ + EMIT2(0xFF, 0xE2) /* jmp *%edx */ +#endif #endif #endif /* _ASM_X86_NOSPEC_BRANCH_H_ */ diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile index fefb4b619598..c6b464a261bb 100644 --- a/arch/x86/net/Makefile +++ b/arch/x86/net/Makefile @@ -1,6 +1,10 @@ # # Arch-specific network modules # -OBJECT_FILES_NON_STANDARD_bpf_jit.o += y -obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o +ifeq ($(CONFIG_X86_32),y) + obj-$(CONFIG_BPF_JIT) += bpf_jit_comp32.o +else + OBJECT_FILES_NON_STANDARD_bpf_jit.o += y + obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o +endif diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c new file mode 100644 index 000000000000..61e61341b777 --- /dev/null +++ b/arch/x86/net/bpf_jit_comp32.c @@ -0,0 +1,2553 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Just-In-Time 
compiler for eBPF filters on IA32 (32bit x86) + * + * Author: Wang YanQing (udknight@gmail.com) + * The code based on code and ideas from: + * Eric Dumazet (eric.dumazet@gmail.com) + * and from: + * Shubham Bansal + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * eBPF prog stack layout: + * + * high + * original ESP => +-----+ + * | | callee saved registers + * +-----+ + * | ... | eBPF JIT scratch space + * BPF_FP,IA32_EBP => +-----+ + * | ... | eBPF prog stack + * +-----+ + * |RSVD | JIT scratchpad + * current ESP => +-----+ + * | | + * | ... | Function call stack + * | | + * +-----+ + * low + * + * The callee saved registers: + * + * high + * original ESP => +------------------+ \ + * | ebp | | + * current EBP => +------------------+ } callee saved registers + * | ebx,esi,edi | | + * +------------------+ / + * low + */ + +static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len) +{ + if (len == 1) + *ptr = bytes; + else if (len == 2) + *(u16 *)ptr = bytes; + else { + *(u32 *)ptr = bytes; + barrier(); + } + return ptr + len; +} + +#define EMIT(bytes, len) \ + do { prog = emit_code(prog, bytes, len); cnt += len; } while (0) + +#define EMIT1(b1) EMIT(b1, 1) +#define EMIT2(b1, b2) EMIT((b1) + ((b2) << 8), 2) +#define EMIT3(b1, b2, b3) EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3) +#define EMIT4(b1, b2, b3, b4) \ + EMIT((b1) + ((b2) << 8) + ((b3) << 16) + ((b4) << 24), 4) + +#define EMIT1_off32(b1, off) \ + do { EMIT1(b1); EMIT(off, 4); } while (0) +#define EMIT2_off32(b1, b2, off) \ + do { EMIT2(b1, b2); EMIT(off, 4); } while (0) +#define EMIT3_off32(b1, b2, b3, off) \ + do { EMIT3(b1, b2, b3); EMIT(off, 4); } while (0) +#define EMIT4_off32(b1, b2, b3, b4, off) \ + do { EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0) + +#define jmp_label(label, jmp_insn_len) (label - cnt - jmp_insn_len) + +static bool is_imm8(int value) +{ + return value <= 127 && value >= -128; +} + +static bool is_simm32(s64 value) +{ + return value == (s64) (s32) value; +} + +#define STACK_OFFSET(k) (k) +#define TCALL_CNT (MAX_BPF_JIT_REG + 0) /* Tail Call Count */ + +#define IA32_EAX (0x0) +#define IA32_EBX (0x3) +#define IA32_ECX (0x1) +#define IA32_EDX (0x2) +#define IA32_ESI (0x6) +#define IA32_EDI (0x7) +#define IA32_EBP (0x5) +#define IA32_ESP (0x4) + +/* + * List of x86 cond jumps opcodes (. + s8) + * Add 0x10 (and an extra 0x0f) to generate far jumps (. + s32) + */ +#define IA32_JB 0x72 +#define IA32_JAE 0x73 +#define IA32_JE 0x74 +#define IA32_JNE 0x75 +#define IA32_JBE 0x76 +#define IA32_JA 0x77 +#define IA32_JL 0x7C +#define IA32_JGE 0x7D +#define IA32_JLE 0x7E +#define IA32_JG 0x7F + +/* + * Map eBPF registers to IA32 32bit registers or stack scratch space. + * + * 1. All the registers, R0-R10, are mapped to scratch space on stack. + * 2. We need two 64 bit temp registers to do complex operations on eBPF + * registers. + * 3. For performance reason, the BPF_REG_AX for blinding constant, is + * mapped to real hardware register pair, IA32_ESI and IA32_EDI. + * + * As the eBPF registers are all 64 bit registers and IA32 has only 32 bit + * registers, we have to map each eBPF registers with two IA32 32 bit regs + * or scratch memory space and we have to build eBPF 64 bit register from those. + * + * We use IA32_EAX, IA32_EDX, IA32_ECX, IA32_EBX as temporary registers. 
+ */ +static const u8 bpf2ia32[][2] = { + /* Return value from in-kernel function, and exit value from eBPF */ + [BPF_REG_0] = {STACK_OFFSET(0), STACK_OFFSET(4)}, + + /* The arguments from eBPF program to in-kernel function */ + /* Stored on stack scratch space */ + [BPF_REG_1] = {STACK_OFFSET(8), STACK_OFFSET(12)}, + [BPF_REG_2] = {STACK_OFFSET(16), STACK_OFFSET(20)}, + [BPF_REG_3] = {STACK_OFFSET(24), STACK_OFFSET(28)}, + [BPF_REG_4] = {STACK_OFFSET(32), STACK_OFFSET(36)}, + [BPF_REG_5] = {STACK_OFFSET(40), STACK_OFFSET(44)}, + + /* Callee saved registers that in-kernel function will preserve */ + /* Stored on stack scratch space */ + [BPF_REG_6] = {STACK_OFFSET(48), STACK_OFFSET(52)}, + [BPF_REG_7] = {STACK_OFFSET(56), STACK_OFFSET(60)}, + [BPF_REG_8] = {STACK_OFFSET(64), STACK_OFFSET(68)}, + [BPF_REG_9] = {STACK_OFFSET(72), STACK_OFFSET(76)}, + + /* Read only Frame Pointer to access Stack */ + [BPF_REG_FP] = {STACK_OFFSET(80), STACK_OFFSET(84)}, + + /* Temporary register for blinding constants. */ + [BPF_REG_AX] = {IA32_ESI, IA32_EDI}, + + /* Tail call count. Stored on stack scratch space. */ + [TCALL_CNT] = {STACK_OFFSET(88), STACK_OFFSET(92)}, +}; + +#define dst_lo dst[0] +#define dst_hi dst[1] +#define src_lo src[0] +#define src_hi src[1] + +#define STACK_ALIGNMENT 8 +/* + * Stack space for BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, + * BPF_REG_5, BPF_REG_6, BPF_REG_7, BPF_REG_8, BPF_REG_9, + * BPF_REG_FP, BPF_REG_AX and Tail call counts. + */ +#define SCRATCH_SIZE 96 + +/* Total stack size used in JITed code */ +#define _STACK_SIZE \ + (stack_depth + \ + + SCRATCH_SIZE + \ + + 4 /* Extra space for skb_copy_bits buffer */) + +#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT) + +/* Get the offset of eBPF REGISTERs stored on scratch space. */ +#define STACK_VAR(off) (off) + +/* Offset of skb_copy_bits buffer */ +#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE) + +/* Encode 'dst_reg' register into IA32 opcode 'byte' */ +static u8 add_1reg(u8 byte, u32 dst_reg) +{ + return byte + dst_reg; +} + +/* Encode 'dst_reg' and 'src_reg' registers into IA32 opcode 'byte' */ +static u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg) +{ + return byte + dst_reg + (src_reg << 3); +} + +static void jit_fill_hole(void *area, unsigned int size) +{ + /* Fill whole space with int3 instructions */ + memset(area, 0xcc, size); +} + +static inline void emit_ia32_mov_i(const u8 dst, const u32 val, bool dstk, + u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (dstk) { + if (val == 0) { + /* xor eax,eax */ + EMIT2(0x33, add_2reg(0xC0, IA32_EAX, IA32_EAX)); + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst)); + } else { + EMIT3_off32(0xC7, add_1reg(0x40, IA32_EBP), + STACK_VAR(dst), val); + } + } else { + if (val == 0) + EMIT2(0x33, add_2reg(0xC0, dst, dst)); + else + EMIT2_off32(0xC7, add_1reg(0xC0, dst), + val); + } + *pprog = prog; +} + +/* dst = imm (4 bytes)*/ +static inline void emit_ia32_mov_r(const u8 dst, const u8 src, bool dstk, + bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 sreg = sstk ? 
IA32_EAX : src; + + if (sstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(src)); + if (dstk) + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, sreg), STACK_VAR(dst)); + else + /* mov dst,sreg */ + EMIT2(0x89, add_2reg(0xC0, dst, sreg)); + + *pprog = prog; +} + +/* dst = src */ +static inline void emit_ia32_mov_r64(const bool is64, const u8 dst[], + const u8 src[], bool dstk, + bool sstk, u8 **pprog) +{ + emit_ia32_mov_r(dst_lo, src_lo, dstk, sstk, pprog); + if (is64) + /* complete 8 byte move */ + emit_ia32_mov_r(dst_hi, src_hi, dstk, sstk, pprog); + else + /* zero out high 4 bytes */ + emit_ia32_mov_i(dst_hi, 0, dstk, pprog); +} + +/* Sign extended move */ +static inline void emit_ia32_mov_i64(const bool is64, const u8 dst[], + const u32 val, bool dstk, u8 **pprog) +{ + u32 hi = 0; + + if (is64 && (val & (1<<31))) + hi = (u32)~0; + emit_ia32_mov_i(dst_lo, val, dstk, pprog); + emit_ia32_mov_i(dst_hi, hi, dstk, pprog); +} + +/* + * ALU operation (32 bit) + * dst = dst * src + */ +static inline void emit_ia32_mul_r(const u8 dst, const u8 src, bool dstk, + bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 sreg = sstk ? IA32_ECX : src; + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src)); + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst)); + else + /* mov eax,dst */ + EMIT2(0x8B, add_2reg(0xC0, dst, IA32_EAX)); + + + EMIT2(0xF7, add_1reg(0xE0, sreg)); + + if (dstk) + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst)); + else + /* mov dst,eax */ + EMIT2(0x89, add_2reg(0xC0, dst, IA32_EAX)); + + *pprog = prog; +} + +static inline void emit_ia32_to_le_r64(const u8 dst[], s32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + + if (dstk && val != 64) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + switch (val) { + case 16: + /* + * Emit 'movzwl eax,ax' to zero extend 16-bit + * into 64 bit + */ + EMIT2(0x0F, 0xB7); + EMIT1(add_2reg(0xC0, dreg_lo, dreg_lo)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + break; + case 32: + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + break; + case 64: + /* nop */ + break; + } + + if (dstk && val != 64) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +static inline void emit_ia32_to_be_r64(const u8 dst[], s32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? 
IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + switch (val) { + case 16: + /* Emit 'ror %ax, 8' to swap lower 2 bytes */ + EMIT1(0x66); + EMIT3(0xC1, add_1reg(0xC8, dreg_lo), 8); + + EMIT2(0x0F, 0xB7); + EMIT1(add_2reg(0xC0, dreg_lo, dreg_lo)); + + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + break; + case 32: + /* Emit 'bswap eax' to swap lower 4 bytes */ + EMIT1(0x0F); + EMIT1(add_1reg(0xC8, dreg_lo)); + + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + break; + case 64: + /* Emit 'bswap eax' to swap lower 4 bytes */ + EMIT1(0x0F); + EMIT1(add_1reg(0xC8, dreg_lo)); + + /* Emit 'bswap edx' to swap lower 4 bytes */ + EMIT1(0x0F); + EMIT1(add_1reg(0xC8, dreg_hi)); + + /* mov ecx,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, IA32_ECX, dreg_hi)); + /* mov dreg_hi,dreg_lo */ + EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo)); + /* mov dreg_lo,ecx */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, IA32_ECX)); + + break; + } + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +/* + * ALU operation (32 bit) + * dst = dst (div|mod) src + */ +static inline void emit_ia32_div_mod_r(const u8 op, const u8 dst, const u8 src, + bool dstk, bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src)); + else if (src != IA32_ECX) + /* mov ecx,src */ + EMIT2(0x8B, add_2reg(0xC0, src, IA32_ECX)); + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst)); + else + /* mov eax,dst */ + EMIT2(0x8B, add_2reg(0xC0, dst, IA32_EAX)); + + /* xor edx,edx */ + EMIT2(0x31, add_2reg(0xC0, IA32_EDX, IA32_EDX)); + /* div ecx */ + EMIT2(0xF7, add_1reg(0xF0, IA32_ECX)); + + if (op == BPF_MOD) { + if (dstk) + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst)); + else + EMIT2(0x89, add_2reg(0xC0, dst, IA32_EDX)); + } else { + if (dstk) + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst)); + else + EMIT2(0x89, add_2reg(0xC0, dst, IA32_EAX)); + } + *pprog = prog; +} + +/* + * ALU operation (32 bit) + * dst = dst (shift) src + */ +static inline void emit_ia32_shift_r(const u8 op, const u8 dst, const u8 src, + bool dstk, bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg = dstk ? 
IA32_EAX : dst; + u8 b2; + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst)); + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src)); + else if (src != IA32_ECX) + /* mov ecx,src */ + EMIT2(0x8B, add_2reg(0xC0, src, IA32_ECX)); + + switch (op) { + case BPF_LSH: + b2 = 0xE0; break; + case BPF_RSH: + b2 = 0xE8; break; + case BPF_ARSH: + b2 = 0xF8; break; + default: + return; + } + EMIT2(0xD3, add_1reg(b2, dreg)); + + if (dstk) + /* mov dword ptr [ebp+off],dreg */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg), STACK_VAR(dst)); + *pprog = prog; +} + +/* + * ALU operation (32 bit) + * dst = dst (op) src + */ +static inline void emit_ia32_alu_r(const bool is64, const bool hi, const u8 op, + const u8 dst, const u8 src, bool dstk, + bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 sreg = sstk ? IA32_EAX : src; + u8 dreg = dstk ? IA32_EDX : dst; + + if (sstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(src)); + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(dst)); + + switch (BPF_OP(op)) { + /* dst = dst + src */ + case BPF_ADD: + if (hi && is64) + EMIT2(0x11, add_2reg(0xC0, dreg, sreg)); + else + EMIT2(0x01, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst - src */ + case BPF_SUB: + if (hi && is64) + EMIT2(0x19, add_2reg(0xC0, dreg, sreg)); + else + EMIT2(0x29, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst | src */ + case BPF_OR: + EMIT2(0x09, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst & src */ + case BPF_AND: + EMIT2(0x21, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst ^ src */ + case BPF_XOR: + EMIT2(0x31, add_2reg(0xC0, dreg, sreg)); + break; + } + + if (dstk) + /* mov dword ptr [ebp+off],dreg */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg), + STACK_VAR(dst)); + *pprog = prog; +} + +/* ALU operation (64 bit) */ +static inline void emit_ia32_alu_r64(const bool is64, const u8 op, + const u8 dst[], const u8 src[], + bool dstk, bool sstk, + u8 **pprog) +{ + u8 *prog = *pprog; + + emit_ia32_alu_r(is64, false, op, dst_lo, src_lo, dstk, sstk, &prog); + if (is64) + emit_ia32_alu_r(is64, true, op, dst_hi, src_hi, dstk, sstk, + &prog); + else + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + *pprog = prog; +} + +/* + * ALU operation (32 bit) + * dst = dst (op) val + */ +static inline void emit_ia32_alu_i(const bool is64, const bool hi, const u8 op, + const u8 dst, const s32 val, bool dstk, + u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg = dstk ? 
IA32_EAX : dst; + u8 sreg = IA32_EDX; + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst)); + + if (!is_imm8(val)) + /* mov edx,imm32*/ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EDX), val); + + switch (op) { + /* dst = dst + val */ + case BPF_ADD: + if (hi && is64) { + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xD0, dreg), val); + else + EMIT2(0x11, add_2reg(0xC0, dreg, sreg)); + } else { + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xC0, dreg), val); + else + EMIT2(0x01, add_2reg(0xC0, dreg, sreg)); + } + break; + /* dst = dst - val */ + case BPF_SUB: + if (hi && is64) { + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xD8, dreg), val); + else + EMIT2(0x19, add_2reg(0xC0, dreg, sreg)); + } else { + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xE8, dreg), val); + else + EMIT2(0x29, add_2reg(0xC0, dreg, sreg)); + } + break; + /* dst = dst | val */ + case BPF_OR: + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xC8, dreg), val); + else + EMIT2(0x09, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst & val */ + case BPF_AND: + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xE0, dreg), val); + else + EMIT2(0x21, add_2reg(0xC0, dreg, sreg)); + break; + /* dst = dst ^ val */ + case BPF_XOR: + if (is_imm8(val)) + EMIT3(0x83, add_1reg(0xF0, dreg), val); + else + EMIT2(0x31, add_2reg(0xC0, dreg, sreg)); + break; + case BPF_NEG: + EMIT2(0xF7, add_1reg(0xD8, dreg)); + break; + } + + if (dstk) + /* mov dword ptr [ebp+off],dreg */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg), + STACK_VAR(dst)); + *pprog = prog; +} + +/* ALU operation (64 bit) */ +static inline void emit_ia32_alu_i64(const bool is64, const u8 op, + const u8 dst[], const u32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + u32 hi = 0; + + if (is64 && (val & (1<<31))) + hi = (u32)~0; + + emit_ia32_alu_i(is64, false, op, dst_lo, val, dstk, &prog); + if (is64) + emit_ia32_alu_i(is64, true, op, dst_hi, hi, dstk, &prog); + else + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + + *pprog = prog; +} + +/* dst = ~dst (64 bit) */ +static inline void emit_ia32_neg64(const u8 dst[], bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + /* xor ecx,ecx */ + EMIT2(0x31, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + /* sub dreg_lo,ecx */ + EMIT2(0x2B, add_2reg(0xC0, dreg_lo, IA32_ECX)); + /* mov dreg_lo,ecx */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, IA32_ECX)); + + /* xor ecx,ecx */ + EMIT2(0x31, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + /* sbb dreg_hi,ecx */ + EMIT2(0x19, add_2reg(0xC0, dreg_hi, IA32_ECX)); + /* mov dreg_hi,ecx */ + EMIT2(0x89, add_2reg(0xC0, dreg_hi, IA32_ECX)); + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +/* dst = dst << src */ +static inline void emit_ia32_lsh_r64(const u8 dst[], const u8 src[], + bool dstk, bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + static int jmp_label1 = -1; + static int jmp_label2 = -1; + static int jmp_label3 = -1; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? 
IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src_lo)); + else + /* mov ecx,src_lo */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX)); + + /* cmp ecx,32 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32); + /* Jumps when >= 32 */ + if (is_imm8(jmp_label(jmp_label1, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label1, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6)); + + /* < 32 */ + /* shl dreg_hi,cl */ + EMIT2(0xD3, add_1reg(0xE0, dreg_hi)); + /* mov ebx,dreg_lo */ + EMIT2(0x8B, add_2reg(0xC0, dreg_lo, IA32_EBX)); + /* shl dreg_lo,cl */ + EMIT2(0xD3, add_1reg(0xE0, dreg_lo)); + + /* IA32_ECX = -IA32_ECX + 32 */ + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shr ebx,cl */ + EMIT2(0xD3, add_1reg(0xE8, IA32_EBX)); + /* or dreg_hi,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_hi, IA32_EBX)); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 32 */ + if (jmp_label1 == -1) + jmp_label1 = cnt; + + /* cmp ecx,64 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64); + /* Jumps when >= 64 */ + if (is_imm8(jmp_label(jmp_label2, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label2, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6)); + + /* >= 32 && < 64 */ + /* sub ecx,32 */ + EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32); + /* shl dreg_lo,cl */ + EMIT2(0xD3, add_1reg(0xE0, dreg_lo)); + /* mov dreg_hi,dreg_lo */ + EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo)); + + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 64 */ + if (jmp_label2 == -1) + jmp_label2 = cnt; + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + + if (jmp_label3 == -1) + jmp_label3 = cnt; + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + /* out: */ + *pprog = prog; +} + +/* dst = dst >> src (signed)*/ +static inline void emit_ia32_arsh_r64(const u8 dst[], const u8 src[], + bool dstk, bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + static int jmp_label1 = -1; + static int jmp_label2 = -1; + static int jmp_label3 = -1; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? 
IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src_lo)); + else + /* mov ecx,src_lo */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX)); + + /* cmp ecx,32 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32); + /* Jumps when >= 32 */ + if (is_imm8(jmp_label(jmp_label1, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label1, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6)); + + /* < 32 */ + /* lshr dreg_lo,cl */ + EMIT2(0xD3, add_1reg(0xE8, dreg_lo)); + /* mov ebx,dreg_hi */ + EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX)); + /* ashr dreg_hi,cl */ + EMIT2(0xD3, add_1reg(0xF8, dreg_hi)); + + /* IA32_ECX = -IA32_ECX + 32 */ + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shl ebx,cl */ + EMIT2(0xD3, add_1reg(0xE0, IA32_EBX)); + /* or dreg_lo,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX)); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 32 */ + if (jmp_label1 == -1) + jmp_label1 = cnt; + + /* cmp ecx,64 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64); + /* Jumps when >= 64 */ + if (is_imm8(jmp_label(jmp_label2, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label2, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6)); + + /* >= 32 && < 64 */ + /* sub ecx,32 */ + EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32); + /* ashr dreg_hi,cl */ + EMIT2(0xD3, add_1reg(0xF8, dreg_hi)); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 64 */ + if (jmp_label2 == -1) + jmp_label2 = cnt; + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + + if (jmp_label3 == -1) + jmp_label3 = cnt; + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + /* out: */ + *pprog = prog; +} + +/* dst = dst >> src */ +static inline void emit_ia32_rsh_r64(const u8 dst[], const u8 src[], bool dstk, + bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + static int jmp_label1 = -1; + static int jmp_label2 = -1; + static int jmp_label3 = -1; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? 
IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + if (sstk) + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src_lo)); + else + /* mov ecx,src_lo */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX)); + + /* cmp ecx,32 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32); + /* Jumps when >= 32 */ + if (is_imm8(jmp_label(jmp_label1, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label1, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6)); + + /* < 32 */ + /* lshr dreg_lo,cl */ + EMIT2(0xD3, add_1reg(0xE8, dreg_lo)); + /* mov ebx,dreg_hi */ + EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX)); + /* shr dreg_hi,cl */ + EMIT2(0xD3, add_1reg(0xE8, dreg_hi)); + + /* IA32_ECX = -IA32_ECX + 32 */ + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shl ebx,cl */ + EMIT2(0xD3, add_1reg(0xE0, IA32_EBX)); + /* or dreg_lo,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX)); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 32 */ + if (jmp_label1 == -1) + jmp_label1 = cnt; + /* cmp ecx,64 */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64); + /* Jumps when >= 64 */ + if (is_imm8(jmp_label(jmp_label2, 2))) + EMIT2(IA32_JAE, jmp_label(jmp_label2, 2)); + else + EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6)); + + /* >= 32 && < 64 */ + /* sub ecx,32 */ + EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32); + /* shr dreg_hi,cl */ + EMIT2(0xD3, add_1reg(0xE8, dreg_hi)); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + + /* goto out; */ + if (is_imm8(jmp_label(jmp_label3, 2))) + EMIT2(0xEB, jmp_label(jmp_label3, 2)); + else + EMIT1_off32(0xE9, jmp_label(jmp_label3, 5)); + + /* >= 64 */ + if (jmp_label2 == -1) + jmp_label2 = cnt; + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + + if (jmp_label3 == -1) + jmp_label3 = cnt; + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + /* out: */ + *pprog = prog; +} + +/* dst = dst << val */ +static inline void emit_ia32_lsh_i64(const u8 dst[], const u32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? 
IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + /* Do LSH operation */ + if (val < 32) { + /* shl dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xE0, dreg_hi), val); + /* mov ebx,dreg_lo */ + EMIT2(0x8B, add_2reg(0xC0, dreg_lo, IA32_EBX)); + /* shl dreg_lo,imm8 */ + EMIT3(0xC1, add_1reg(0xE0, dreg_lo), val); + + /* IA32_ECX = 32 - val */ + /* mov ecx,val */ + EMIT2(0xB1, val); + /* movzx ecx,ecx */ + EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shr ebx,cl */ + EMIT2(0xD3, add_1reg(0xE8, IA32_EBX)); + /* or dreg_hi,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_hi, IA32_EBX)); + } else if (val >= 32 && val < 64) { + u32 value = val - 32; + + /* shl dreg_lo,imm8 */ + EMIT3(0xC1, add_1reg(0xE0, dreg_lo), value); + /* mov dreg_hi,dreg_lo */ + EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo)); + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + } else { + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + } + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +/* dst = dst >> val */ +static inline void emit_ia32_rsh_i64(const u8 dst[], const u32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + /* Do RSH operation */ + if (val < 32) { + /* shr dreg_lo,imm8 */ + EMIT3(0xC1, add_1reg(0xE8, dreg_lo), val); + /* mov ebx,dreg_hi */ + EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX)); + /* shr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xE8, dreg_hi), val); + + /* IA32_ECX = 32 - val */ + /* mov ecx,val */ + EMIT2(0xB1, val); + /* movzx ecx,ecx */ + EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shl ebx,cl */ + EMIT2(0xD3, add_1reg(0xE0, IA32_EBX)); + /* or dreg_lo,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX)); + } else if (val >= 32 && val < 64) { + u32 value = val - 32; + + /* shr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xE8, dreg_hi), value); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + } else { + /* xor dreg_lo,dreg_lo */ + EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo)); + /* xor dreg_hi,dreg_hi */ + EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi)); + } + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +/* dst = dst >> val (signed) */ +static inline void emit_ia32_arsh_i64(const u8 dst[], const u32 val, + bool dstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + u8 dreg_lo = dstk ? 
IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + /* Do RSH operation */ + if (val < 32) { + /* shr dreg_lo,imm8 */ + EMIT3(0xC1, add_1reg(0xE8, dreg_lo), val); + /* mov ebx,dreg_hi */ + EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX)); + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), val); + + /* IA32_ECX = 32 - val */ + /* mov ecx,val */ + EMIT2(0xB1, val); + /* movzx ecx,ecx */ + EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + /* neg ecx */ + EMIT2(0xF7, add_1reg(0xD8, IA32_ECX)); + /* add ecx,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32); + + /* shl ebx,cl */ + EMIT2(0xD3, add_1reg(0xE0, IA32_EBX)); + /* or dreg_lo,ebx */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX)); + } else if (val >= 32 && val < 64) { + u32 value = val - 32; + + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), value); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31); + } else { + /* ashr dreg_hi,imm8 */ + EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31); + /* mov dreg_lo,dreg_hi */ + EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi)); + } + + if (dstk) { + /* mov dword ptr [ebp+off],dreg_lo */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],dreg_hi */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi), + STACK_VAR(dst_hi)); + } + *pprog = prog; +} + +static inline void emit_ia32_mul_r64(const u8 dst[], const u8 src[], bool dstk, + bool sstk, u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_hi)); + else + /* mov eax,dst_hi */ + EMIT2(0x8B, add_2reg(0xC0, dst_hi, IA32_EAX)); + + if (sstk) + /* mul dword ptr [ebp+off] */ + EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo)); + else + /* mul src_lo */ + EMIT2(0xF7, add_1reg(0xE0, src_lo)); + + /* mov ecx,eax */ + EMIT2(0x89, add_2reg(0xC0, IA32_ECX, IA32_EAX)); + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + else + /* mov eax,dst_lo */ + EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX)); + + if (sstk) + /* mul dword ptr [ebp+off] */ + EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_hi)); + else + /* mul src_hi */ + EMIT2(0xF7, add_1reg(0xE0, src_hi)); + + /* add eax,eax */ + EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EAX)); + + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + else + /* mov eax,dst_lo */ + EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX)); + + if (sstk) + /* mul dword ptr [ebp+off] */ + EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo)); + else + /* mul src_lo */ + EMIT2(0xF7, add_1reg(0xE0, src_lo)); + + /* add ecx,edx */ + EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EDX)); + + if (dstk) { + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + /* mov dword ptr [ebp+off],ecx */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(dst_hi)); + } else { + /* mov dst_lo,eax */ + EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EAX)); + /* mov dst_hi,ecx */ + EMIT2(0x89, add_2reg(0xC0, dst_hi, IA32_ECX)); + } + + *pprog = prog; +} + +static inline void emit_ia32_mul_i64(const u8 dst[], const 
u32 val,
+				     bool dstk, u8 **pprog)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+	u32 hi;
+
+	hi = val & (1<<31) ? (u32)~0 : 0;
+	/* movl eax,imm32 */
+	EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), val);
+	if (dstk)
+		/* mul dword ptr [ebp+off] */
+		EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_hi));
+	else
+		/* mul dst_hi */
+		EMIT2(0xF7, add_1reg(0xE0, dst_hi));
+
+	/* mov ecx,eax */
+	EMIT2(0x89, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+	/* movl eax,imm32 */
+	EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), hi);
+	if (dstk)
+		/* mul dword ptr [ebp+off] */
+		EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+	else
+		/* mul dst_lo */
+		EMIT2(0xF7, add_1reg(0xE0, dst_lo));
+	/* add ecx,eax */
+	EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+	/* movl eax,imm32 */
+	EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), val);
+	if (dstk)
+		/* mul dword ptr [ebp+off] */
+		EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+	else
+		/* mul dst_lo */
+		EMIT2(0xF7, add_1reg(0xE0, dst_lo));
+
+	/* add ecx,edx */
+	EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EDX));
+
+	if (dstk) {
+		/* mov dword ptr [ebp+off],eax */
+		EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+		      STACK_VAR(dst_lo));
+		/* mov dword ptr [ebp+off],ecx */
+		EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX),
+		      STACK_VAR(dst_hi));
+	} else {
+		/* mov dst_lo,eax */
+		EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EAX));
+		/* mov dst_hi,ecx */
+		EMIT2(0x89, add_2reg(0xC0, dst_hi, IA32_ECX));
+	}
+
+	*pprog = prog;
+}
+
+static int bpf_size_to_x86_bytes(int bpf_size)
+{
+	if (bpf_size == BPF_W)
+		return 4;
+	else if (bpf_size == BPF_H)
+		return 2;
+	else if (bpf_size == BPF_B)
+		return 1;
+	else if (bpf_size == BPF_DW)
+		return 4; /* imm32 */
+	else
+		return 0;
+}
+
+struct jit_context {
+	int cleanup_addr; /* Epilogue code offset */
+};
+
+/* Maximum number of bytes emitted while JITing one eBPF insn */
+#define BPF_MAX_INSN_SIZE	128
+#define BPF_INSN_SAFETY		64
+
+#define PROLOGUE_SIZE 35
+
+/*
+ * Emit prologue code for BPF program and check its size.
+ * bpf_tail_call helper will skip it while jumping into another program.
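+ * (PROLOGUE_SIZE above therefore has to match the emitted byte count
+ * exactly; the BUILD_BUG_ON() at the end of emit_prologue() enforces
+ * this at compile time.)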
+ */ +static void emit_prologue(u8 **pprog, u32 stack_depth) +{ + u8 *prog = *pprog; + int cnt = 0; + const u8 *r1 = bpf2ia32[BPF_REG_1]; + const u8 fplo = bpf2ia32[BPF_REG_FP][0]; + const u8 fphi = bpf2ia32[BPF_REG_FP][1]; + const u8 *tcc = bpf2ia32[TCALL_CNT]; + + /* push ebp */ + EMIT1(0x55); + /* mov ebp,esp */ + EMIT2(0x89, 0xE5); + /* push edi */ + EMIT1(0x57); + /* push esi */ + EMIT1(0x56); + /* push ebx */ + EMIT1(0x53); + + /* sub esp,STACK_SIZE */ + EMIT2_off32(0x81, 0xEC, STACK_SIZE); + /* sub ebp,SCRATCH_SIZE+4+12*/ + EMIT3(0x83, add_1reg(0xE8, IA32_EBP), SCRATCH_SIZE + 16); + /* xor ebx,ebx */ + EMIT2(0x31, add_2reg(0xC0, IA32_EBX, IA32_EBX)); + + /* Set up BPF prog stack base register */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBP), STACK_VAR(fplo)); + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(fphi)); + + /* Move BPF_CTX (EAX) to BPF_REG_R1 */ + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r1[0])); + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(r1[1])); + + /* Initialize Tail Count */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[0])); + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1])); + + BUILD_BUG_ON(cnt != PROLOGUE_SIZE); + *pprog = prog; +} + +/* Emit epilogue code for BPF program */ +static void emit_epilogue(u8 **pprog, u32 stack_depth) +{ + u8 *prog = *pprog; + const u8 *r0 = bpf2ia32[BPF_REG_0]; + int cnt = 0; + + /* mov eax,dword ptr [ebp+off]*/ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r0[0])); + /* mov edx,dword ptr [ebp+off]*/ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r0[1])); + + /* add ebp,SCRATCH_SIZE+4+12*/ + EMIT3(0x83, add_1reg(0xC0, IA32_EBP), SCRATCH_SIZE + 16); + + /* mov ebx,dword ptr [ebp-12]*/ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), -12); + /* mov esi,dword ptr [ebp-8]*/ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ESI), -8); + /* mov edi,dword ptr [ebp-4]*/ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDI), -4); + + EMIT1(0xC9); /* leave */ + EMIT1(0xC3); /* ret */ + *pprog = prog; +} + +/* + * Generate the following code: + * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ... 
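+ * (On ia32, tail_call_cnt is kept as a 64-bit value in scratch stack
+ * space, which is why the count is compared and incremented as a
+ * low/high word pair below.)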
+ * if (index >= array->map.max_entries) + * goto out; + * if (++tail_call_cnt > MAX_TAIL_CALL_CNT) + * goto out; + * prog = array->ptrs[index]; + * if (prog == NULL) + * goto out; + * goto *(prog->bpf_func + prologue_size); + * out: + */ +static void emit_bpf_tail_call(u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + const u8 *r1 = bpf2ia32[BPF_REG_1]; + const u8 *r2 = bpf2ia32[BPF_REG_2]; + const u8 *r3 = bpf2ia32[BPF_REG_3]; + const u8 *tcc = bpf2ia32[TCALL_CNT]; + u32 lo, hi; + static int jmp_label1 = -1; + + /* + * if (index >= array->map.max_entries) + * goto out; + */ + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r2[0])); + /* mov edx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r3[0])); + + /* cmp dword ptr [eax+off],edx */ + EMIT3(0x39, add_2reg(0x40, IA32_EAX, IA32_EDX), + offsetof(struct bpf_array, map.max_entries)); + /* jbe out */ + EMIT2(IA32_JBE, jmp_label(jmp_label1, 2)); + + /* + * if (tail_call_cnt > MAX_TAIL_CALL_CNT) + * goto out; + */ + lo = (u32)MAX_TAIL_CALL_CNT; + hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(tcc[0])); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1])); + + /* cmp edx,hi */ + EMIT3(0x83, add_1reg(0xF8, IA32_EBX), hi); + EMIT2(IA32_JNE, 3); + /* cmp ecx,lo */ + EMIT3(0x83, add_1reg(0xF8, IA32_ECX), lo); + + /* ja out */ + EMIT2(IA32_JAE, jmp_label(jmp_label1, 2)); + + /* add eax,0x1 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 0x01); + /* adc ebx,0x0 */ + EMIT3(0x83, add_1reg(0xD0, IA32_EBX), 0x00); + + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(tcc[0])); + /* mov dword ptr [ebp+off],edx */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1])); + + /* prog = array->ptrs[index]; */ + /* mov edx, [eax + edx * 4 + offsetof(...)] */ + EMIT3_off32(0x8B, 0x94, 0x90, offsetof(struct bpf_array, ptrs)); + + /* + * if (prog == NULL) + * goto out; + */ + /* test edx,edx */ + EMIT2(0x85, add_2reg(0xC0, IA32_EDX, IA32_EDX)); + /* je out */ + EMIT2(IA32_JE, jmp_label(jmp_label1, 2)); + + /* goto *(prog->bpf_func + prologue_size); */ + /* mov edx, dword ptr [edx + 32] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EDX, IA32_EDX), + offsetof(struct bpf_prog, bpf_func)); + /* add edx,prologue_size */ + EMIT3(0x83, add_1reg(0xC0, IA32_EDX), PROLOGUE_SIZE); + + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r1[0])); + + /* + * Now we're ready to jump into next BPF program: + * eax == ctx (1st arg) + * edx == prog->bpf_func + prologue_size + */ + RETPOLINE_EDX_BPF_JIT(); + + if (jmp_label1 == -1) + jmp_label1 = cnt; + + /* out: */ + *pprog = prog; +} + +/* Push the scratch stack register on top of the stack. 
*/ +static inline void emit_push_r64(const u8 src[], u8 **pprog) +{ + u8 *prog = *pprog; + int cnt = 0; + + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_hi)); + /* push ecx */ + EMIT1(0x51); + + /* mov ecx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo)); + /* push ecx */ + EMIT1(0x51); + + *pprog = prog; +} + +static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, + int oldproglen, struct jit_context *ctx) +{ + struct bpf_insn *insn = bpf_prog->insnsi; + int insn_cnt = bpf_prog->len; + bool seen_exit = false; + u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY]; + int i, cnt = 0; + int proglen = 0; + u8 *prog = temp; + + emit_prologue(&prog, bpf_prog->aux->stack_depth); + + for (i = 0; i < insn_cnt; i++, insn++) { + const s32 imm32 = insn->imm; + const bool is64 = BPF_CLASS(insn->code) == BPF_ALU64; + const bool dstk = insn->dst_reg == BPF_REG_AX ? false : true; + const bool sstk = insn->src_reg == BPF_REG_AX ? false : true; + const u8 code = insn->code; + const u8 *dst = bpf2ia32[insn->dst_reg]; + const u8 *src = bpf2ia32[insn->src_reg]; + const u8 *r0 = bpf2ia32[BPF_REG_0]; + s64 jmp_offset; + u8 jmp_cond; + int ilen; + u8 *func; + + switch (code) { + /* ALU operations */ + /* dst = src */ + case BPF_ALU | BPF_MOV | BPF_K: + case BPF_ALU | BPF_MOV | BPF_X: + case BPF_ALU64 | BPF_MOV | BPF_K: + case BPF_ALU64 | BPF_MOV | BPF_X: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_mov_r64(is64, dst, src, dstk, + sstk, &prog); + break; + case BPF_K: + /* Sign-extend immediate value to dst reg */ + emit_ia32_mov_i64(is64, dst, imm32, + dstk, &prog); + break; + } + break; + /* dst = dst + src/imm */ + /* dst = dst - src/imm */ + /* dst = dst | src/imm */ + /* dst = dst & src/imm */ + /* dst = dst ^ src/imm */ + /* dst = dst * src/imm */ + /* dst = dst << src */ + /* dst = dst >> src */ + case BPF_ALU | BPF_ADD | BPF_K: + case BPF_ALU | BPF_ADD | BPF_X: + case BPF_ALU | BPF_SUB | BPF_K: + case BPF_ALU | BPF_SUB | BPF_X: + case BPF_ALU | BPF_OR | BPF_K: + case BPF_ALU | BPF_OR | BPF_X: + case BPF_ALU | BPF_AND | BPF_K: + case BPF_ALU | BPF_AND | BPF_X: + case BPF_ALU | BPF_XOR | BPF_K: + case BPF_ALU | BPF_XOR | BPF_X: + case BPF_ALU64 | BPF_ADD | BPF_K: + case BPF_ALU64 | BPF_ADD | BPF_X: + case BPF_ALU64 | BPF_SUB | BPF_K: + case BPF_ALU64 | BPF_SUB | BPF_X: + case BPF_ALU64 | BPF_OR | BPF_K: + case BPF_ALU64 | BPF_OR | BPF_X: + case BPF_ALU64 | BPF_AND | BPF_K: + case BPF_ALU64 | BPF_AND | BPF_X: + case BPF_ALU64 | BPF_XOR | BPF_K: + case BPF_ALU64 | BPF_XOR | BPF_X: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_alu_r64(is64, BPF_OP(code), dst, + src, dstk, sstk, &prog); + break; + case BPF_K: + emit_ia32_alu_i64(is64, BPF_OP(code), dst, + imm32, dstk, &prog); + break; + } + break; + case BPF_ALU | BPF_MUL | BPF_K: + case BPF_ALU | BPF_MUL | BPF_X: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_mul_r(dst_lo, src_lo, dstk, + sstk, &prog); + break; + case BPF_K: + /* mov ecx,imm32*/ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), + imm32); + emit_ia32_mul_r(dst_lo, IA32_ECX, dstk, + false, &prog); + break; + } + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + break; + case BPF_ALU | BPF_LSH | BPF_X: + case BPF_ALU | BPF_RSH | BPF_X: + case BPF_ALU | BPF_ARSH | BPF_K: + case BPF_ALU | BPF_ARSH | BPF_X: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_shift_r(BPF_OP(code), dst_lo, src_lo, + dstk, sstk, &prog); + break; + case BPF_K: + /* mov ecx,imm32*/ + EMIT2_off32(0xC7, 
add_1reg(0xC0, IA32_ECX), + imm32); + emit_ia32_shift_r(BPF_OP(code), dst_lo, + IA32_ECX, dstk, false, + &prog); + break; + } + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + break; + /* dst = dst / src(imm) */ + /* dst = dst % src(imm) */ + case BPF_ALU | BPF_DIV | BPF_K: + case BPF_ALU | BPF_DIV | BPF_X: + case BPF_ALU | BPF_MOD | BPF_K: + case BPF_ALU | BPF_MOD | BPF_X: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_div_mod_r(BPF_OP(code), dst_lo, + src_lo, dstk, sstk, &prog); + break; + case BPF_K: + /* mov ecx,imm32*/ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), + imm32); + emit_ia32_div_mod_r(BPF_OP(code), dst_lo, + IA32_ECX, dstk, false, + &prog); + break; + } + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + break; + case BPF_ALU64 | BPF_DIV | BPF_K: + case BPF_ALU64 | BPF_DIV | BPF_X: + case BPF_ALU64 | BPF_MOD | BPF_K: + case BPF_ALU64 | BPF_MOD | BPF_X: + goto notyet; + /* dst = dst >> imm */ + /* dst = dst << imm */ + case BPF_ALU | BPF_RSH | BPF_K: + case BPF_ALU | BPF_LSH | BPF_K: + if (unlikely(imm32 > 31)) + return -EINVAL; + /* mov ecx,imm32*/ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32); + emit_ia32_shift_r(BPF_OP(code), dst_lo, IA32_ECX, dstk, + false, &prog); + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + break; + /* dst = dst << imm */ + case BPF_ALU64 | BPF_LSH | BPF_K: + if (unlikely(imm32 > 63)) + return -EINVAL; + emit_ia32_lsh_i64(dst, imm32, dstk, &prog); + break; + /* dst = dst >> imm */ + case BPF_ALU64 | BPF_RSH | BPF_K: + if (unlikely(imm32 > 63)) + return -EINVAL; + emit_ia32_rsh_i64(dst, imm32, dstk, &prog); + break; + /* dst = dst << src */ + case BPF_ALU64 | BPF_LSH | BPF_X: + emit_ia32_lsh_r64(dst, src, dstk, sstk, &prog); + break; + /* dst = dst >> src */ + case BPF_ALU64 | BPF_RSH | BPF_X: + emit_ia32_rsh_r64(dst, src, dstk, sstk, &prog); + break; + /* dst = dst >> src (signed) */ + case BPF_ALU64 | BPF_ARSH | BPF_X: + emit_ia32_arsh_r64(dst, src, dstk, sstk, &prog); + break; + /* dst = dst >> imm (signed) */ + case BPF_ALU64 | BPF_ARSH | BPF_K: + if (unlikely(imm32 > 63)) + return -EINVAL; + emit_ia32_arsh_i64(dst, imm32, dstk, &prog); + break; + /* dst = ~dst */ + case BPF_ALU | BPF_NEG: + emit_ia32_alu_i(is64, false, BPF_OP(code), + dst_lo, 0, dstk, &prog); + emit_ia32_mov_i(dst_hi, 0, dstk, &prog); + break; + /* dst = ~dst (64 bit) */ + case BPF_ALU64 | BPF_NEG: + emit_ia32_neg64(dst, dstk, &prog); + break; + /* dst = dst * src/imm */ + case BPF_ALU64 | BPF_MUL | BPF_X: + case BPF_ALU64 | BPF_MUL | BPF_K: + switch (BPF_SRC(code)) { + case BPF_X: + emit_ia32_mul_r64(dst, src, dstk, sstk, &prog); + break; + case BPF_K: + emit_ia32_mul_i64(dst, imm32, dstk, &prog); + break; + } + break; + /* dst = htole(dst) */ + case BPF_ALU | BPF_END | BPF_FROM_LE: + emit_ia32_to_le_r64(dst, imm32, dstk, &prog); + break; + /* dst = htobe(dst) */ + case BPF_ALU | BPF_END | BPF_FROM_BE: + emit_ia32_to_be_r64(dst, imm32, dstk, &prog); + break; + /* dst = imm64 */ + case BPF_LD | BPF_IMM | BPF_DW: { + s32 hi, lo = imm32; + + hi = insn[1].imm; + emit_ia32_mov_i(dst_lo, lo, dstk, &prog); + emit_ia32_mov_i(dst_hi, hi, dstk, &prog); + insn++; + i++; + break; + } + /* ST: *(u8*)(dst_reg + off) = imm */ + case BPF_ST | BPF_MEM | BPF_H: + case BPF_ST | BPF_MEM | BPF_B: + case BPF_ST | BPF_MEM | BPF_W: + case BPF_ST | BPF_MEM | BPF_DW: + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + else + /* mov eax,dst_lo */ + EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX)); + + switch (BPF_SIZE(code)) { + case BPF_B: 
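+				/* Opcode selects the immediate-store width:
+				 * 0xC6 = mov byte ptr [r/m],imm8, 0x66 0xC7 =
+				 * mov word ptr [r/m],imm16 and plain 0xC7 =
+				 * mov dword ptr [r/m],imm32 (emitted twice
+				 * for BPF_DW, once per 32-bit half).
+				 */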
+ EMIT(0xC6, 1); break; + case BPF_H: + EMIT2(0x66, 0xC7); break; + case BPF_W: + case BPF_DW: + EMIT(0xC7, 1); break; + } + + if (is_imm8(insn->off)) + EMIT2(add_1reg(0x40, IA32_EAX), insn->off); + else + EMIT1_off32(add_1reg(0x80, IA32_EAX), + insn->off); + EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(code))); + + if (BPF_SIZE(code) == BPF_DW) { + u32 hi; + + hi = imm32 & (1<<31) ? (u32)~0 : 0; + EMIT2_off32(0xC7, add_1reg(0x80, IA32_EAX), + insn->off + 4); + EMIT(hi, 4); + } + break; + + /* STX: *(u8*)(dst_reg + off) = src_reg */ + case BPF_STX | BPF_MEM | BPF_B: + case BPF_STX | BPF_MEM | BPF_H: + case BPF_STX | BPF_MEM | BPF_W: + case BPF_STX | BPF_MEM | BPF_DW: + if (dstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + else + /* mov eax,dst_lo */ + EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX)); + + if (sstk) + /* mov edx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(src_lo)); + else + /* mov edx,src_lo */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_EDX)); + + switch (BPF_SIZE(code)) { + case BPF_B: + EMIT(0x88, 1); break; + case BPF_H: + EMIT2(0x66, 0x89); break; + case BPF_W: + case BPF_DW: + EMIT(0x89, 1); break; + } + + if (is_imm8(insn->off)) + EMIT2(add_2reg(0x40, IA32_EAX, IA32_EDX), + insn->off); + else + EMIT1_off32(add_2reg(0x80, IA32_EAX, IA32_EDX), + insn->off); + + if (BPF_SIZE(code) == BPF_DW) { + if (sstk) + /* mov edi,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, + IA32_EDX), + STACK_VAR(src_hi)); + else + /* mov edi,src_hi */ + EMIT2(0x8B, add_2reg(0xC0, src_hi, + IA32_EDX)); + EMIT1(0x89); + if (is_imm8(insn->off + 4)) { + EMIT2(add_2reg(0x40, IA32_EAX, + IA32_EDX), + insn->off + 4); + } else { + EMIT1(add_2reg(0x80, IA32_EAX, + IA32_EDX)); + EMIT(insn->off + 4, 4); + } + } + break; + + /* LDX: dst_reg = *(u8*)(src_reg + off) */ + case BPF_LDX | BPF_MEM | BPF_B: + case BPF_LDX | BPF_MEM | BPF_H: + case BPF_LDX | BPF_MEM | BPF_W: + case BPF_LDX | BPF_MEM | BPF_DW: + if (sstk) + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(src_lo)); + else + /* mov eax,dword ptr [ebp+off] */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_EAX)); + + switch (BPF_SIZE(code)) { + case BPF_B: + EMIT2(0x0F, 0xB6); break; + case BPF_H: + EMIT2(0x0F, 0xB7); break; + case BPF_W: + case BPF_DW: + EMIT(0x8B, 1); break; + } + + if (is_imm8(insn->off)) + EMIT2(add_2reg(0x40, IA32_EAX, IA32_EDX), + insn->off); + else + EMIT1_off32(add_2reg(0x80, IA32_EAX, IA32_EDX), + insn->off); + + if (dstk) + /* mov dword ptr [ebp+off],edx */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_lo)); + else + /* mov dst_lo,edx */ + EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EDX)); + switch (BPF_SIZE(code)) { + case BPF_B: + case BPF_H: + case BPF_W: + if (dstk) { + EMIT3(0xC7, add_1reg(0x40, IA32_EBP), + STACK_VAR(dst_hi)); + EMIT(0x0, 4); + } else { + EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0); + } + break; + case BPF_DW: + EMIT2_off32(0x8B, + add_2reg(0x80, IA32_EAX, IA32_EDX), + insn->off + 4); + if (dstk) + EMIT3(0x89, + add_2reg(0x40, IA32_EBP, + IA32_EDX), + STACK_VAR(dst_hi)); + else + EMIT2(0x89, + add_2reg(0xC0, dst_hi, IA32_EDX)); + break; + default: + break; + } + break; + /* call */ + case BPF_JMP | BPF_CALL: + { + const u8 *r1 = bpf2ia32[BPF_REG_1]; + const u8 *r2 = bpf2ia32[BPF_REG_2]; + const u8 *r3 = bpf2ia32[BPF_REG_3]; + const u8 *r4 = bpf2ia32[BPF_REG_4]; + const u8 *r5 = bpf2ia32[BPF_REG_5]; + + if (insn->src_reg == 
BPF_PSEUDO_CALL) + goto notyet; + + func = (u8 *) __bpf_call_base + imm32; + jmp_offset = func - (image + addrs[i]); + + if (!imm32 || !is_simm32(jmp_offset)) { + pr_err("unsupported BPF func %d addr %p image %p\n", + imm32, func, image); + return -EINVAL; + } + + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(r1[0])); + /* mov edx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(r1[1])); + + emit_push_r64(r5, &prog); + emit_push_r64(r4, &prog); + emit_push_r64(r3, &prog); + emit_push_r64(r2, &prog); + + EMIT1_off32(0xE8, jmp_offset + 9); + + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(r0[0])); + /* mov dword ptr [ebp+off],edx */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(r0[1])); + + /* add esp,32 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 32); + break; + } + case BPF_JMP | BPF_TAIL_CALL: + emit_bpf_tail_call(&prog); + break; + + /* cond jump */ + case BPF_JMP | BPF_JEQ | BPF_X: + case BPF_JMP | BPF_JNE | BPF_X: + case BPF_JMP | BPF_JGT | BPF_X: + case BPF_JMP | BPF_JLT | BPF_X: + case BPF_JMP | BPF_JGE | BPF_X: + case BPF_JMP | BPF_JLE | BPF_X: + case BPF_JMP | BPF_JSGT | BPF_X: + case BPF_JMP | BPF_JSLE | BPF_X: + case BPF_JMP | BPF_JSLT | BPF_X: + case BPF_JMP | BPF_JSGE | BPF_X: { + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + u8 sreg_lo = sstk ? IA32_ECX : src_lo; + u8 sreg_hi = sstk ? IA32_EBX : src_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + if (sstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), + STACK_VAR(src_hi)); + } + + /* cmp dreg_hi,sreg_hi */ + EMIT2(0x39, add_2reg(0xC0, dreg_hi, sreg_hi)); + EMIT2(IA32_JNE, 2); + /* cmp dreg_lo,sreg_lo */ + EMIT2(0x39, add_2reg(0xC0, dreg_lo, sreg_lo)); + goto emit_cond_jmp; + } + case BPF_JMP | BPF_JSET | BPF_X: { + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + u8 sreg_lo = sstk ? IA32_ECX : src_lo; + u8 sreg_hi = sstk ? IA32_EBX : src_hi; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + if (sstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), + STACK_VAR(src_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), + STACK_VAR(src_hi)); + } + /* and dreg_lo,sreg_lo */ + EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo)); + /* and dreg_hi,sreg_hi */ + EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi)); + /* or dreg_lo,dreg_hi */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi)); + goto emit_cond_jmp; + } + case BPF_JMP | BPF_JSET | BPF_K: { + u32 hi; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + u8 sreg_lo = IA32_ECX; + u8 sreg_hi = IA32_EBX; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + hi = imm32 & (1<<31) ? 
(u32)~0 : 0; + + /* mov ecx,imm32 */ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32); + /* mov ebx,imm32 */ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX), hi); + + /* and dreg_lo,sreg_lo */ + EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo)); + /* and dreg_hi,sreg_hi */ + EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi)); + /* or dreg_lo,dreg_hi */ + EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi)); + goto emit_cond_jmp; + } + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JNE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JLT | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JLE | BPF_K: + case BPF_JMP | BPF_JSGT | BPF_K: + case BPF_JMP | BPF_JSLE | BPF_K: + case BPF_JMP | BPF_JSLT | BPF_K: + case BPF_JMP | BPF_JSGE | BPF_K: { + u32 hi; + u8 dreg_lo = dstk ? IA32_EAX : dst_lo; + u8 dreg_hi = dstk ? IA32_EDX : dst_hi; + u8 sreg_lo = IA32_ECX; + u8 sreg_hi = IA32_EBX; + + if (dstk) { + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(dst_lo)); + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(dst_hi)); + } + + hi = imm32 & (1<<31) ? (u32)~0 : 0; + /* mov ecx,imm32 */ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32); + /* mov ebx,imm32 */ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX), hi); + + /* cmp dreg_hi,sreg_hi */ + EMIT2(0x39, add_2reg(0xC0, dreg_hi, sreg_hi)); + EMIT2(IA32_JNE, 2); + /* cmp dreg_lo,sreg_lo */ + EMIT2(0x39, add_2reg(0xC0, dreg_lo, sreg_lo)); + +emit_cond_jmp: /* Convert BPF opcode to x86 */ + switch (BPF_OP(code)) { + case BPF_JEQ: + jmp_cond = IA32_JE; + break; + case BPF_JSET: + case BPF_JNE: + jmp_cond = IA32_JNE; + break; + case BPF_JGT: + /* GT is unsigned '>', JA in x86 */ + jmp_cond = IA32_JA; + break; + case BPF_JLT: + /* LT is unsigned '<', JB in x86 */ + jmp_cond = IA32_JB; + break; + case BPF_JGE: + /* GE is unsigned '>=', JAE in x86 */ + jmp_cond = IA32_JAE; + break; + case BPF_JLE: + /* LE is unsigned '<=', JBE in x86 */ + jmp_cond = IA32_JBE; + break; + case BPF_JSGT: + /* Signed '>', GT in x86 */ + jmp_cond = IA32_JG; + break; + case BPF_JSLT: + /* Signed '<', LT in x86 */ + jmp_cond = IA32_JL; + break; + case BPF_JSGE: + /* Signed '>=', GE in x86 */ + jmp_cond = IA32_JGE; + break; + case BPF_JSLE: + /* Signed '<=', LE in x86 */ + jmp_cond = IA32_JLE; + break; + default: /* to silence GCC warning */ + return -EFAULT; + } + jmp_offset = addrs[i + insn->off] - addrs[i]; + if (is_imm8(jmp_offset)) { + EMIT2(jmp_cond, jmp_offset); + } else if (is_simm32(jmp_offset)) { + EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset); + } else { + pr_err("cond_jmp gen bug %llx\n", jmp_offset); + return -EFAULT; + } + + break; + } + case BPF_JMP | BPF_JA: + if (insn->off == -1) + /* -1 jmp instructions will always jump + * backwards two bytes. Explicitly handling + * this case avoids wasting too many passes + * when there are long sequences of replaced + * dead code. 
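+				 * (The verifier rewrites dead instructions
+				 * as 'goto -1', which is where such
+				 * sequences come from.)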
+ */ + jmp_offset = -2; + else + jmp_offset = addrs[i + insn->off] - addrs[i]; + + if (!jmp_offset) + /* Optimize out nop jumps */ + break; +emit_jmp: + if (is_imm8(jmp_offset)) { + EMIT2(0xEB, jmp_offset); + } else if (is_simm32(jmp_offset)) { + EMIT1_off32(0xE9, jmp_offset); + } else { + pr_err("jmp gen bug %llx\n", jmp_offset); + return -EFAULT; + } + break; + + case BPF_LD | BPF_ABS | BPF_W: + case BPF_LD | BPF_ABS | BPF_H: + case BPF_LD | BPF_ABS | BPF_B: + case BPF_LD | BPF_IND | BPF_W: + case BPF_LD | BPF_IND | BPF_H: + case BPF_LD | BPF_IND | BPF_B: + { + int size; + const u8 *r6 = bpf2ia32[BPF_REG_6]; + + /* Setting up first argument */ + /* mov eax,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(r6[0])); + + /* Setting up second argument */ + if (BPF_MODE(code) == BPF_ABS) { + /* mov %edx, imm32 */ + EMIT1_off32(0xBA, imm32); + } else { + if (sstk) + /* mov edx,dword ptr [ebp+off] */ + EMIT3(0x8B, add_2reg(0x40, IA32_EBP, + IA32_EDX), + STACK_VAR(src_lo)); + else + /* mov edx,src_lo */ + EMIT2(0x8B, add_2reg(0xC0, src_lo, + IA32_EDX)); + if (imm32) { + if (is_imm8(imm32)) + /* add %edx,imm8 */ + EMIT3(0x83, 0xC2, imm32); + else + /* add %edx,imm32 */ + EMIT2_off32(0x81, 0xC2, imm32); + } + } + + /* Setting up third argument */ + switch (BPF_SIZE(code)) { + case BPF_W: + size = 4; + break; + case BPF_H: + size = 2; + break; + case BPF_B: + size = 1; + break; + default: + return -EINVAL; + } + /* mov ecx,val */ + EMIT2(0xB1, size); + /* movzx ecx,ecx */ + EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX)); + + /* mov ebx,ebp */ + EMIT2(0x8B, add_2reg(0xC0, IA32_EBP, IA32_EBX)); + /* add %ebx,imm8 */ + EMIT3(0x83, add_1reg(0xC0, IA32_EBX), SKB_BUFFER); + /* push ebx */ + EMIT1(0x53); + + /* Setting up function pointer to call */ + /* mov ebx,imm32*/ + EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX), + (unsigned int)bpf_load_pointer); + + EMIT2(0xFF, add_1reg(0xD0, IA32_EBX)); + /* add %esp,4 */ + EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 4); + /* xor edx,edx */ + EMIT2(0x33, add_2reg(0xC0, IA32_EDX, IA32_EDX)); + + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(r0[0])); + /* mov dword ptr [ebp+off],edx */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX), + STACK_VAR(r0[1])); + + /* + * Check if return address is NULL or not. 
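+			 * (bpf_load_pointer() returns NULL when the
+			 * requested bytes cannot be fetched from the
+			 * packet.)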
+ * If NULL then jump to epilogue else continue + * to load the value from retn address + */ + EMIT3(0x83, add_1reg(0xF8, IA32_EAX), 0); + jmp_offset = ctx->cleanup_addr - addrs[i]; + + switch (BPF_SIZE(code)) { + case BPF_W: + jmp_offset += 7; + break; + case BPF_H: + jmp_offset += 10; + break; + case BPF_B: + jmp_offset += 6; + break; + } + + EMIT2_off32(0x0F, IA32_JE + 0x10, jmp_offset); + /* Load value from the address */ + switch (BPF_SIZE(code)) { + case BPF_W: + /* mov eax,[eax] */ + EMIT2(0x8B, 0x0); + /* Emit 'bswap eax' */ + EMIT2(0x0F, add_1reg(0xC8, IA32_EAX)); + break; + case BPF_H: + EMIT3(0x0F, 0xB7, 0x0); + EMIT1(0x66); + EMIT3(0xC1, add_1reg(0xC8, IA32_EAX), 8); + break; + case BPF_B: + EMIT3(0x0F, 0xB6, 0x0); + break; + } + + /* mov dword ptr [ebp+off],eax */ + EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), + STACK_VAR(r0[0])); + break; + } + /* STX XADD: lock *(u32 *)(dst + off) += src */ + case BPF_STX | BPF_XADD | BPF_W: + /* STX XADD: lock *(u64 *)(dst + off) += src */ + case BPF_STX | BPF_XADD | BPF_DW: + goto notyet; + case BPF_JMP | BPF_EXIT: + if (seen_exit) { + jmp_offset = ctx->cleanup_addr - addrs[i]; + goto emit_jmp; + } + seen_exit = true; + /* Update cleanup_addr */ + ctx->cleanup_addr = proglen; + emit_epilogue(&prog, bpf_prog->aux->stack_depth); + break; +notyet: + pr_info_once("*** NOT YET: opcode %02x ***\n", code); + return -EFAULT; + default: + /* + * This error will be seen if new instruction was added + * to interpreter, but not to JIT or if there is junk in + * bpf_prog + */ + pr_err("bpf_jit: unknown opcode %02x\n", code); + return -EINVAL; + } + + ilen = prog - temp; + if (ilen > BPF_MAX_INSN_SIZE) { + pr_err("bpf_jit: fatal insn size error\n"); + return -EFAULT; + } + + if (image) { + if (unlikely(proglen + ilen > oldproglen)) { + pr_err("bpf_jit: fatal error\n"); + return -EFAULT; + } + memcpy(image + proglen, temp, ilen); + } + proglen += ilen; + addrs[i] = proglen; + prog = temp; + } + return proglen; +} + +struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) +{ + struct bpf_binary_header *header = NULL; + struct bpf_prog *tmp, *orig_prog = prog; + int proglen, oldproglen = 0; + struct jit_context ctx = {}; + bool tmp_blinded = false; + u8 *image = NULL; + int *addrs; + int pass; + int i; + + if (!prog->jit_requested) + return orig_prog; + + tmp = bpf_jit_blind_constants(prog); + /* + * If blinding was requested and we failed during blinding, + * we must fall back to the interpreter. + */ + if (IS_ERR(tmp)) + return orig_prog; + if (tmp != prog) { + tmp_blinded = true; + prog = tmp; + } + + addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL); + if (!addrs) { + prog = orig_prog; + goto out; + } + + /* + * Before first pass, make a rough estimation of addrs[] + * each BPF instruction is translated to less than 64 bytes + */ + for (proglen = 0, i = 0; i < prog->len; i++) { + proglen += 64; + addrs[i] = proglen; + } + ctx.cleanup_addr = proglen; + + /* + * JITed image shrinks with every pass and the loop iterates + * until the image stops shrinking. Very large BPF programs + * may converge on the last pass. In such case do one more + * pass to emit the final image. 
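+	 * (The hard limit of 20 passes below bounds compile time; a
+	 * program that has not converged by then never gets an image
+	 * and is left to the interpreter.)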
+ */ + for (pass = 0; pass < 20 || image; pass++) { + proglen = do_jit(prog, addrs, image, oldproglen, &ctx); + if (proglen <= 0) { +out_image: + image = NULL; + if (header) + bpf_jit_binary_free(header); + prog = orig_prog; + goto out_addrs; + } + if (image) { + if (proglen != oldproglen) { + pr_err("bpf_jit: proglen=%d != oldproglen=%d\n", + proglen, oldproglen); + goto out_image; + } + break; + } + if (proglen == oldproglen) { + header = bpf_jit_binary_alloc(proglen, &image, + 1, jit_fill_hole); + if (!header) { + prog = orig_prog; + goto out_addrs; + } + } + oldproglen = proglen; + cond_resched(); + } + + if (bpf_jit_enable > 1) + bpf_jit_dump(prog->len, proglen, pass + 1, image); + + if (image) { + bpf_jit_binary_lock_ro(header); + prog->bpf_func = (void *)image; + prog->jited = 1; + prog->jited_len = proglen; + } else { + prog = orig_prog; + } + +out_addrs: + kfree(addrs); +out: + if (tmp_blinded) + bpf_jit_prog_release_other(prog, prog == orig_prog ? + tmp : orig_prog); + return prog; +} -- cgit v1.2.3 From b4b8faa1ded7a3bb34db374c692a51cea29f9080 Mon Sep 17 00:00:00 2001 From: Magnus Karlsson Date: Wed, 2 May 2018 13:01:36 +0200 Subject: samples/bpf: sample application and documentation for AF_XDP sockets MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This is a sample application for AF_XDP sockets. The application supports three different modes of operation: rxdrop, txonly and l2fwd. To show-case a simple round-robin load-balancing between a set of sockets in an xskmap, set the RR_LB compile time define option to 1 in "xdpsock.h". v2: The entries variable was calculated twice in {umem,xq}_nb_avail. Co-authored-by: Björn Töpel Signed-off-by: Björn Töpel Signed-off-by: Magnus Karlsson Signed-off-by: Alexei Starovoitov --- Documentation/networking/af_xdp.rst | 297 +++++++++++ Documentation/networking/index.rst | 1 + samples/bpf/Makefile | 4 + samples/bpf/xdpsock.h | 11 + samples/bpf/xdpsock_kern.c | 56 +++ samples/bpf/xdpsock_user.c | 948 ++++++++++++++++++++++++++++++++++++ 6 files changed, 1317 insertions(+) create mode 100644 Documentation/networking/af_xdp.rst create mode 100644 samples/bpf/xdpsock.h create mode 100644 samples/bpf/xdpsock_kern.c create mode 100644 samples/bpf/xdpsock_user.c (limited to 'Documentation') diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst new file mode 100644 index 000000000000..91928d9ee4bf --- /dev/null +++ b/Documentation/networking/af_xdp.rst @@ -0,0 +1,297 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +AF_XDP +====== + +Overview +======== + +AF_XDP is an address family that is optimized for high performance +packet processing. + +This document assumes that the reader is familiar with BPF and XDP. If +not, the Cilium project has an excellent reference guide at +http://cilium.readthedocs.io/en/doc-1.0/bpf/. + +Using the XDP_REDIRECT action from an XDP program, the program can +redirect ingress frames to other XDP enabled netdevs, using the +bpf_redirect_map() function. AF_XDP sockets enable the possibility for +XDP programs to redirect frames to a memory buffer in a user-space +application. + +An AF_XDP socket (XSK) is created with the normal socket() +syscall. Associated with each XSK are two rings: the RX ring and the +TX ring. A socket can receive packets on the RX ring and it can send +packets on the TX ring. These rings are registered and sized with the +setsockopts XDP_RX_RING and XDP_TX_RING, respectively. 
It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another one and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally sized frames and each frame
+has a unique frame id. A descriptor in one of the rings references a
+frame by referencing its frame id. User space allocates memory for
+this UMEM using whatever means it feels is most appropriate (malloc,
+mmap, huge pages, etc). This memory area is then registered with the
+kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
+rings: the FILL ring and the COMPLETION ring. The FILL ring is used by
+the application to send down frame ids for the kernel to fill in with
+RX packet data. References to these frames will then appear in the RX
+ring once each packet has been received. The COMPLETION ring, on the
+other hand, contains frame ids that the kernel has transmitted
+completely and that can now be used again by user space, for either TX
+or RX. Thus, the frame ids appearing in the COMPLETION ring are ids
+that were previously transmitted using the TX ring. In summary, the RX
+and FILL rings are used for the RX path and the TX and COMPLETION
+rings are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame id references in its own RX ring that point to this
+shared UMEM. Note that since the ring structures are single-consumer /
+single-producer (for performance reasons), the new process has to
+create its own socket with associated RX and TX rings, since it cannot
+share this with the other process. This is also the reason that there
+is only one set of FILL and COMPLETION rings per UMEM. It is the
+responsibility of a single process to handle the UMEM.
+
+How, then, are packets distributed from an XDP program to the XSKs?
+There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and ring number. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or if XDP_SKB is explicitly
+chosen when loading the XDP program, XDP_SKB mode is employed. This
+mode uses SKBs together with the generic XDP support and copies out
+the data to user space. It is a fallback mode that works for any
+network device. On the other hand, if the driver has support for XDP,
+it will be used by the AF_XDP code to provide better performance, but
+there is still a copy of the data into user space.
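+
+As a compact sketch of the sequence just described (error handling is
+omitted, and the names bufs, fq_size, cq_size, ndescs, rx_ring_size,
+ifindex and queue_id are placeholders; see the sample application at
+the end of this document for a complete version), setting up an XSK
+could look like this::
+
+    int sfd = socket(AF_XDP, SOCK_RAW, 0);
+
+    /* Register the UMEM and create its FILL and COMPLETION rings. */
+    struct xdp_umem_reg mr = {
+        .addr = (__u64)bufs,        /* user-allocated, page-aligned */
+        .len = NUM_FRAMES * FRAME_SIZE,
+        .frame_size = FRAME_SIZE,
+        .frame_headroom = 0,
+    };
+    setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
+    setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size, sizeof(int));
+    setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
+               sizeof(int));
+
+    /* Create the RX and TX rings and map them into user space. */
+    setsockopt(sfd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(int));
+    setsockopt(sfd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(int));
+    rx_ring = mmap(NULL, rx_ring_size, PROT_READ | PROT_WRITE,
+                   MAP_SHARED | MAP_POPULATE, sfd, XDP_PGOFF_RX_RING);
+
+    /* Bind to a device and queue id; only now does traffic flow. */
+    struct sockaddr_xdp sxdp = {
+        .sxdp_family = AF_XDP,
+        .sxdp_ifindex = ifindex,
+        .sxdp_queue_id = queue_id,
+    };
+    bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp));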
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be set up.
+
+Jonathan Corbet has also written an excellent article on LWN,
+"Accelerating networking with AF_XDP". It can be found at
+https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtually contiguous memory, divided into
+equal-sized frames. A UMEM is associated with a netdev and a specific
+queue id of that netdev. It is created and configured (frame size,
+frame headroom, start address and size) by using the XDP_UMEM_REG
+setsockopt system call. A UMEM is bound to a netdev and queue id via
+the bind() system call.
+
+An AF_XDP socket is linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share a UMEM created via one socket A,
+the next socket B can do this by setting the XDP_SHARED_UMEM flag in
+the struct sockaddr_xdp member sxdp_flags, and passing the file
+descriptor of A to the struct sockaddr_xdp member sxdp_shared_umem_fd.
+
+The UMEM has two single-producer/single-consumer rings that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are four different kinds of rings: Fill, Completion, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application needs explicit synchronization if multiple
+processes/threads are reading from/writing to them.
+
+The UMEM uses two rings: Fill and Completion. Each socket associated
+with the UMEM must have an RX queue, TX queue or both. Say that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one Fill ring, one Completion ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by the struct xdp_ring
+producer member, and then increases the producer index. A consumer
+reads the data ring at the index pointed out by the struct xdp_ring
+consumer member, and then increases the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of each ring must be a power of two.
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The Fill ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM indices are passed in the
+ring. As an example, if the UMEM is 64k and each frame is 4k, then the
+UMEM has 16 frames and can pass indices between 0 and 15.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM indices to this ring.
+
+UMEM Completion Ring
+~~~~~~~~~~~~~~~~~~~~
+
+The Completion ring is used to transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the Fill ring, UMEM indices are
+used.
+
+Frames passed from the kernel to user-space are frames that have been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM indices from this ring.
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains the UMEM
+index (idx), the length of the data (len) and the offset into the
+frame (offset).
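+
+The authoritative definition of struct xdp_desc lives in the if_xdp.h
+uapi header; conceptually (only the fields discussed in this document
+are shown, the remaining fields are omitted) it carries the following
+information::
+
+    struct xdp_desc {
+        __u32 idx;    /* frame id of the UMEM frame */
+        __u32 len;    /* length of the packet data */
+        __u16 offset; /* offset into the frame */
+        /* remaining fields omitted */
+    };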
+
+If no frames have been passed to the kernel via the Fill ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer a sendmsg() system call is required. This
+requirement might be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+----------------------------
+
+On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
+that is used in conjunction with bpf_redirect_map() to pass the
+ingress frame to a socket.
+
+The user application inserts the socket into the map via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. E.g. if an AF_XDP socket is bound to netdev eth0 and queue
+17, only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (samples/bpf/) for an example.
+
+Usage
+=====
+
+In order to use AF_XDP sockets, two parts are needed: the user-space
+application and the XDP program. For a complete setup and usage
+example, please refer to the sample application. The user-space side
+is xdpsock_user.c and the XDP side is xdpsock_kern.c.
+
+Naive ring dequeue and enqueue could look like this::
+
+    // typedef struct xdp_rxtx_ring RING;
+    // typedef struct xdp_umem_ring RING;
+
+    // typedef struct xdp_desc RING_TYPE;
+    // typedef __u32 RING_TYPE;
+
+    int dequeue_one(RING *ring, RING_TYPE *item)
+    {
+        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+
+        if (entries == 0)
+            return -1;
+
+        // read-barrier!
+
+        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
+        ring->ptrs.consumer++;
+        return 0;
+    }
+
+    int enqueue_one(RING *ring, const RING_TYPE *item)
+    {
+        __u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+
+        if (free_entries == 0)
+            return -1;
+
+        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+
+        // write-barrier!
+
+        ring->ptrs.producer++;
+        return 0;
+    }
+
+For a more optimized version, please refer to the sample application.
+
+Sample application
+==================
+
+There is an xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with both private and shared
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, which we will enable AF_XDP on. Here, we use ethtool
+for this::
+
+      ethtool -N p3p2 rx-flow-hash udp4 fn
+      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+          action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+      samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
+can be displayed with "-h", as usual.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
+
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index f204eaff657d..cbd9bdd4a79e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
 .. toctree::
    :maxdepth: 2
 
+   af_xdp
    batman-adv
    can
    dpaa2/index
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5e31770ac087..8e0c7fb6d7cc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -98,6 +99,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..533ab81adfa1
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..d8806c41362e
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB!
RR_LB is configured in xdpsock.h */ + rr = bpf_map_lookup_elem(&rr_map, &key); + if (!rr) + return XDP_ABORTED; + + *rr = (*rr + 1) & (MAX_SOCKS - 1); + idx = *rr; +#else + idx = 0; +#endif + + return bpf_redirect_map(&xsks_map, idx, 0); +} + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c new file mode 100644 index 000000000000..4b8a7cf3e63b --- /dev/null +++ b/samples/bpf/xdpsock_user.c @@ -0,0 +1,948 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2017 - 2018 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "bpf_load.h" +#include "bpf_util.h" +#include "libbpf.h" + +#include "xdpsock.h" + +#ifndef SOL_XDP +#define SOL_XDP 283 +#endif + +#ifndef AF_XDP +#define AF_XDP 44 +#endif + +#ifndef PF_XDP +#define PF_XDP AF_XDP +#endif + +#define NUM_FRAMES 131072 +#define FRAME_HEADROOM 0 +#define FRAME_SIZE 2048 +#define NUM_DESCS 1024 +#define BATCH_SIZE 16 + +#define FQ_NUM_DESCS 1024 +#define CQ_NUM_DESCS 1024 + +#define DEBUG_HEXDUMP 0 + +typedef __u32 u32; + +static unsigned long prev_time; + +enum benchmark_type { + BENCH_RXDROP = 0, + BENCH_TXONLY = 1, + BENCH_L2FWD = 2, +}; + +static enum benchmark_type opt_bench = BENCH_RXDROP; +static u32 opt_xdp_flags; +static const char *opt_if = ""; +static int opt_ifindex; +static int opt_queue; +static int opt_poll; +static int opt_shared_packet_buffer; +static int opt_interval = 1; + +struct xdp_umem_uqueue { + u32 cached_prod; + u32 cached_cons; + u32 mask; + u32 size; + struct xdp_umem_ring *ring; +}; + +struct xdp_umem { + char (*frames)[FRAME_SIZE]; + struct xdp_umem_uqueue fq; + struct xdp_umem_uqueue cq; + int fd; +}; + +struct xdp_uqueue { + u32 cached_prod; + u32 cached_cons; + u32 mask; + u32 size; + struct xdp_rxtx_ring *ring; +}; + +struct xdpsock { + struct xdp_uqueue rx; + struct xdp_uqueue tx; + int sfd; + struct xdp_umem *umem; + u32 outstanding_tx; + unsigned long rx_npkts; + unsigned long tx_npkts; + unsigned long prev_rx_npkts; + unsigned long prev_tx_npkts; +}; + +#define MAX_SOCKS 4 +static int num_socks; +struct xdpsock *xsks[MAX_SOCKS]; + +static unsigned long get_nsecs(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return ts.tv_sec * 1000000000UL + ts.tv_nsec; +} + +static void dump_stats(void); + +#define lassert(expr) \ + do { \ + if (!(expr)) { \ + fprintf(stderr, "%s:%s:%i: Assertion failed: " \ + #expr ": errno: %d/\"%s\"\n", \ + __FILE__, __func__, __LINE__, \ + errno, strerror(errno)); \ + dump_stats(); \ + exit(EXIT_FAILURE); \ + } \ + } while (0) + +#define barrier() __asm__ __volatile__("": : :"memory") +#define u_smp_rmb() barrier() +#define u_smp_wmb() barrier() +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) + +static const char pkt_data[] = + 
"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00" + "\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14" + "\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b" + "\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa"; + +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb) +{ + u32 free_entries = q->size - (q->cached_prod - q->cached_cons); + + if (free_entries >= nb) + return free_entries; + + /* Refresh the local tail pointer */ + q->cached_cons = q->ring->ptrs.consumer; + + return q->size - (q->cached_prod - q->cached_cons); +} + +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs) +{ + u32 free_entries = q->cached_cons - q->cached_prod; + + if (free_entries >= ndescs) + return free_entries; + + /* Refresh the local tail pointer */ + q->cached_cons = q->ring->ptrs.consumer + q->size; + return q->cached_cons - q->cached_prod; +} + +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb) +{ + u32 entries = q->cached_prod - q->cached_cons; + + if (entries == 0) { + q->cached_prod = q->ring->ptrs.producer; + entries = q->cached_prod - q->cached_cons; + } + + return (entries > nb) ? nb : entries; +} + +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs) +{ + u32 entries = q->cached_prod - q->cached_cons; + + if (entries == 0) { + q->cached_prod = q->ring->ptrs.producer; + entries = q->cached_prod - q->cached_cons; + } + + return (entries > ndescs) ? ndescs : entries; +} + +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq, + struct xdp_desc *d, + size_t nb) +{ + u32 i; + + if (umem_nb_free(fq, nb) < nb) + return -ENOSPC; + + for (i = 0; i < nb; i++) { + u32 idx = fq->cached_prod++ & fq->mask; + + fq->ring->desc[idx] = d[i].idx; + } + + u_smp_wmb(); + + fq->ring->ptrs.producer = fq->cached_prod; + + return 0; +} + +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d, + size_t nb) +{ + u32 i; + + if (umem_nb_free(fq, nb) < nb) + return -ENOSPC; + + for (i = 0; i < nb; i++) { + u32 idx = fq->cached_prod++ & fq->mask; + + fq->ring->desc[idx] = d[i]; + } + + u_smp_wmb(); + + fq->ring->ptrs.producer = fq->cached_prod; + + return 0; +} + +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq, + u32 *d, size_t nb) +{ + u32 idx, i, entries = umem_nb_avail(cq, nb); + + u_smp_rmb(); + + for (i = 0; i < entries; i++) { + idx = cq->cached_cons++ & cq->mask; + d[i] = cq->ring->desc[idx]; + } + + if (entries > 0) { + u_smp_wmb(); + + cq->ring->ptrs.consumer = cq->cached_cons; + } + + return entries; +} + +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off) +{ + lassert(idx < NUM_FRAMES); + return &xsk->umem->frames[idx][off]; +} + +static inline int xq_enq(struct xdp_uqueue *uq, + const struct xdp_desc *descs, + unsigned int ndescs) +{ + struct xdp_rxtx_ring *r = uq->ring; + unsigned int i; + + if (xq_nb_free(uq, ndescs) < ndescs) + return -ENOSPC; + + for (i = 0; i < ndescs; i++) { + u32 idx = uq->cached_prod++ & uq->mask; + + r->desc[idx].idx = descs[i].idx; + r->desc[idx].len = descs[i].len; + r->desc[idx].offset = descs[i].offset; + } + + u_smp_wmb(); + + r->ptrs.producer = uq->cached_prod; + return 0; +} + +static inline int xq_enq_tx_only(struct xdp_uqueue *uq, + __u32 idx, unsigned int ndescs) +{ + struct xdp_rxtx_ring *q = uq->ring; + unsigned int i; + + if (xq_nb_free(uq, ndescs) < ndescs) + return -ENOSPC; + + for (i = 0; i < ndescs; i++) { + u32 idx = uq->cached_prod++ & uq->mask; + + q->desc[idx].idx = idx + i; + 
q->desc[idx].len = sizeof(pkt_data) - 1; + q->desc[idx].offset = 0; + } + + u_smp_wmb(); + + q->ptrs.producer = uq->cached_prod; + return 0; +} + +static inline int xq_deq(struct xdp_uqueue *uq, + struct xdp_desc *descs, + int ndescs) +{ + struct xdp_rxtx_ring *r = uq->ring; + unsigned int idx; + int i, entries; + + entries = xq_nb_avail(uq, ndescs); + + u_smp_rmb(); + + for (i = 0; i < entries; i++) { + idx = uq->cached_cons++ & uq->mask; + descs[i] = r->desc[idx]; + } + + if (entries > 0) { + u_smp_wmb(); + + r->ptrs.consumer = uq->cached_cons; + } + + return entries; +} + +static void swap_mac_addresses(void *data) +{ + struct ether_header *eth = (struct ether_header *)data; + struct ether_addr *src_addr = (struct ether_addr *)ð->ether_shost; + struct ether_addr *dst_addr = (struct ether_addr *)ð->ether_dhost; + struct ether_addr tmp; + + tmp = *src_addr; + *src_addr = *dst_addr; + *dst_addr = tmp; +} + +#if DEBUG_HEXDUMP +static void hex_dump(void *pkt, size_t length, const char *prefix) +{ + int i = 0; + const unsigned char *address = (unsigned char *)pkt; + const unsigned char *line = address; + size_t line_size = 32; + unsigned char c; + + printf("length = %zu\n", length); + printf("%s | ", prefix); + while (length-- > 0) { + printf("%02X ", *address++); + if (!(++i % line_size) || (length == 0 && i % line_size)) { + if (length == 0) { + while (i++ % line_size) + printf("__ "); + } + printf(" | "); /* right close */ + while (line < address) { + c = *line++; + printf("%c", (c < 33 || c == 255) ? 0x2E : c); + } + printf("\n"); + if (length > 0) + printf("%s | ", prefix); + } + } + printf("\n"); +} +#endif + +static size_t gen_eth_frame(char *frame) +{ + memcpy(frame, pkt_data, sizeof(pkt_data) - 1); + return sizeof(pkt_data) - 1; +} + +static struct xdp_umem *xdp_umem_configure(int sfd) +{ + int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS; + struct xdp_umem_reg mr; + struct xdp_umem *umem; + void *bufs; + + umem = calloc(1, sizeof(*umem)); + lassert(umem); + + lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */ + NUM_FRAMES * FRAME_SIZE) == 0); + + mr.addr = (__u64)bufs; + mr.len = NUM_FRAMES * FRAME_SIZE; + mr.frame_size = FRAME_SIZE; + mr.frame_headroom = FRAME_HEADROOM; + + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size, + sizeof(int)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size, + sizeof(int)) == 0); + + umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) + + FQ_NUM_DESCS * sizeof(u32), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_UMEM_PGOFF_FILL_RING); + lassert(umem->fq.ring != MAP_FAILED); + + umem->fq.mask = FQ_NUM_DESCS - 1; + umem->fq.size = FQ_NUM_DESCS; + + umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) + + CQ_NUM_DESCS * sizeof(u32), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_UMEM_PGOFF_COMPLETION_RING); + lassert(umem->cq.ring != MAP_FAILED); + + umem->cq.mask = CQ_NUM_DESCS - 1; + umem->cq.size = CQ_NUM_DESCS; + + umem->frames = (char (*)[FRAME_SIZE])bufs; + umem->fd = sfd; + + if (opt_bench == BENCH_TXONLY) { + int i; + + for (i = 0; i < NUM_FRAMES; i++) + (void)gen_eth_frame(&umem->frames[i][0]); + } + + return umem; +} + +static struct xdpsock *xsk_configure(struct xdp_umem *umem) +{ + struct sockaddr_xdp sxdp = {}; + int sfd, ndescs = NUM_DESCS; + struct xdpsock *xsk; + bool shared = true; + u32 i; + + sfd = socket(PF_XDP, SOCK_RAW, 0); + lassert(sfd >= 0); + + xsk = calloc(1, 
sizeof(*xsk)); + lassert(xsk); + + xsk->sfd = sfd; + xsk->outstanding_tx = 0; + + if (!umem) { + shared = false; + xsk->umem = xdp_umem_configure(sfd); + } else { + xsk->umem = umem; + } + + lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING, + &ndescs, sizeof(int)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING, + &ndescs, sizeof(int)) == 0); + + /* Rx */ + xsk->rx.ring = mmap(NULL, + sizeof(struct xdp_ring) + + NUM_DESCS * sizeof(struct xdp_desc), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_PGOFF_RX_RING); + lassert(xsk->rx.ring != MAP_FAILED); + + if (!shared) { + for (i = 0; i < NUM_DESCS / 2; i++) + lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1) + == 0); + } + + /* Tx */ + xsk->tx.ring = mmap(NULL, + sizeof(struct xdp_ring) + + NUM_DESCS * sizeof(struct xdp_desc), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_PGOFF_TX_RING); + lassert(xsk->tx.ring != MAP_FAILED); + + xsk->rx.mask = NUM_DESCS - 1; + xsk->rx.size = NUM_DESCS; + + xsk->tx.mask = NUM_DESCS - 1; + xsk->tx.size = NUM_DESCS; + + sxdp.sxdp_family = PF_XDP; + sxdp.sxdp_ifindex = opt_ifindex; + sxdp.sxdp_queue_id = opt_queue; + if (shared) { + sxdp.sxdp_flags = XDP_SHARED_UMEM; + sxdp.sxdp_shared_umem_fd = umem->fd; + } + + lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0); + + return xsk; +} + +static void print_benchmark(bool running) +{ + const char *bench_str = "INVALID"; + + if (opt_bench == BENCH_RXDROP) + bench_str = "rxdrop"; + else if (opt_bench == BENCH_TXONLY) + bench_str = "txonly"; + else if (opt_bench == BENCH_L2FWD) + bench_str = "l2fwd"; + + printf("%s:%d %s ", opt_if, opt_queue, bench_str); + if (opt_xdp_flags & XDP_FLAGS_SKB_MODE) + printf("xdp-skb "); + else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE) + printf("xdp-drv "); + else + printf(" "); + + if (opt_poll) + printf("poll() "); + + if (running) { + printf("running..."); + fflush(stdout); + } +} + +static void dump_stats(void) +{ + unsigned long now = get_nsecs(); + long dt = now - prev_time; + int i; + + prev_time = now; + + for (i = 0; i < num_socks; i++) { + char *fmt = "%-15s %'-11.0f %'-11lu\n"; + double rx_pps, tx_pps; + + rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) * + 1000000000. / dt; + tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) * + 1000000000. 
/ dt; + + printf("\n sock%d@", i); + print_benchmark(false); + printf("\n"); + + printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts", + dt / 1000000000.); + printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts); + printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts); + + xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts; + xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts; + } +} + +static void *poller(void *arg) +{ + (void)arg; + for (;;) { + sleep(opt_interval); + dump_stats(); + } + + return NULL; +} + +static void int_exit(int sig) +{ + (void)sig; + dump_stats(); + bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags); + exit(EXIT_SUCCESS); +} + +static struct option long_options[] = { + {"rxdrop", no_argument, 0, 'r'}, + {"txonly", no_argument, 0, 't'}, + {"l2fwd", no_argument, 0, 'l'}, + {"interface", required_argument, 0, 'i'}, + {"queue", required_argument, 0, 'q'}, + {"poll", no_argument, 0, 'p'}, + {"shared-buffer", no_argument, 0, 's'}, + {"xdp-skb", no_argument, 0, 'S'}, + {"xdp-native", no_argument, 0, 'N'}, + {"interval", required_argument, 0, 'n'}, + {0, 0, 0, 0} +}; + +static void usage(const char *prog) +{ + const char *str = + " Usage: %s [OPTIONS]\n" + " Options:\n" + " -r, --rxdrop Discard all incoming packets (default)\n" + " -t, --txonly Only send packets\n" + " -l, --l2fwd MAC swap L2 forwarding\n" + " -i, --interface=n Run on interface n\n" + " -q, --queue=n Use queue n (default 0)\n" + " -p, --poll Use poll syscall\n" + " -s, --shared-buffer Use shared packet buffer\n" + " -S, --xdp-skb=n Use XDP skb-mod\n" + " -N, --xdp-native=n Enfore XDP native mode\n" + " -n, --interval=n Specify statistics update interval (default 1 sec).\n" + "\n"; + fprintf(stderr, str, prog); + exit(EXIT_FAILURE); +} + +static void parse_command_line(int argc, char **argv) +{ + int option_index, c; + + opterr = 0; + + for (;;) { + c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options, + &option_index); + if (c == -1) + break; + + switch (c) { + case 'r': + opt_bench = BENCH_RXDROP; + break; + case 't': + opt_bench = BENCH_TXONLY; + break; + case 'l': + opt_bench = BENCH_L2FWD; + break; + case 'i': + opt_if = optarg; + break; + case 'q': + opt_queue = atoi(optarg); + break; + case 's': + opt_shared_packet_buffer = 1; + break; + case 'p': + opt_poll = 1; + break; + case 'S': + opt_xdp_flags |= XDP_FLAGS_SKB_MODE; + break; + case 'N': + opt_xdp_flags |= XDP_FLAGS_DRV_MODE; + break; + case 'n': + opt_interval = atoi(optarg); + break; + default: + usage(basename(argv[0])); + } + } + + opt_ifindex = if_nametoindex(opt_if); + if (!opt_ifindex) { + fprintf(stderr, "ERROR: interface \"%s\" does not exist\n", + opt_if); + usage(basename(argv[0])); + } +} + +static void kick_tx(int fd) +{ + int ret; + + ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0); + if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN) + return; + lassert(0); +} + +static inline void complete_tx_l2fwd(struct xdpsock *xsk) +{ + u32 descs[BATCH_SIZE]; + unsigned int rcvd; + size_t ndescs; + + if (!xsk->outstanding_tx) + return; + + kick_tx(xsk->sfd); + ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? 
BATCH_SIZE : + xsk->outstanding_tx; + + /* re-add completed Tx buffers */ + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs); + if (rcvd > 0) { + umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd); + xsk->outstanding_tx -= rcvd; + xsk->tx_npkts += rcvd; + } +} + +static inline void complete_tx_only(struct xdpsock *xsk) +{ + u32 descs[BATCH_SIZE]; + unsigned int rcvd; + + if (!xsk->outstanding_tx) + return; + + kick_tx(xsk->sfd); + + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE); + if (rcvd > 0) { + xsk->outstanding_tx -= rcvd; + xsk->tx_npkts += rcvd; + } +} + +static void rx_drop(struct xdpsock *xsk) +{ + struct xdp_desc descs[BATCH_SIZE]; + unsigned int rcvd, i; + + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE); + if (!rcvd) + return; + + for (i = 0; i < rcvd; i++) { + u32 idx = descs[i].idx; + + lassert(idx < NUM_FRAMES); +#if DEBUG_HEXDUMP + char *pkt; + char buf[32]; + + pkt = xq_get_data(xsk, idx, descs[i].offset); + sprintf(buf, "idx=%d", idx); + hex_dump(pkt, descs[i].len, buf); +#endif + } + + xsk->rx_npkts += rcvd; + + umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd); +} + +static void rx_drop_all(void) +{ + struct pollfd fds[MAX_SOCKS + 1]; + int i, ret, timeout, nfds = 1; + + memset(fds, 0, sizeof(fds)); + + for (i = 0; i < num_socks; i++) { + fds[i].fd = xsks[i]->sfd; + fds[i].events = POLLIN; + timeout = 1000; /* 1sn */ + } + + for (;;) { + if (opt_poll) { + ret = poll(fds, nfds, timeout); + if (ret <= 0) + continue; + } + + for (i = 0; i < num_socks; i++) + rx_drop(xsks[i]); + } +} + +static void tx_only(struct xdpsock *xsk) +{ + int timeout, ret, nfds = 1; + struct pollfd fds[nfds + 1]; + unsigned int idx = 0; + + memset(fds, 0, sizeof(fds)); + fds[0].fd = xsk->sfd; + fds[0].events = POLLOUT; + timeout = 1000; /* 1sn */ + + for (;;) { + if (opt_poll) { + ret = poll(fds, nfds, timeout); + if (ret <= 0) + continue; + + if (fds[0].fd != xsk->sfd || + !(fds[0].revents & POLLOUT)) + continue; + } + + if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) { + lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0); + + xsk->outstanding_tx += BATCH_SIZE; + idx += BATCH_SIZE; + idx %= NUM_FRAMES; + } + + complete_tx_only(xsk); + } +} + +static void l2fwd(struct xdpsock *xsk) +{ + for (;;) { + struct xdp_desc descs[BATCH_SIZE]; + unsigned int rcvd, i; + int ret; + + for (;;) { + complete_tx_l2fwd(xsk); + + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE); + if (rcvd > 0) + break; + } + + for (i = 0; i < rcvd; i++) { + char *pkt = xq_get_data(xsk, descs[i].idx, + descs[i].offset); + + swap_mac_addresses(pkt); +#if DEBUG_HEXDUMP + char buf[32]; + u32 idx = descs[i].idx; + + sprintf(buf, "idx=%d", idx); + hex_dump(pkt, descs[i].len, buf); +#endif + } + + xsk->rx_npkts += rcvd; + + ret = xq_enq(&xsk->tx, descs, rcvd); + lassert(ret == 0); + xsk->outstanding_tx += rcvd; + } +} + +int main(int argc, char **argv) +{ + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + char xdp_filename[256]; + int i, ret, key = 0; + pthread_t pt; + + parse_command_line(argc, argv); + + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n", + strerror(errno)); + exit(EXIT_FAILURE); + } + + snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(xdp_filename)) { + fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf); + exit(EXIT_FAILURE); + } + + if (!prog_fd[0]) { + fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n", + strerror(errno)); + exit(EXIT_FAILURE); + } + + if 
(bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) { + fprintf(stderr, "ERROR: link set xdp fd failed\n"); + exit(EXIT_FAILURE); + } + + ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0); + if (ret) { + fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n"); + exit(EXIT_FAILURE); + } + + /* Create sockets... */ + xsks[num_socks++] = xsk_configure(NULL); + +#if RR_LB + for (i = 0; i < MAX_SOCKS - 1; i++) + xsks[num_socks++] = xsk_configure(xsks[0]->umem); +#endif + + /* ...and insert them into the map. */ + for (i = 0; i < num_socks; i++) { + key = i; + ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0); + if (ret) { + fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i); + exit(EXIT_FAILURE); + } + } + + signal(SIGINT, int_exit); + signal(SIGTERM, int_exit); + signal(SIGABRT, int_exit); + + setlocale(LC_ALL, ""); + + ret = pthread_create(&pt, NULL, poller, NULL); + lassert(ret == 0); + + prev_time = get_nsecs(); + + if (opt_bench == BENCH_RXDROP) + rx_drop_all(); + else if (opt_bench == BENCH_TXONLY) + tx_only(xsks[0]); + else + l2fwd(xsks[0]); + + return 0; +} -- cgit v1.2.3 From 53a7bdfb2a2756cce8003b90817f8a6fb4d830d9 Mon Sep 17 00:00:00 2001 From: Fabio Estevam Date: Mon, 7 May 2018 09:17:51 -0300 Subject: dt-bindings: dsa: Remove unnecessary #address/#size-cells If the example binding is used on a real dts file, the following DTC warning is seen with W=1: arch/arm/boot/dts/imx6q-b450v3.dtb: Warning (avoid_unnecessary_addr_size): /mdio-gpio/switch@0: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property Remove unnecessary #address-cells/#size-cells to improve the binding document examples. Signed-off-by: Fabio Estevam Reviewed-by: Rob Herring Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dsa/dsa.txt | 6 ------ 1 file changed, 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt index cfe8f64eca4f..3ceeb8de1196 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.txt +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -82,8 +82,6 @@ linked into one DSA cluster. switch0: switch0@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 0>; @@ -135,8 +133,6 @@ linked into one DSA cluster. switch1: switch1@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 1>; @@ -204,8 +200,6 @@ linked into one DSA cluster. switch2: switch2@0 { compatible = "marvell,mv88e6085"; - #address-cells = <1>; - #size-cells = <0>; reg = <0>; dsa,member = <0 2>; -- cgit v1.2.3 From 68625b7631e040707b24197451a475a3e9197e2a Mon Sep 17 00:00:00 2001 From: Wang YanQing Date: Thu, 10 May 2018 11:09:21 +0800 Subject: bpf, doc: clarification for the meaning of 'id' For me, as a reader whose mother language isn't English, the old words bring a little difficulty to catch the meaning, this patch rewords the subsection in a more clarificatory way. This patch also add blank lines as separator at two places to improve readability. 
Signed-off-by: Wang YanQing Signed-off-by: Daniel Borkmann --- Documentation/networking/filter.txt | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 5032e1263bc9..e6b4ebb2b243 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1142,6 +1142,7 @@ into a register from memory, the register's top 56 bits are known zero, while the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; 0x1ff), because of potential carries. + Besides arithmetic, the register state can also be updated by conditional branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' @@ -1150,14 +1151,16 @@ BPF_JSGE) would instead update the signed minimum/maximum values. Information from the signed and unsigned bounds can be combined; for instance if a value is first tested < 8 and then tested s> 4, the verifier will conclude that the value is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. + PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all pointers sharing that same variable offset. This is important for packet range -checks: after adding some variable to a packet pointer, if you then copy it to -another register and (say) add a constant 4, both registers will share the same -'id' but one will have a fixed offset of +4. Then if it is bounds-checked and -found to be less than a PTR_TO_PACKET_END, the other register is now known to -have a safe range of at least 4 bytes. See 'Direct packet access', below, for -more on PTR_TO_PACKET ranges. +checks: after adding a variable to a packet pointer register A, if you then copy +it to another register B and then add a constant 4 to A, both registers will +share the same 'id' but the A will have a fixed offset of +4. Then if A is +bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is +now known to have a safe range of at least 4 bytes. See 'Direct packet access', +below, for more on PTR_TO_PACKET ranges. + The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of the pointer returned from a map lookup. This means that when one copy is checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. -- cgit v1.2.3 From 804c1466c76717f48ee2e0f2e78ed24810a9818e Mon Sep 17 00:00:00 2001 From: Tonghao Zhang Date: Fri, 11 May 2018 02:53:12 -0700 Subject: net: doc: fix spelling mistake: "modrobe.d" -> "modprobe.d" Signed-off-by: Tonghao Zhang Signed-off-by: David S. Miller --- Documentation/networking/bonding.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 9ba04c0bab8d..c13214d073a4 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -140,7 +140,7 @@ bonding module at load time, or are specified via sysfs. Module options may be given as command line arguments to the insmod or modprobe command, but are usually specified in either the -/etc/modrobe.d/*.conf configuration files, or in a distro-specific +/etc/modprobe.d/*.conf configuration files, or in a distro-specific configuration file (some of which are detailed in the next section). 
Details on bonding support for sysfs is provided in the -- cgit v1.2.3 From a4a78a97ee4bccb865006015340905c90b38cd8f Mon Sep 17 00:00:00 2001 From: Chen-Yu Tsai Date: Mon, 14 May 2018 03:14:18 +0800 Subject: dt-bindings: net: dwmac-sun8i: Clean up clock delay chain descriptions The clock delay chains found in the glue layer for dwmac-sun8i are only used with RGMII PHYs. They are not intended for non-RGMII PHYs, such as MII external PHYs or the internal PHY. Also, a recent SoC has a smaller range of possible values for the delay chain. This patch reformats the delay chain section of the device tree binding to make it clear that the delay chains only apply to RGMII PHYs, and make it easier to add the R40-specific bits later. Signed-off-by: Chen-Yu Tsai Reviewed-by: Rob Herring Acked-by: Maxime Ripard Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt index 3d6d5fa0c4d5..e04ce75e24a3 100644 --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -28,10 +28,13 @@ Required properties: - allwinner,sun8i-a83t-system-controller Optional properties: -- allwinner,tx-delay-ps: TX clock delay chain value in ps. Range value is 0-700. Default is 0) -- allwinner,rx-delay-ps: RX clock delay chain value in ps. Range value is 0-3100. Default is 0) -Both delay properties need to be a multiple of 100. They control the delay for -external PHY. +- allwinner,tx-delay-ps: TX clock delay chain value in ps. + Range is 0-700. Default is 0. +- allwinner,rx-delay-ps: RX clock delay chain value in ps. + Range is 0-3100. Default is 0. +Both delay properties need to be a multiple of 100. They control the +clock delay for external RGMII PHY. They do not apply to the internal +PHY or external non-RGMII PHYs. Optional properties for the following compatibles: - "allwinner,sun8i-h3-emac", -- cgit v1.2.3 From 9ed3fec3c336b71d532aaeda8d3239246aa43d61 Mon Sep 17 00:00:00 2001 From: Chen-Yu Tsai Date: Mon, 14 May 2018 03:14:19 +0800 Subject: dt-bindings: net: dwmac-sun8i: Sort syscon compatibles by alphabetical order The A83T syscon compatible was appended to the syscon compatibles list, instead of inserted in to preserve the ordering. Move it to the proper place to keep the list sorted. Signed-off-by: Chen-Yu Tsai Reviewed-by: Rob Herring Acked-by: Maxime Ripard Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt index e04ce75e24a3..1b8e33e71651 100644 --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -22,10 +22,10 @@ Required properties: - #size-cells: shall be 0 - syscon: A phandle to the syscon of the SoC with one of the following compatible string: + - allwinner,sun8i-a83t-system-controller - allwinner,sun8i-h3-system-controller - allwinner,sun8i-v3s-system-controller - allwinner,sun50i-a64-system-controller - - allwinner,sun8i-a83t-system-controller Optional properties: - allwinner,tx-delay-ps: TX clock delay chain value in ps. 
-- cgit v1.2.3 From a6fe692e6eb554eb6f9e097142c7b7099edd203f Mon Sep 17 00:00:00 2001 From: Chen-Yu Tsai Date: Mon, 14 May 2018 03:14:20 +0800 Subject: dt-bindings: net: dwmac-sun8i: simplify description of syscon property The syscon property is used to point to the device that holds the glue layer control register known as the "EMAC (or GMAC) clock register". We do not need to explicitly list what compatible strings are needed, as this information is readily available in the user manuals. Also the "syscon" device type is more of an implementation detail. There are many ways to access a register not in a device's address range, the syscon interface being the most generic and unrestricted one. Simplify the description so that it says what it is supposed to describe. Signed-off-by: Chen-Yu Tsai Reviewed-by: Rob Herring Acked-by: Maxime Ripard Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt index 1b8e33e71651..1c0906a5c02b 100644 --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -20,12 +20,7 @@ Required properties: - phy-handle: See ethernet.txt - #address-cells: shall be 1 - #size-cells: shall be 0 -- syscon: A phandle to the syscon of the SoC with one of the following - compatible string: - - allwinner,sun8i-a83t-system-controller - - allwinner,sun8i-h3-system-controller - - allwinner,sun8i-v3s-system-controller - - allwinner,sun50i-a64-system-controller +- syscon: A phandle to the device containing the EMAC or GMAC clock register Optional properties: - allwinner,tx-delay-ps: TX clock delay chain value in ps. -- cgit v1.2.3 From eef8811d9219d197d158cda9233ce2f78ea0a790 Mon Sep 17 00:00:00 2001 From: Chen-Yu Tsai Date: Mon, 14 May 2018 03:14:21 +0800 Subject: dt-bindings: net: dwmac-sun8i: Add binding for GMAC on Allwinner R40 SoC The Allwinner R40 SoC has the EMAC controller supported by dwmac-sun8i. It is named "GMAC", while EMAC refers to the 10/100 Mbps Ethernet controller supported by sun4i-emac. The controller is the same, but the R40 has the glue layer controls in the clock control unit (CCU), with a reduced RX delay chain, and no TX delay chain. This patch adds the R40 specific bits to the dwmac-sun8i binding. Signed-off-by: Chen-Yu Tsai Reviewed-by: Rob Herring Acked-by: Maxime Ripard Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt index 1c0906a5c02b..cfe724398a12 100644 --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -7,6 +7,7 @@ Required properties: - compatible: must be one of the following string: "allwinner,sun8i-a83t-emac" "allwinner,sun8i-h3-emac" + "allwinner,sun8i-r40-gmac" "allwinner,sun8i-v3s-emac" "allwinner,sun50i-a64-emac" - reg: address and length of the register for the device. @@ -25,8 +26,10 @@ Required properties: Optional properties: - allwinner,tx-delay-ps: TX clock delay chain value in ps. Range is 0-700. Default is 0. + Unavailable for allwinner,sun8i-r40-gmac - allwinner,rx-delay-ps: RX clock delay chain value in ps. Range is 0-3100. 
Default is 0. + Range is 0-700 for allwinner,sun8i-r40-gmac Both delay properties need to be a multiple of 100. They control the clock delay for external RGMII PHY. They do not apply to the internal PHY or external non-RGMII PHYs. -- cgit v1.2.3 From 4712c1b203dd04d9b0f3139000bc4bfbb94b0c05 Mon Sep 17 00:00:00 2001 From: Jesper Dangaard Brouer Date: Mon, 14 May 2018 15:42:12 +0200 Subject: bpf, doc: add basic README.rst file A README.rst file in a directory have special meaning for sites like github, which auto renders the contents. Plus search engines like Google also index these README.rst files. Auto rendering allow us to use links, for (re)directing eBPF users to other places where docs live. The end-goal would be to direct users towards https://www.kernel.org/doc/html/latest but we haven't written the full docs yet, so we start out small and take this incrementally. This directory itself contains some useful docs, which can be linked to from the README.rst file (verified this works for github). Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexei Starovoitov --- Documentation/bpf/README.rst | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) create mode 100644 Documentation/bpf/README.rst (limited to 'Documentation') diff --git a/Documentation/bpf/README.rst b/Documentation/bpf/README.rst new file mode 100644 index 000000000000..329469c33db8 --- /dev/null +++ b/Documentation/bpf/README.rst @@ -0,0 +1,36 @@ +================= +BPF documentation +================= + +This directory contains documentation for the BPF (Berkeley Packet +Filter) facility, with a focus on the extended BPF version (eBPF). + +This kernel side documentation is still work in progress. The main +textual documentation is (for historical reasons) described in +`Documentation/networking/filter.txt`_, which describe both classical +and extended BPF instruction-set. +The Cilium project also maintains a `BPF and XDP Reference Guide`_ +that goes into great technical depth about the BPF Architecture. + +The primary info for the bpf syscall is available in the `man-pages`_ +for `bpf(2)`_. + + + +Frequently asked questions (FAQ) +================================ + +Two sets of Questions and Answers (Q&A) are maintained. + +* QA for common questions about BPF see: bpf_design_QA_ + +* QA for developers interacting with BPF subsystem: bpf_devel_QA_ + + +.. Links: +.. _bpf_design_QA: bpf_design_QA.txt +.. _bpf_devel_QA: bpf_devel_QA.txt +.. _Documentation/networking/filter.txt: ../networking/filter.txt +.. _man-pages: https://www.kernel.org/doc/man-pages/ +.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html +.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/ -- cgit v1.2.3 From 192092faa02dd5e5d1ff875d7512a5d803db95a0 Mon Sep 17 00:00:00 2001 From: Jesper Dangaard Brouer Date: Mon, 14 May 2018 15:42:17 +0200 Subject: bpf, doc: rename txt files to rst files This will cause them to get auto rendered, e.g. when viewing them on GitHub. Followup patches will correct the content to be RST compliant. Also adjust README.rst to point to the renamed files. 
Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexei Starovoitov --- Documentation/bpf/README.rst | 4 +- Documentation/bpf/bpf_design_QA.rst | 156 ++++++++++ Documentation/bpf/bpf_design_QA.txt | 156 ---------- Documentation/bpf/bpf_devel_QA.rst | 570 ++++++++++++++++++++++++++++++++++++ Documentation/bpf/bpf_devel_QA.txt | 570 ------------------------------------ 5 files changed, 728 insertions(+), 728 deletions(-) create mode 100644 Documentation/bpf/bpf_design_QA.rst delete mode 100644 Documentation/bpf/bpf_design_QA.txt create mode 100644 Documentation/bpf/bpf_devel_QA.rst delete mode 100644 Documentation/bpf/bpf_devel_QA.txt (limited to 'Documentation') diff --git a/Documentation/bpf/README.rst b/Documentation/bpf/README.rst index 329469c33db8..b9a80c9e9392 100644 --- a/Documentation/bpf/README.rst +++ b/Documentation/bpf/README.rst @@ -28,8 +28,8 @@ Two sets of Questions and Answers (Q&A) are maintained. .. Links: -.. _bpf_design_QA: bpf_design_QA.txt -.. _bpf_devel_QA: bpf_devel_QA.txt +.. _bpf_design_QA: bpf_design_QA.rst +.. _bpf_devel_QA: bpf_devel_QA.rst .. _Documentation/networking/filter.txt: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ .. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst new file mode 100644 index 000000000000..f3e458a0bb2f --- /dev/null +++ b/Documentation/bpf/bpf_design_QA.rst @@ -0,0 +1,156 @@ +BPF extensibility and applicability to networking, tracing, security +in the linux kernel and several user space implementations of BPF +virtual machine led to a number of misunderstanding on what BPF actually is. +This short QA is an attempt to address that and outline a direction +of where BPF is heading long term. + +Q: Is BPF a generic instruction set similar to x64 and arm64? +A: NO. + +Q: Is BPF a generic virtual machine ? +A: NO. + +BPF is generic instruction set _with_ C calling convention. + +Q: Why C calling convention was chosen? +A: Because BPF programs are designed to run in the linux kernel + which is written in C, hence BPF defines instruction set compatible + with two most used architectures x64 and arm64 (and takes into + consideration important quirks of other architectures) and + defines calling convention that is compatible with C calling + convention of the linux kernel on those architectures. + +Q: can multiple return values be supported in the future? +A: NO. BPF allows only register R0 to be used as return value. + +Q: can more than 5 function arguments be supported in the future? +A: NO. BPF calling convention only allows registers R1-R5 to be used + as arguments. BPF is not a standalone instruction set. + (unlike x64 ISA that allows msft, cdecl and other conventions) + +Q: can BPF programs access instruction pointer or return address? +A: NO. + +Q: can BPF programs access stack pointer ? +A: NO. Only frame pointer (register R10) is accessible. + From compiler point of view it's necessary to have stack pointer. + For example LLVM defines register R11 as stack pointer in its + BPF backend, but it makes sure that generated code never uses it. + +Q: Does C-calling convention diminishes possible use cases? +A: YES. BPF design forces addition of major functionality in the form + of kernel helper functions and kernel objects like BPF maps with + seamless interoperability between them. It lets kernel call into + BPF programs and programs call kernel helpers with zero overhead. + As all of them were native C code. 
That is particularly the case + for JITed BPF programs that are indistinguishable from + native kernel C code. + +Q: Does it mean that 'innovative' extensions to BPF code are disallowed? +A: Soft yes. At least for now until BPF core has support for + bpf-to-bpf calls, indirect calls, loops, global variables, + jump tables, read only sections and all other normal constructs + that C code can produce. + +Q: Can loops be supported in a safe way? +A: It's not clear yet. BPF developers are trying to find a way to + support bounded loops where the verifier can guarantee that + the program terminates in less than 4096 instructions. + +Q: How come LD_ABS and LD_IND instruction are present in BPF whereas + C code cannot express them and has to use builtin intrinsics? +A: This is artifact of compatibility with classic BPF. Modern + networking code in BPF performs better without them. + See 'direct packet access'. + +Q: It seems not all BPF instructions are one-to-one to native CPU. + For example why BPF_JNE and other compare and jumps are not cpu-like? +A: This was necessary to avoid introducing flags into ISA which are + impossible to make generic and efficient across CPU architectures. + +Q: why BPF_DIV instruction doesn't map to x64 div? +A: Because if we picked one-to-one relationship to x64 it would have made + it more complicated to support on arm64 and other archs. Also it + needs div-by-zero runtime check. + +Q: why there is no BPF_SDIV for signed divide operation? +A: Because it would be rarely used. llvm errors in such case and + prints a suggestion to use unsigned divide instead + +Q: Why BPF has implicit prologue and epilogue? +A: Because architectures like sparc have register windows and in general + there are enough subtle differences between architectures, so naive + store return address into stack won't work. Another reason is BPF has + to be safe from division by zero (and legacy exception path + of LD_ABS insn). Those instructions need to invoke epilogue and + return implicitly. + +Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning? +A: Because classic BPF didn't have them and BPF authors felt that compiler + workaround would be acceptable. Turned out that programs lose performance + due to lack of these compare instructions and they were added. + These two instructions is a perfect example what kind of new BPF + instructions are acceptable and can be added in the future. + These two already had equivalent instructions in native CPUs. + New instructions that don't have one-to-one mapping to HW instructions + will not be accepted. + +Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF + registers which makes BPF inefficient virtual machine for 32-bit + CPU architectures and 32-bit HW accelerators. Can true 32-bit registers + be added to BPF in the future? +A: NO. The first thing to improve performance on 32-bit archs is to teach + LLVM to generate code that uses 32-bit subregisters. Then second step + is to teach verifier to mark operations where zero-ing upper bits + is unnecessary. Then JITs can take advantage of those markings and + drastically reduce size of generated code and improve performance. + +Q: Does BPF have a stable ABI? +A: YES. BPF instructions, arguments to BPF programs, set of helper + functions and their arguments, recognized return codes are all part + of ABI. 
+   However, when tracing programs use the bpf_probe_read() helper
+   to walk kernel-internal data structures and are compiled against
+   kernel-internal headers, these accesses can and will break with
+   newer kernels. The union bpf_attr -> kern_version is checked at
+   load time to prevent accidentally loading kprobe-based bpf programs
+   written for a different kernel. Networking programs don't do the
+   kern_version check.
+
+Q: How much stack space does a BPF program use?
+A: Currently all program types are limited to 512 bytes of stack
+   space, but the verifier computes the actual amount of stack used
+   and both the interpreter and most JITed code consume only the
+   necessary amount.
+
+Q: Can BPF be offloaded to HW?
+A: YES. BPF HW offload is supported by the NFP driver.
+
+Q: Does the classic BPF interpreter still exist?
+A: NO. Classic BPF programs are converted into extended BPF instructions.
+
+Q: Can BPF call arbitrary kernel functions?
+A: NO. BPF programs can only call a set of helper functions which
+   is defined for every program type.
+
+Q: Can BPF overwrite arbitrary kernel memory?
+A: NO. Tracing bpf programs can _read_ arbitrary memory with the
+   bpf_probe_read() and bpf_probe_read_str() helpers. Networking
+   programs cannot read arbitrary memory, since they don't have
+   access to these helpers. Programs can never read or write
+   arbitrary memory directly.
+
+Q: Can BPF overwrite arbitrary user memory?
+A: Sort-of. Tracing BPF programs can overwrite the user memory
+   of the current task with bpf_probe_write_user(). Every time such
+   a program is loaded, the kernel will print a warning message, so
+   this helper is only useful for experiments and prototypes.
+   Tracing BPF programs are root only.
+
+Q: When the bpf_trace_printk() helper is used, the kernel prints a nasty
+   warning message. Why is that?
+A: This is done to nudge program authors toward better interfaces when
+   programs need to pass data to user space. For example,
+   bpf_perf_event_output() can be used to efficiently stream data via
+   a perf ring buffer. BPF maps can be used for asynchronous data
+   sharing between kernel and user space. bpf_trace_printk() should
+   only be used for debugging.
+
+Q: Can BPF functionality such as new program or map types, new
+   helpers, etc. be added out of kernel module code?
+A: NO.
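To make two of the answers above concrete — the single R0 return value and the
'direct packet access' that supersedes LD_ABS/LD_IND — here is a minimal
restricted-C sketch. It is illustrative only: the tc/cls_bpf usage, the
bpf_htons() helper (as provided by the kernel selftests' bpf_endian.h) and the
omitted loader/section boilerplate are assumptions, not something the Q&A
above specifies.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include "bpf_endian.h"  /* assumed: selftests helper providing bpf_htons() */

    /* Keep only IPv4 frames in a tc/cls_bpf-style program. The single
     * context argument arrives in register R1 and the int result is
     * returned in R0, per the BPF calling convention described above.
     */
    int keep_ipv4(struct __sk_buff *skb)
    {
            void *data     = (void *)(long)skb->data;
            void *data_end = (void *)(long)skb->data_end;
            struct ethhdr *eth = data;

            /* Direct packet access: the verifier rejects the program
             * unless this bounds check precedes the loads below.
             */
            if ((void *)(eth + 1) > data_end)
                    return 0;

            return eth->h_proto == bpf_htons(ETH_P_IP) ? 1 : 0;
    }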
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
new file mode 100644
index 000000000000..da57601153a0
--- /dev/null
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -0,0 +1,570 @@
+This document provides information for the BPF subsystem about various
+workflows related to reporting bugs, submitting patches, and queueing
+patches for stable kernels.
+
+For general information about submitting patches, please refer to
+Documentation/process/. This document only describes additional specifics
+related to BPF.
+
+Reporting bugs:
+---------------
+
+Q: How do I report bugs for BPF kernel code?
+
+A: Since all BPF kernel development as well as bpftool and iproute2 BPF
+   loader development happens through the netdev kernel mailing list,
+   please report any issues found around BPF to the following mailing
+   list:
+
+     netdev@vger.kernel.org
+
+   This may also include issues related to XDP, BPF tracing, etc.
+
+   Given netdev has a high volume of traffic, please also add the BPF
+   maintainers to Cc (from the kernel MAINTAINERS file):
+
+     Alexei Starovoitov
+     Daniel Borkmann
+
+   In case a buggy commit has already been identified, make sure to keep
+   the actual commit authors in Cc as well for the report. They can
+   typically be identified through the kernel's git tree.
+
+   Please do *not* report BPF issues to bugzilla.kernel.org, since it
+   is a guarantee that the reported issue will be overlooked.
+
+Submitting patches:
+-------------------
+
+Q: To which mailing list do I need to submit my BPF patches?
+
+A: Please submit your BPF patches to the netdev kernel mailing list:
+
+     netdev@vger.kernel.org
+
+   Historically, BPF came out of networking and has always been maintained
+   by the kernel networking community. Although these days BPF touches
+   many other subsystems as well, the patches are still routed mainly
+   through the networking community.
+
+   In case your patch has changes in several different subsystems (e.g.
+   tracing, security, etc), make sure to Cc the related kernel mailing
+   lists and maintainers from there as well, so they are able to review
+   the changes and provide their Acked-by's to the patches.
+
+Q: Where can I find patches currently under discussion for the BPF
+   subsystem?
+
+A: All patches that are Cc'ed to netdev are queued for review under the
+   netdev patchwork project:
+
+     http://patchwork.ozlabs.org/project/netdev/list/
+
+   Those patches which target BPF are assigned to a 'bpf' delegate for
+   further processing by the BPF maintainers. The current queue with
+   patches under review can be found at:
+
+     https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
+
+   Once the patches have been reviewed by the BPF community as a whole
+   and approved by the BPF maintainers, their status in patchwork will be
+   changed to 'Accepted' and the submitter will be notified by mail. This
+   means that the patches look good from a BPF perspective and have been
+   applied to one of the two BPF kernel trees.
+
+   In case feedback from the community requires a respin of the patches,
+   their status in patchwork will be set to 'Changes Requested', and the
+   patches are purged from the current review queue. Likewise for cases
+   where patches would get rejected or are not applicable to the BPF
+   trees (but assigned to the 'bpf' delegate).
+
+Q: How do the changes make their way into Linux?
+
+A: There are two BPF kernel trees (git repositories). Once patches have
+   been accepted by the BPF maintainers, they will be applied to one
+   of the two BPF trees:
+
+     https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/
+     https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
+
+   The bpf tree itself is for fixes only, whereas bpf-next is for
+   features, cleanups or other kinds of improvements ("next-like"
+   content). This is analogous to the net and net-next trees for
+   networking.
+   Both bpf and bpf-next have only a master branch, in order to keep it
+   simple against which branch patches should be rebased.
+
+   Accumulated BPF patches in the bpf tree will regularly get pulled
+   into the net kernel tree. Likewise, accumulated BPF patches accepted
+   into the bpf-next tree will make their way into the net-next tree. net
+   and net-next are both run by David S. Miller. From there, they will go
+   into the kernel mainline tree run by Linus Torvalds. To read up on the
+   process of net and net-next being merged into the mainline tree, see
+   the netdev FAQ under:
+
+     Documentation/networking/netdev-FAQ.txt
+
+   Occasionally, to prevent merge conflicts, we might send pull requests
+   to other trees (e.g. tracing) with a small subset of the patches, but
+   net and net-next are always the main trees targeted for integration.
+
+   The pull requests will contain a high-level summary of the accumulated
+   patches and can be searched on the netdev kernel mailing list through
+   the following subject lines (yyyy-mm-dd is the date of the pull
+   request):
+
+     pull-request: bpf yyyy-mm-dd
+     pull-request: bpf-next yyyy-mm-dd
+
+Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be
+   applied to?
+
+A: The process is the very same as described in the netdev FAQ, so
+   please read up on it. The subject line must indicate whether the
+   patch is a fix or rather "next-like" content in order to let the
+   maintainers know whether it is targeted at bpf or bpf-next.
+
+   For fixes eventually landing in the bpf -> net tree, the subject must
+   look like:
+
+     git format-patch --subject-prefix='PATCH bpf' start..finish
+
+   For features/improvements/etc that should eventually land in
+   bpf-next -> net-next, the subject must look like:
+
+     git format-patch --subject-prefix='PATCH bpf-next' start..finish
+
+   If you are unsure whether the patch or patch series should go into bpf
+   or net directly, or bpf-next or net-next directly, it is not a
+   problem if the subject line names net or net-next as the target.
+   It is eventually up to the maintainers to do the delegation of
+   the patches.
+
+   If it is clear that patches should go into the bpf or bpf-next tree,
+   please make sure to rebase the patches against those trees in
+   order to reduce potential conflicts.
+
+   In case the patch or patch series has to be reworked and sent out
+   again in a second or later revision, it is also required to add a
+   version number (v2, v3, ...) to the subject prefix:
+
+     git format-patch --subject-prefix='PATCH net-next v2' start..finish
+
+   When changes have been requested to the patch series, always send the
+   whole patch series again with the feedback incorporated (never send
+   individual diffs on top of the old series).
+
+Q: What does it mean when a patch gets applied to the bpf or bpf-next
+   tree?
+
+A: It means that the patch looks good for mainline inclusion from
+   a BPF point of view.
+
+   Be aware that this is not a final verdict that the patch will
+   automatically get accepted into the net or net-next trees eventually:
+
+   On the netdev kernel mailing list, reviews can come in at any point
+   in time. If discussions around a patch conclude that it cannot
+   be included as-is, we will either apply a follow-up fix or drop
+   it from the trees entirely. Therefore, we also reserve the right to
+   rebase the trees when deemed necessary.
+   After all, the purpose of the tree
+   is to i) accumulate and stage BPF patches for integration into trees
+   like net and net-next, and ii) run an extensive BPF test suite and
+   workloads on the patches before they make their way any further.
+
+   Once the BPF pull request has been accepted by David S. Miller,
+   the patches end up in the net or net-next tree, respectively, and
+   make their way from there further into mainline. Again, see the
+   netdev FAQ for additional information, e.g. on how often they are
+   merged to mainline.
+
+Q: How long do I need to wait for feedback on my BPF patches?
+
+A: We try to keep the latency low. The usual time to feedback will
+   be around 2 or 3 business days. It may vary depending on the
+   complexity of changes and current patch load.
+
+Q: How often do you send pull requests to major kernel trees like
+   net or net-next?
+
+A: Pull requests will be sent out rather often in order to not
+   accumulate too many patches in bpf or bpf-next.
+
+   As a rule of thumb, expect pull requests for each tree regularly
+   at the end of the week. In some cases pull requests may also come
+   in the middle of the week, depending on the current patch load or
+   urgency.
+
+Q: Are patches applied to bpf-next when the merge window is open?
+
+A: While the merge window is open, bpf-next will not be
+   processed. This is roughly analogous to net-next patch processing,
+   so feel free to read up on the netdev FAQ about further details.
+
+   During those two weeks of merge window, we might ask you to resend
+   your patch series once bpf-next is open again. Once Linus has released
+   a v*-rc1 after the merge window, we continue processing bpf-next.
+
+   For non-subscribers to kernel mailing lists, there is also a status
+   page run by David S. Miller on net-next that provides guidance:
+
+     http://vger.kernel.org/~davem/net-next.html
+
+Q: I made a BPF verifier change, do I need to add test cases for
+   BPF kernel selftests?
+
+A: If the patch has changes to the behavior of the verifier, then yes,
+   it is absolutely necessary to add test cases to the BPF kernel
+   selftests suite. If they are not present and we think they are
+   needed, then we might ask for them before accepting any changes.
+
+   In particular, test_verifier.c is tracking a high number of BPF test
+   cases, including a lot of corner cases that the LLVM BPF back end may
+   generate out of the restricted C code. Adding test cases is
+   absolutely crucial to make sure future changes do not accidentally
+   affect prior use-cases. Therefore, treat those test cases as follows:
+   verifier behavior that is not tracked in test_verifier.c could
+   potentially be subject to change.
+
+Q: When should I add code to samples/bpf/ and when to BPF kernel
+   selftests?
+
+A: In general, we prefer additions to BPF kernel selftests rather than
+   samples/bpf/. The rationale is very simple: kernel selftests are
+   regularly run by various bots to test for kernel regressions.
+
+   The more test cases we add to BPF selftests, the better the coverage
+   and the less likely it is that those could accidentally break. It is
+   not that BPF kernel selftests cannot demo how a specific feature can
+   be used.
+
+   That said, samples/bpf/ may be a good place for people to get started,
+   so it might be advisable that simple demos of features could go into
+   samples/bpf/, but advanced functional and corner-case testing rather
+   into kernel selftests.
+
+   If your sample looks like a test case, then go for BPF kernel
+   selftests instead!
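As a point of reference — and as an illustrative sketch only, since the field
names follow the selftests' struct bpf_test table and may change over time — a
minimal test_verifier.c entry looks roughly like this:

    {
            "illustrative: empty program returns 0",
            .insns = {
                    BPF_MOV64_IMM(BPF_REG_0, 0),    /* r0 = 0 */
                    BPF_EXIT_INSN(),                /* return r0 */
            },
            .result = ACCEPT,
            .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    },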
+
+Q: When should I add code to the bpftool?
+
+A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
+   a central user space tool for debugging and introspection of BPF
+   programs and maps that are active in the kernel. If UAPI changes
+   related to BPF enable dumping additional information about programs
+   or maps, then bpftool should be extended as well to support dumping
+   them.
+
+Q: When should I add code to iproute2's BPF loader?
+
+A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the
+   convention is that those control-path related changes are added to
+   iproute2's BPF loader as well from the user space side. This is not
+   only useful to have UAPI changes properly designed to be usable, but
+   also to make those changes available to a wider user base of major
+   downstream distributions.
+
+Q: Do you accept patches as well for iproute2's BPF loader?
+
+A: Patches for iproute2's BPF loader have to be sent to:
+
+     netdev@vger.kernel.org
+
+   While those patches are not processed by the BPF kernel maintainers,
+   please keep them in Cc as well, so they can be reviewed.
+
+   The official git repository for iproute2 is run by Stephen Hemminger
+   and can be found at:
+
+     https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/
+
+   The patches need to have a subject prefix of '[PATCH iproute2 master]'
+   or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the
+   target branch the patch should be applied to. Meaning, if kernel
+   changes went into the net-next kernel tree, then the related iproute2
+   changes need to go into the iproute2 net-next branch, otherwise they
+   can be targeted at the master branch. The iproute2 net-next branch
+   will get merged into the master branch after the current iproute2
+   version from master has been released.
+
+   Like BPF, the patches end up in patchwork under the netdev project and
+   are delegated to 'shemminger' for further processing:
+
+     http://patchwork.ozlabs.org/project/netdev/list/?delegate=389
+
+Q: What is the minimum requirement before I submit my BPF patches?
+
+A: When submitting patches, always take the time and properly test your
+   patches *prior* to submission. Never rush them! If maintainers find
+   that your patches have not been properly tested, it is a good way to
+   get them grumpy. Testing patch submissions is a hard requirement!
+
+   Note, fixes that go to the bpf tree *must* have a Fixes: tag included.
+   The same applies to fixes that target bpf-next, where the affected
+   commit is in net-next (or in some cases bpf-next). The Fixes: tag is
+   crucial in order to identify follow-up commits and tremendously helps
+   people who have to do backporting, so it is a must have!
+
+   We also don't accept patches with an empty commit message. Take your
+   time and properly write up a high quality commit message, it is
+   essential!
+
+   Think about it this way: other developers looking at your code a month
+   from now need to understand *why* a certain change has been done that
+   way, and whether there have been flaws in the analysis or assumptions
+   that the original author made. Thus providing a proper rationale and
+   describing the use-case for the changes is a must.
+
+   Patch submissions with >1 patch must have a cover letter which
+   includes a high level description of the series. This high level
+   summary will then be placed into the merge commit by the BPF
+   maintainers such that it is also accessible from the git log for
+   future reference.
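For illustration, a Fixes: tag uses the standard kernel format — an
abbreviated (12+ character) hash of the offending commit followed by its
subject line in parentheses. The hash and subject below are hypothetical:

    Fixes: 123456789abc ("bpf: subject line of the offending commit")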
+
+Q: What do I need to consider when adding a new instruction or feature
+   that would require BPF JIT and/or LLVM integration as well?
+
+A: We try hard to keep all BPF JITs up to date such that the same user
+   experience can be guaranteed when running BPF programs on different
+   architectures without having the program punt to the less efficient
+   interpreter in case the in-kernel BPF JIT is enabled.
+
+   If you are unable to implement or test the required JIT changes for
+   certain architectures, please work together with the related BPF JIT
+   developers in order to get the feature implemented in a timely manner.
+   Please refer to the git log (arch/*/net/) to locate the necessary
+   people for helping out.
+
+   Also always make sure to add BPF test cases (e.g. test_bpf.c and
+   test_verifier.c) for new instructions, so that they can receive
+   broad test coverage and help with run-time testing of the various
+   BPF JITs.
+
+   In case of new BPF instructions, once the changes have been accepted
+   into the Linux kernel, please implement support in LLVM's BPF back
+   end. See the LLVM section below for further information.
+
+Stable submission:
+------------------
+
+Q: I need a specific BPF commit in stable kernels. What should I do?
+
+A: In case you need a specific fix in stable kernels, first check whether
+   the commit has already been applied in the related linux-*.y branches:
+
+     https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/
+
+   If that is not the case, drop an email to the BPF maintainers with the
+   netdev kernel mailing list in Cc and ask for the fix to be queued up:
+
+     netdev@vger.kernel.org
+
+   The process in general is the same as on netdev itself, see also the
+   netdev FAQ document.
+
+Q: Do you also backport to kernels not currently maintained as stable?
+
+A: No. If you need a specific BPF commit in kernels that are currently
+   not maintained by the stable maintainers, then you are on your own.
+
+   The current stable and longterm stable kernels are all listed here:
+
+     https://www.kernel.org/
+
+Q: The BPF patch I am about to submit needs to go to stable as well.
+   What should I do?
+
+A: The same rules apply as with netdev patch submissions in general, see
+   the netdev FAQ under:
+
+     Documentation/networking/netdev-FAQ.txt
+
+   Never add "Cc: stable@vger.kernel.org" to the patch description, but
+   ask the BPF maintainers to queue the patches instead. This can be done
+   with a note, for example, under the "---" part of the patch which does
+   not go into the git log. Alternatively, this can be done as a simple
+   request by mail instead.
+
+Q: Where do I find currently queued BPF patches that will be submitted
+   to stable?
+
+A: Once patches that fix critical bugs have been applied to the bpf tree,
+   they are queued up for stable submission under:
+
+     http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*
+
+   They will be on hold there at minimum until the related commit has
+   made its way into the mainline kernel tree.
+
+   After having been under broader exposure, the queued patches will be
+   submitted by the BPF maintainers to the stable maintainers.
+
+Testing patches:
+----------------
+
+Q: Which BPF kernel selftests version should I run my kernel against?
+
+A: If you run a kernel xyz, then always run the BPF kernel selftests from
+   that kernel xyz as well. Do not expect that the BPF selftests from the
+   latest mainline tree will pass all the time.
+
+   In particular, test_bpf.c and test_verifier.c have a large number of
+   test cases and are constantly updated with new BPF test sequences, or
+   existing ones are adapted to verifier changes, e.g. due to the
+   verifier becoming smarter and being able to better track certain
+   things.
+
+LLVM:
+-----
+
+Q: Where do I find LLVM with BPF support?
+
+A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
+
+   All major distributions these days ship LLVM with the BPF back end
+   enabled, so for the majority of use-cases it is not required to
+   compile LLVM by hand anymore; just install the distribution-provided
+   package.
+
+   LLVM's static compiler lists the supported targets through
+   'llc --version'; make sure the BPF targets are listed. Example:
+
+     $ llc --version
+     LLVM (http://llvm.org/):
+       LLVM version 6.0.0svn
+       Optimized build.
+       Default target: x86_64-unknown-linux-gnu
+       Host CPU: skylake
+
+       Registered Targets:
+         bpf    - BPF (host endian)
+         bpfeb  - BPF (big endian)
+         bpfel  - BPF (little endian)
+         x86    - 32-bit X86: Pentium-Pro and above
+         x86-64 - 64-bit X86: EM64T and AMD64
+
+   For developers, in order to utilize the latest features added to
+   LLVM's BPF back end, it is advisable to run the latest LLVM releases.
+   Support for new BPF kernel features such as additions to the BPF
+   instruction set is often developed together.
+
+   All LLVM releases can be found at: http://releases.llvm.org/
+
+Q: Got it, so how do I build LLVM manually anyway?
+
+A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
+   that set up, proceed with building the latest LLVM and clang version
+   from the git repositories:
+
+     $ git clone http://llvm.org/git/llvm.git
+     $ cd llvm/tools
+     $ git clone --depth 1 http://llvm.org/git/clang.git
+     $ cd ..; mkdir build; cd build
+     $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
+                -DBUILD_SHARED_LIBS=OFF           \
+                -DCMAKE_BUILD_TYPE=Release        \
+                -DLLVM_BUILD_RUNTIME=OFF
+     $ make -j $(getconf _NPROCESSORS_ONLN)
+
+   The built binaries can then be found in the build/bin/ directory, to
+   which you can point your PATH variable.
+
+Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
+   generation back end or about LLVM generated code that the verifier
+   refuses to accept?
+
+A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF
+   infrastructure and it ties deeply into verification of programs from
+   the kernel side. Therefore, any issues on either side need to be
+   investigated and fixed whenever necessary.
+
+   Please make sure to bring them up on the netdev kernel mailing
+   list and Cc the BPF maintainers for the LLVM and kernel bits:
+
+     Yonghong Song
+     Alexei Starovoitov
+     Daniel Borkmann
+
+   LLVM also has an issue tracker where BPF related bugs can be found:
+
+     https://bugs.llvm.org/buglist.cgi?quicksearch=bpf
+
+   However, it is better to reach out through mailing lists with the
+   maintainers in Cc.
+
+Q: I have added a new BPF instruction to the kernel, how can I integrate
+   it into LLVM?
+
+A: LLVM has a -mcpu selector for the BPF back end in order to allow the
+   selection of BPF instruction set extensions. By default the 'generic'
+   processor target is used, which is the base instruction set (v1) of
+   BPF.
+
+   LLVM has an option to select -mcpu=probe, where it will probe the host
+   kernel for supported BPF instruction set extensions and select the
+   optimal set automatically.
+
+   For cross-compilation, a specific version can be selected manually as
+   well.
+
+     $ llc -march bpf -mcpu=help
+     Available CPUs for this target:
+
+       generic - Select the generic processor.
+       probe   - Select the probe processor.
+       v1      - Select the v1 processor.
+       v2      - Select the v2 processor.
+     [...]
+
+   Newly added BPF instructions to the Linux kernel need to follow the
+   same scheme: bump the instruction set version and implement probing
+   for the extensions, such that -mcpu=probe users can benefit from the
+   optimization transparently when upgrading their kernels.
+
+   If you are unable to implement support for the newly added BPF
+   instruction, please reach out to BPF developers for help.
+
+   By the way, the BPF kernel selftests run with -mcpu=probe for better
+   test coverage.
+
+Q: In some cases the clang flag "-target bpf" is used but in other cases
+   the default clang target, which matches the underlying architecture,
+   is used. What is the difference and when should I use which?
+
+A: Although LLVM IR generation and optimization try to stay architecture
+   independent, "-target" still has some impact on generated code:
+
+   - A BPF program may recursively include header file(s) with file-scope
+     inline assembly code. The default target can handle this well,
+     while the bpf target may fail if the bpf back end assembler does not
+     understand these assembly codes, which is true in most cases.
+
+   - When compiled without -g, additional elf sections, e.g.,
+     .eh_frame and .rela.eh_frame, may be present in the object file
+     with the default target, but not with the bpf target.
+
+   - The default target may turn a C switch statement into a switch table
+     lookup and jump operation. Since the switch table is placed
+     in the global read-only section, the bpf program will fail to load.
+     The bpf target does not support switch table optimization.
+     The clang option "-fno-jump-tables" can be used to disable
+     switch table generation.
+
+   - For clang -target bpf, it is guaranteed that pointer or long /
+     unsigned long types will always have a width of 64 bit, no matter
+     whether the underlying clang binary or default target (or kernel)
+     is 32 bit. However, when the native clang target is used, it will
+     compile these types based on the underlying architecture's
+     conventions, meaning that on a 32-bit architecture, pointer or
+     long / unsigned long types, e.g. in the BPF context structure, will
+     have a width of 32 bit while the BPF LLVM back end still operates
+     in 64 bit. The native target is mostly needed in tracing for the
+     case of walking pt_regs or other kernel structures where the CPU's
+     register width matters. Otherwise, clang -target bpf is generally
+     recommended.
+
+   You should use the default target when:
+
+   - Your program includes a header file, e.g., ptrace.h, which
+     eventually pulls in some header files containing file-scope host
+     assembly code.
+   - You can add "-fno-jump-tables" to work around the switch table
+     issue.
+
+   Otherwise, you can use the bpf target. Additionally, you _must_ use
+   the bpf target when:
+
+   - Your program uses data structures with pointer or long / unsigned
+     long types that interface with BPF helpers or context data
+     structures. Access into these structures is verified by the BPF
+     verifier and may result in verification failures if the native
+     architecture is not aligned with the BPF architecture, e.g. 64-bit.
+     An example of this is BPF_PROG_TYPE_SK_MSG, which requires
+     '-target bpf'.
+
+Happy BPF hacking!
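To make that last bullet concrete, here is a hypothetical sketch of why
BPF_PROG_TYPE_SK_MSG depends on '-target bpf': its context, struct sk_msg_md,
exposes plain 'void *' fields, and the verifier checks accesses against
64-bit pointer widths. The program below is illustrative only; SK_PASS and
SK_DROP come from the UAPI linux/bpf.h, and loader/section boilerplate is
omitted.

    #include <linux/bpf.h>

    /* With '-target bpf', 'void *' is always 64 bit wide, matching the
     * verifier's view of struct sk_msg_md. Built with a native 32-bit
     * target, the pointer loads get the wrong width and the program
     * fails verification.
     */
    int msg_filter(struct sk_msg_md *msg)
    {
            void *data     = msg->data;
            void *data_end = msg->data_end;

            if (data + 1 > data_end)        /* mandatory bounds check */
                    return SK_DROP;

            return SK_PASS;
    }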
An example of this is
-   BPF_PROG_TYPE_SK_MSG require '-target bpf'
-
-Happy BPF hacking!
-- cgit v1.2.3


From 1a6ac1d59dc3b4077c643c3be70f9e650e267afe Mon Sep 17 00:00:00 2001
From: Jesper Dangaard Brouer
Date: Mon, 14 May 2018 15:42:22 +0200
Subject: bpf, doc: convert bpf_design_QA.rst to use RST formatting

The RST formatting is done such that, when rendered or converted to
different formats, an automatic index with links to the subsections is
created. Thus, the questions are created as sections (or subsections),
in order to get the wanted auto-generated FAQ/QA index.

Special thanks to Quentin Monnet, who has reviewed and corrected
both RST formatting and GitHub rendering issues in this file.
Those commits have been squashed.

I've manually tested that this also renders nicely if included as part
of the kernel 'make htmldocs', as the end goal is for this to become
more integrated with the kernel-doc project/movement.

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Alexei Starovoitov
---
 Documentation/bpf/bpf_design_QA.rst | 223 +++++++++++++++++++++++-------------
 1 file changed, 144 insertions(+), 79 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index f3e458a0bb2f..6780a6d81745 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -1,156 +1,221 @@
+==============
+BPF Design Q&A
+==============
+
 BPF extensibility and applicability to networking, tracing, security
 in the linux kernel and several user space implementations of BPF
 virtual machine led to a number of misunderstanding on what BPF actually is.
 This short QA is an attempt to address that and outline a direction
 of where BPF is heading long term.
 
+.. contents::
+   :local:
+   :depth: 3
+
+Questions and Answers
+=====================
+
 Q: Is BPF a generic instruction set similar to x64 and arm64?
+-------------------------------------------------------------
 A: NO.
 
 Q: Is BPF a generic virtual machine ?
+-------------------------------------
 A: NO.
 
-BPF is generic instruction set _with_ C calling convention.
+BPF is generic instruction set *with* C calling convention.
+-----------------------------------------------------------
 
 Q: Why C calling convention was chosen?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 A: Because BPF programs are designed to run in the linux kernel
-   which is written in C, hence BPF defines instruction set compatible
-   with two most used architectures x64 and arm64 (and takes into
-   consideration important quirks of other architectures) and
-   defines calling convention that is compatible with C calling
-   convention of the linux kernel on those architectures.
+which is written in C, hence BPF defines an instruction set compatible
+with the two most used architectures, x64 and arm64 (and takes into
+consideration important quirks of other architectures), and
+defines a calling convention that is compatible with the C calling
+convention of the linux kernel on those architectures.
 
 Q: can multiple return values be supported in the future?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: NO. BPF allows only register R0 to be used as return value.
 
 Q: can more than 5 function arguments be supported in the future?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: NO. BPF calling convention only allows registers R1-R5 to be used
-   as arguments.
BPF is not a standalone instruction set. +(unlike x64 ISA that allows msft, cdecl and other conventions) Q: can BPF programs access instruction pointer or return address? +----------------------------------------------------------------- A: NO. Q: can BPF programs access stack pointer ? -A: NO. Only frame pointer (register R10) is accessible. - From compiler point of view it's necessary to have stack pointer. - For example LLVM defines register R11 as stack pointer in its - BPF backend, but it makes sure that generated code never uses it. +------------------------------------------ +A: NO. + +Only frame pointer (register R10) is accessible. +From compiler point of view it's necessary to have stack pointer. +For example LLVM defines register R11 as stack pointer in its +BPF backend, but it makes sure that generated code never uses it. Q: Does C-calling convention diminishes possible use cases? -A: YES. BPF design forces addition of major functionality in the form - of kernel helper functions and kernel objects like BPF maps with - seamless interoperability between them. It lets kernel call into - BPF programs and programs call kernel helpers with zero overhead. - As all of them were native C code. That is particularly the case - for JITed BPF programs that are indistinguishable from - native kernel C code. +----------------------------------------------------------- +A: YES. + +BPF design forces addition of major functionality in the form +of kernel helper functions and kernel objects like BPF maps with +seamless interoperability between them. It lets kernel call into +BPF programs and programs call kernel helpers with zero overhead. +As all of them were native C code. That is particularly the case +for JITed BPF programs that are indistinguishable from +native kernel C code. Q: Does it mean that 'innovative' extensions to BPF code are disallowed? -A: Soft yes. At least for now until BPF core has support for - bpf-to-bpf calls, indirect calls, loops, global variables, - jump tables, read only sections and all other normal constructs - that C code can produce. +------------------------------------------------------------------------ +A: Soft yes. + +At least for now until BPF core has support for +bpf-to-bpf calls, indirect calls, loops, global variables, +jump tables, read only sections and all other normal constructs +that C code can produce. Q: Can loops be supported in a safe way? -A: It's not clear yet. BPF developers are trying to find a way to - support bounded loops where the verifier can guarantee that - the program terminates in less than 4096 instructions. +---------------------------------------- +A: It's not clear yet. + +BPF developers are trying to find a way to +support bounded loops where the verifier can guarantee that +the program terminates in less than 4096 instructions. + +Instruction level questions +--------------------------- + +Q: LD_ABS and LD_IND instructions vs C code +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Q: How come LD_ABS and LD_IND instruction are present in BPF whereas - C code cannot express them and has to use builtin intrinsics? +C code cannot express them and has to use builtin intrinsics? + A: This is artifact of compatibility with classic BPF. Modern - networking code in BPF performs better without them. - See 'direct packet access'. +networking code in BPF performs better without them. +See 'direct packet access'. 
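
To make the 'direct packet access' reference above concrete, here is a
minimal restricted-C sketch. It is illustrative only and not part of the
patch itself; the program and section names are made up:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>

    #define SEC(name) __attribute__((section(name), used))

    SEC("xdp")
    int xdp_example(struct xdp_md *ctx)
    {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;

            /* The explicit bounds check against data_end is what lets
             * the verifier prove the direct loads below are safe.
             */
            if (data + sizeof(*eth) > data_end)
                    return XDP_ABORTED;

            /* Packet bytes are read directly, no LD_ABS/LD_IND needed. */
            if (eth->h_proto == 0)
                    return XDP_DROP;

            return XDP_PASS;
    }
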
+Q: BPF instructions mapping not one-to-one to native CPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Q: It seems not all BPF instructions are one-to-one to native CPU.
-   For example why BPF_JNE and other compare and jumps are not cpu-like?
+For example why BPF_JNE and other compare and jumps are not cpu-like?
+
 A: This was necessary to avoid introducing flags into ISA which are
-   impossible to make generic and efficient across CPU architectures.
+impossible to make generic and efficient across CPU architectures.

 Q: why BPF_DIV instruction doesn't map to x64 div?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: Because if we picked one-to-one relationship to x64 it would have made
-   it more complicated to support on arm64 and other archs. Also it
-   needs div-by-zero runtime check.
+it more complicated to support on arm64 and other archs. Also it
+needs a div-by-zero runtime check.

 Q: why there is no BPF_SDIV for signed divide operation?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: Because it would be rarely used. llvm errors in such case and
-   prints a suggestion to use unsigned divide instead
+prints a suggestion to use an unsigned divide instead.

 Q: Why BPF has implicit prologue and epilogue?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: Because architectures like sparc have register windows and in general
-   there are enough subtle differences between architectures, so naive
-   store return address into stack won't work. Another reason is BPF has
-   to be safe from division by zero (and legacy exception path
-   of LD_ABS insn). Those instructions need to invoke epilogue and
-   return implicitly.
+there are enough subtle differences between architectures, so a naive
+store of the return address onto the stack won't work. Another reason is
+that BPF has to be safe from division by zero (and the legacy exception
+path of the LD_ABS insn). Those instructions need to invoke the epilogue
+and return implicitly.

 Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 A: Because classic BPF didn't have them and BPF authors felt that compiler
-   workaround would be acceptable. Turned out that programs lose performance
-   due to lack of these compare instructions and they were added.
-   These two instructions is a perfect example what kind of new BPF
-   instructions are acceptable and can be added in the future.
-   These two already had equivalent instructions in native CPUs.
-   New instructions that don't have one-to-one mapping to HW instructions
-   will not be accepted.
-
+workaround would be acceptable. It turned out that programs lose
+performance due to the lack of these compare instructions, so they were
+added. These two instructions are a perfect example of what kind of new
+BPF instructions are acceptable and can be added in the future.
+These two already had equivalent instructions in native CPUs.
+New instructions that don't have a one-to-one mapping to HW instructions
+will not be accepted.
+
+Q: BPF 32-bit subregister requirements
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
-   registers which makes BPF inefficient virtual machine for 32-bit
-   CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
-   be added to BPF in the future?
+registers, which makes BPF an inefficient virtual machine for 32-bit
+CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
+be added to BPF in the future?
+
 A: NO. The first thing to improve performance on 32-bit archs is to teach
-   LLVM to generate code that uses 32-bit subregisters. Then second step
-   is to teach verifier to mark operations where zero-ing upper bits
-   is unnecessary. Then JITs can take advantage of those markings and
-   drastically reduce size of generated code and improve performance.
+LLVM to generate code that uses 32-bit subregisters. The second step
+is to teach the verifier to mark operations where zeroing the upper bits
+is unnecessary. Then JITs can take advantage of those markings and
+drastically reduce the size of generated code and improve performance.

 Q: Does BPF have a stable ABI?
+------------------------------
 A: YES. BPF instructions, arguments to BPF programs, set of helper
-   functions and their arguments, recognized return codes are all part
-   of ABI. However when tracing programs are using bpf_probe_read() helper
-   to walk kernel internal datastructures and compile with kernel
-   internal headers these accesses can and will break with newer
-   kernels. The union bpf_attr -> kern_version is checked at load time
-   to prevent accidentally loading kprobe-based bpf programs written
-   for a different kernel. Networking programs don't do kern_version check.
+functions and their arguments, and recognized return codes are all part
+of the ABI. However, when tracing programs use the bpf_probe_read()
+helper to walk kernel internal data structures and compile with kernel
+internal headers, these accesses can and will break with newer
+kernels. The union bpf_attr -> kern_version is checked at load time
+to prevent accidentally loading kprobe-based bpf programs written
+for a different kernel. Networking programs don't do a kern_version check.

 Q: How much stack space a BPF program uses?
+-------------------------------------------
 A: Currently all program types are limited to 512 bytes of stack
-   space, but the verifier computes the actual amount of stack used
-   and both interpreter and most JITed code consume necessary amount.
+space, but the verifier computes the actual amount of stack used,
+and both the interpreter and most JITed code consume only the
+necessary amount.

 Q: Can BPF be offloaded to HW?
+------------------------------
 A: YES. BPF HW offload is supported by the NFP driver.

 Q: Does classic BPF interpreter still exist?
+--------------------------------------------
 A: NO. Classic BPF programs are converted into extended BPF instructions.

 Q: Can BPF call arbitrary kernel functions?
+-------------------------------------------
 A: NO. BPF programs can only call a set of helper functions which
-   is defined for every program type.
+is defined for every program type.

 Q: Can BPF overwrite arbitrary kernel memory?
-A: NO. Tracing bpf programs can _read_ arbitrary memory with bpf_probe_read()
-   and bpf_probe_read_str() helpers. Networking programs cannot read
-   arbitrary memory, since they don't have access to these helpers.
-   Programs can never read or write arbitrary memory directly.
+---------------------------------------------
+A: NO.
+
+Tracing bpf programs can *read* arbitrary memory with the bpf_probe_read()
+and bpf_probe_read_str() helpers. Networking programs cannot read
+arbitrary memory, since they don't have access to these helpers.
+Programs can never read or write arbitrary memory directly.

 Q: Can BPF overwrite arbitrary user memory?
-A: Sort-of. Tracing BPF programs can overwrite the user memory
-   of the current task with bpf_probe_write_user().
Every time such - program is loaded the kernel will print warning message, so - this helper is only useful for experiments and prototypes. - Tracing BPF programs are root only. +------------------------------------------- +A: Sort-of. + +Tracing BPF programs can overwrite the user memory +of the current task with bpf_probe_write_user(). Every time such +program is loaded the kernel will print warning message, so +this helper is only useful for experiments and prototypes. +Tracing BPF programs are root only. +Q: bpf_trace_printk() helper warning +------------------------------------ Q: When bpf_trace_printk() helper is used the kernel prints nasty - warning message. Why is that? +warning message. Why is that? + A: This is done to nudge program authors into better interfaces when - programs need to pass data to user space. Like bpf_perf_event_output() - can be used to efficiently stream data via perf ring buffer. - BPF maps can be used for asynchronous data sharing between kernel - and user space. bpf_trace_printk() should only be used for debugging. +programs need to pass data to user space. Like bpf_perf_event_output() +can be used to efficiently stream data via perf ring buffer. +BPF maps can be used for asynchronous data sharing between kernel +and user space. bpf_trace_printk() should only be used for debugging. +Q: New functionality via kernel modules? +---------------------------------------- Q: Can BPF functionality such as new program or map types, new - helpers, etc be added out of kernel module code? +helpers, etc be added out of kernel module code? + A: NO. -- cgit v1.2.3 From 542228384888f5ad11fa6ffd59947a29a1f4452e Mon Sep 17 00:00:00 2001 From: Jesper Dangaard Brouer Date: Mon, 14 May 2018 15:42:27 +0200 Subject: bpf, doc: convert bpf_devel_QA.rst to use RST formatting Same story as bpf_design_QA.rst RST format conversion. Again thanks to Quentin Monnet for fixes and patches that have been squashed. Signed-off-by: Jesper Dangaard Brouer Signed-off-by: Alexei Starovoitov --- Documentation/bpf/bpf_devel_QA.rst | 799 +++++++++++++++++++------------------ 1 file changed, 420 insertions(+), 379 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index da57601153a0..2254bdeae990 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -1,424 +1,446 @@ +================================= +HOWTO interact with BPF subsystem +================================= + This document provides information for the BPF subsystem about various workflows related to reporting bugs, submitting patches, and queueing patches for stable kernels. For general information about submitting patches, please refer to -Documentation/process/. This document only describes additional specifics +`Documentation/process/`_. This document only describes additional specifics related to BPF. -Reporting bugs: ---------------- +.. contents:: + :local: + :depth: 2 -Q: How do I report bugs for BPF kernel code? +Reporting bugs +============== +Q: How do I report bugs for BPF kernel code? 
+-------------------------------------------- A: Since all BPF kernel development as well as bpftool and iproute2 BPF - loader development happens through the netdev kernel mailing list, - please report any found issues around BPF to the following mailing - list: +loader development happens through the netdev kernel mailing list, +please report any found issues around BPF to the following mailing +list: - netdev@vger.kernel.org + netdev@vger.kernel.org - This may also include issues related to XDP, BPF tracing, etc. +This may also include issues related to XDP, BPF tracing, etc. - Given netdev has a high volume of traffic, please also add the BPF - maintainers to Cc (from kernel MAINTAINERS file): +Given netdev has a high volume of traffic, please also add the BPF +maintainers to Cc (from kernel MAINTAINERS_ file): - Alexei Starovoitov - Daniel Borkmann +* Alexei Starovoitov +* Daniel Borkmann - In case a buggy commit has already been identified, make sure to keep - the actual commit authors in Cc as well for the report. They can - typically be identified through the kernel's git tree. +In case a buggy commit has already been identified, make sure to keep +the actual commit authors in Cc as well for the report. They can +typically be identified through the kernel's git tree. - Please do *not* report BPF issues to bugzilla.kernel.org since it - is a guarantee that the reported issue will be overlooked. +**Please do NOT report BPF issues to bugzilla.kernel.org since it +is a guarantee that the reported issue will be overlooked.** -Submitting patches: -------------------- +Submitting patches +================== Q: To which mailing list do I need to submit my BPF patches? - +------------------------------------------------------------ A: Please submit your BPF patches to the netdev kernel mailing list: - netdev@vger.kernel.org + netdev@vger.kernel.org - Historically, BPF came out of networking and has always been maintained - by the kernel networking community. Although these days BPF touches - many other subsystems as well, the patches are still routed mainly - through the networking community. +Historically, BPF came out of networking and has always been maintained +by the kernel networking community. Although these days BPF touches +many other subsystems as well, the patches are still routed mainly +through the networking community. - In case your patch has changes in various different subsystems (e.g. - tracing, security, etc), make sure to Cc the related kernel mailing - lists and maintainers from there as well, so they are able to review - the changes and provide their Acked-by's to the patches. +In case your patch has changes in various different subsystems (e.g. +tracing, security, etc), make sure to Cc the related kernel mailing +lists and maintainers from there as well, so they are able to review +the changes and provide their Acked-by's to the patches. Q: Where can I find patches currently under discussion for BPF subsystem? - +------------------------------------------------------------------------- A: All patches that are Cc'ed to netdev are queued for review under netdev - patchwork project: +patchwork project: - http://patchwork.ozlabs.org/project/netdev/list/ + http://patchwork.ozlabs.org/project/netdev/list/ - Those patches which target BPF, are assigned to a 'bpf' delegate for - further processing from BPF maintainers. 
The current queue with - patches under review can be found at: +Those patches which target BPF, are assigned to a 'bpf' delegate for +further processing from BPF maintainers. The current queue with +patches under review can be found at: - https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 + https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147 - Once the patches have been reviewed by the BPF community as a whole - and approved by the BPF maintainers, their status in patchwork will be - changed to 'Accepted' and the submitter will be notified by mail. This - means that the patches look good from a BPF perspective and have been - applied to one of the two BPF kernel trees. +Once the patches have been reviewed by the BPF community as a whole +and approved by the BPF maintainers, their status in patchwork will be +changed to 'Accepted' and the submitter will be notified by mail. This +means that the patches look good from a BPF perspective and have been +applied to one of the two BPF kernel trees. - In case feedback from the community requires a respin of the patches, - their status in patchwork will be set to 'Changes Requested', and purged - from the current review queue. Likewise for cases where patches would - get rejected or are not applicable to the BPF trees (but assigned to - the 'bpf' delegate). +In case feedback from the community requires a respin of the patches, +their status in patchwork will be set to 'Changes Requested', and purged +from the current review queue. Likewise for cases where patches would +get rejected or are not applicable to the BPF trees (but assigned to +the 'bpf' delegate). Q: How do the changes make their way into Linux? - +------------------------------------------------ A: There are two BPF kernel trees (git repositories). Once patches have - been accepted by the BPF maintainers, they will be applied to one - of the two BPF trees: +been accepted by the BPF maintainers, they will be applied to one +of the two BPF trees: - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ - https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/ + * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ - The bpf tree itself is for fixes only, whereas bpf-next for features, - cleanups or other kind of improvements ("next-like" content). This is - analogous to net and net-next trees for networking. Both bpf and - bpf-next will only have a master branch in order to simplify against - which branch patches should get rebased to. +The bpf tree itself is for fixes only, whereas bpf-next for features, +cleanups or other kind of improvements ("next-like" content). This is +analogous to net and net-next trees for networking. Both bpf and +bpf-next will only have a master branch in order to simplify against +which branch patches should get rebased to. - Accumulated BPF patches in the bpf tree will regularly get pulled - into the net kernel tree. Likewise, accumulated BPF patches accepted - into the bpf-next tree will make their way into net-next tree. net and - net-next are both run by David S. Miller. From there, they will go - into the kernel mainline tree run by Linus Torvalds. To read up on the - process of net and net-next being merged into the mainline tree, see - the netdev FAQ under: +Accumulated BPF patches in the bpf tree will regularly get pulled +into the net kernel tree. 
Likewise, accumulated BPF patches accepted +into the bpf-next tree will make their way into net-next tree. net and +net-next are both run by David S. Miller. From there, they will go +into the kernel mainline tree run by Linus Torvalds. To read up on the +process of net and net-next being merged into the mainline tree, see +the `netdev FAQ`_ under: - Documentation/networking/netdev-FAQ.txt + `Documentation/networking/netdev-FAQ.txt`_ - Occasionally, to prevent merge conflicts, we might send pull requests - to other trees (e.g. tracing) with a small subset of the patches, but - net and net-next are always the main trees targeted for integration. +Occasionally, to prevent merge conflicts, we might send pull requests +to other trees (e.g. tracing) with a small subset of the patches, but +net and net-next are always the main trees targeted for integration. - The pull requests will contain a high-level summary of the accumulated - patches and can be searched on netdev kernel mailing list through the - following subject lines (yyyy-mm-dd is the date of the pull request): +The pull requests will contain a high-level summary of the accumulated +patches and can be searched on netdev kernel mailing list through the +following subject lines (``yyyy-mm-dd`` is the date of the pull +request):: - pull-request: bpf yyyy-mm-dd - pull-request: bpf-next yyyy-mm-dd + pull-request: bpf yyyy-mm-dd + pull-request: bpf-next yyyy-mm-dd -Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be - applied to? +Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to? +--------------------------------------------------------------------------------- -A: The process is the very same as described in the netdev FAQ, so - please read up on it. The subject line must indicate whether the - patch is a fix or rather "next-like" content in order to let the - maintainers know whether it is targeted at bpf or bpf-next. +A: The process is the very same as described in the `netdev FAQ`_, so +please read up on it. The subject line must indicate whether the +patch is a fix or rather "next-like" content in order to let the +maintainers know whether it is targeted at bpf or bpf-next. - For fixes eventually landing in bpf -> net tree, the subject must - look like: +For fixes eventually landing in bpf -> net tree, the subject must +look like:: - git format-patch --subject-prefix='PATCH bpf' start..finish + git format-patch --subject-prefix='PATCH bpf' start..finish - For features/improvements/etc that should eventually land in - bpf-next -> net-next, the subject must look like: +For features/improvements/etc that should eventually land in +bpf-next -> net-next, the subject must look like:: - git format-patch --subject-prefix='PATCH bpf-next' start..finish + git format-patch --subject-prefix='PATCH bpf-next' start..finish - If unsure whether the patch or patch series should go into bpf - or net directly, or bpf-next or net-next directly, it is not a - problem either if the subject line says net or net-next as target. - It is eventually up to the maintainers to do the delegation of - the patches. +If unsure whether the patch or patch series should go into bpf +or net directly, or bpf-next or net-next directly, it is not a +problem either if the subject line says net or net-next as target. +It is eventually up to the maintainers to do the delegation of +the patches. 
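
As a worked example (the series size and output directory below are
placeholders rather than values taken from this document), a two-patch
bpf-next series with its cover letter could be prepared and mailed as
follows:

    $ git format-patch --cover-letter \
          --subject-prefix='PATCH bpf-next' -2 -o outgoing/
    $ git send-email --to=netdev@vger.kernel.org outgoing/*.patch
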
- If it is clear that patches should go into bpf or bpf-next tree, - please make sure to rebase the patches against those trees in - order to reduce potential conflicts. +If it is clear that patches should go into bpf or bpf-next tree, +please make sure to rebase the patches against those trees in +order to reduce potential conflicts. - In case the patch or patch series has to be reworked and sent out - again in a second or later revision, it is also required to add a - version number (v2, v3, ...) into the subject prefix: +In case the patch or patch series has to be reworked and sent out +again in a second or later revision, it is also required to add a +version number (``v2``, ``v3``, ...) into the subject prefix:: - git format-patch --subject-prefix='PATCH net-next v2' start..finish + git format-patch --subject-prefix='PATCH net-next v2' start..finish - When changes have been requested to the patch series, always send the - whole patch series again with the feedback incorporated (never send - individual diffs on top of the old series). +When changes have been requested to the patch series, always send the +whole patch series again with the feedback incorporated (never send +individual diffs on top of the old series). Q: What does it mean when a patch gets applied to bpf or bpf-next tree? - +----------------------------------------------------------------------- A: It means that the patch looks good for mainline inclusion from - a BPF point of view. - - Be aware that this is not a final verdict that the patch will - automatically get accepted into net or net-next trees eventually: - - On the netdev kernel mailing list reviews can come in at any point - in time. If discussions around a patch conclude that they cannot - get included as-is, we will either apply a follow-up fix or drop - them from the trees entirely. Therefore, we also reserve to rebase - the trees when deemed necessary. After all, the purpose of the tree - is to i) accumulate and stage BPF patches for integration into trees - like net and net-next, and ii) run extensive BPF test suite and - workloads on the patches before they make their way any further. - - Once the BPF pull request was accepted by David S. Miller, then - the patches end up in net or net-next tree, respectively, and - make their way from there further into mainline. Again, see the - netdev FAQ for additional information e.g. on how often they are - merged to mainline. +a BPF point of view. -Q: How long do I need to wait for feedback on my BPF patches? +Be aware that this is not a final verdict that the patch will +automatically get accepted into net or net-next trees eventually: + +On the netdev kernel mailing list reviews can come in at any point +in time. If discussions around a patch conclude that they cannot +get included as-is, we will either apply a follow-up fix or drop +them from the trees entirely. Therefore, we also reserve to rebase +the trees when deemed necessary. After all, the purpose of the tree +is to: + +i) accumulate and stage BPF patches for integration into trees + like net and net-next, and +ii) run extensive BPF test suite and + workloads on the patches before they make their way any further. + +Once the BPF pull request was accepted by David S. Miller, then +the patches end up in net or net-next tree, respectively, and +make their way from there further into mainline. Again, see the +`netdev FAQ`_ for additional information e.g. on how often they are +merged to mainline. 
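
One way to follow this flow is to watch the trees directly; the commands
below are illustrative, and 'Your Name' is a placeholder for the author
name to search for:

    $ git remote add bpf-next \
          https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
    $ git fetch bpf-next
    $ git log --oneline --author='Your Name' bpf-next/master
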
+ +Q: How long do I need to wait for feedback on my BPF patches? +------------------------------------------------------------- A: We try to keep the latency low. The usual time to feedback will - be around 2 or 3 business days. It may vary depending on the - complexity of changes and current patch load. +be around 2 or 3 business days. It may vary depending on the +complexity of changes and current patch load. -Q: How often do you send pull requests to major kernel trees like - net or net-next? +Q: How often do you send pull requests to major kernel trees like net or net-next? +---------------------------------------------------------------------------------- A: Pull requests will be sent out rather often in order to not - accumulate too many patches in bpf or bpf-next. +accumulate too many patches in bpf or bpf-next. - As a rule of thumb, expect pull requests for each tree regularly - at the end of the week. In some cases pull requests could additionally - come also in the middle of the week depending on the current patch - load or urgency. +As a rule of thumb, expect pull requests for each tree regularly +at the end of the week. In some cases pull requests could additionally +come also in the middle of the week depending on the current patch +load or urgency. Q: Are patches applied to bpf-next when the merge window is open? - +----------------------------------------------------------------- A: For the time when the merge window is open, bpf-next will not be - processed. This is roughly analogous to net-next patch processing, - so feel free to read up on the netdev FAQ about further details. +processed. This is roughly analogous to net-next patch processing, +so feel free to read up on the `netdev FAQ`_ about further details. - During those two weeks of merge window, we might ask you to resend - your patch series once bpf-next is open again. Once Linus released - a v*-rc1 after the merge window, we continue processing of bpf-next. +During those two weeks of merge window, we might ask you to resend +your patch series once bpf-next is open again. Once Linus released +a ``v*-rc1`` after the merge window, we continue processing of bpf-next. - For non-subscribers to kernel mailing lists, there is also a status - page run by David S. Miller on net-next that provides guidance: +For non-subscribers to kernel mailing lists, there is also a status +page run by David S. Miller on net-next that provides guidance: - http://vger.kernel.org/~davem/net-next.html + http://vger.kernel.org/~davem/net-next.html +Q: Verifier changes and test cases +---------------------------------- Q: I made a BPF verifier change, do I need to add test cases for - BPF kernel selftests? +BPF kernel selftests_? A: If the patch has changes to the behavior of the verifier, then yes, - it is absolutely necessary to add test cases to the BPF kernel - selftests suite. If they are not present and we think they are - needed, then we might ask for them before accepting any changes. - - In particular, test_verifier.c is tracking a high number of BPF test - cases, including a lot of corner cases that LLVM BPF back end may - generate out of the restricted C code. Thus, adding test cases is - absolutely crucial to make sure future changes do not accidentally - affect prior use-cases. Thus, treat those test cases as: verifier - behavior that is not tracked in test_verifier.c could potentially - be subject to change. - -Q: When should I add code to samples/bpf/ and when to BPF kernel - selftests? 
- -A: In general, we prefer additions to BPF kernel selftests rather than - samples/bpf/. The rationale is very simple: kernel selftests are - regularly run by various bots to test for kernel regressions. - - The more test cases we add to BPF selftests, the better the coverage - and the less likely it is that those could accidentally break. It is - not that BPF kernel selftests cannot demo how a specific feature can - be used. - - That said, samples/bpf/ may be a good place for people to get started, - so it might be advisable that simple demos of features could go into - samples/bpf/, but advanced functional and corner-case testing rather - into kernel selftests. - - If your sample looks like a test case, then go for BPF kernel selftests - instead! +it is absolutely necessary to add test cases to the BPF kernel +selftests_ suite. If they are not present and we think they are +needed, then we might ask for them before accepting any changes. + +In particular, test_verifier.c is tracking a high number of BPF test +cases, including a lot of corner cases that LLVM BPF back end may +generate out of the restricted C code. Thus, adding test cases is +absolutely crucial to make sure future changes do not accidentally +affect prior use-cases. Thus, treat those test cases as: verifier +behavior that is not tracked in test_verifier.c could potentially +be subject to change. + +Q: samples/bpf preference vs selftests? +--------------------------------------- +Q: When should I add code to `samples/bpf/`_ and when to BPF kernel +selftests_ ? + +A: In general, we prefer additions to BPF kernel selftests_ rather than +`samples/bpf/`_. The rationale is very simple: kernel selftests are +regularly run by various bots to test for kernel regressions. + +The more test cases we add to BPF selftests, the better the coverage +and the less likely it is that those could accidentally break. It is +not that BPF kernel selftests cannot demo how a specific feature can +be used. + +That said, `samples/bpf/`_ may be a good place for people to get started, +so it might be advisable that simple demos of features could go into +`samples/bpf/`_, but advanced functional and corner-case testing rather +into kernel selftests. + +If your sample looks like a test case, then go for BPF kernel selftests +instead! Q: When should I add code to the bpftool? - +----------------------------------------- A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide - a central user space tool for debugging and introspection of BPF programs - and maps that are active in the kernel. If UAPI changes related to BPF - enable for dumping additional information of programs or maps, then - bpftool should be extended as well to support dumping them. +a central user space tool for debugging and introspection of BPF programs +and maps that are active in the kernel. If UAPI changes related to BPF +enable for dumping additional information of programs or maps, then +bpftool should be extended as well to support dumping them. Q: When should I add code to iproute2's BPF loader? - -A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the - convention is that those control-path related changes are added to - iproute2's BPF loader as well from user space side. This is not only - useful to have UAPI changes properly designed to be usable, but also - to make those changes available to a wider user base of major - downstream distributions. 
+--------------------------------------------------- +A: For UAPI changes related to the XDP or tc layer (e.g. ``cls_bpf``), +the convention is that those control-path related changes are added to +iproute2's BPF loader as well from user space side. This is not only +useful to have UAPI changes properly designed to be usable, but also +to make those changes available to a wider user base of major +downstream distributions. Q: Do you accept patches as well for iproute2's BPF loader? - +----------------------------------------------------------- A: Patches for the iproute2's BPF loader have to be sent to: - netdev@vger.kernel.org + netdev@vger.kernel.org - While those patches are not processed by the BPF kernel maintainers, - please keep them in Cc as well, so they can be reviewed. +While those patches are not processed by the BPF kernel maintainers, +please keep them in Cc as well, so they can be reviewed. - The official git repository for iproute2 is run by Stephen Hemminger - and can be found at: +The official git repository for iproute2 is run by Stephen Hemminger +and can be found at: - https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/ + https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/ - The patches need to have a subject prefix of '[PATCH iproute2 master]' - or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the - target branch where the patch should be applied to. Meaning, if kernel - changes went into the net-next kernel tree, then the related iproute2 - changes need to go into the iproute2 net-next branch, otherwise they - can be targeted at master branch. The iproute2 net-next branch will get - merged into the master branch after the current iproute2 version from - master has been released. +The patches need to have a subject prefix of '``[PATCH iproute2 +master]``' or '``[PATCH iproute2 net-next]``'. '``master``' or +'``net-next``' describes the target branch where the patch should be +applied to. Meaning, if kernel changes went into the net-next kernel +tree, then the related iproute2 changes need to go into the iproute2 +net-next branch, otherwise they can be targeted at master branch. The +iproute2 net-next branch will get merged into the master branch after +the current iproute2 version from master has been released. - Like BPF, the patches end up in patchwork under the netdev project and - are delegated to 'shemminger' for further processing: +Like BPF, the patches end up in patchwork under the netdev project and +are delegated to 'shemminger' for further processing: - http://patchwork.ozlabs.org/project/netdev/list/?delegate=389 + http://patchwork.ozlabs.org/project/netdev/list/?delegate=389 Q: What is the minimum requirement before I submit my BPF patches? - +------------------------------------------------------------------ A: When submitting patches, always take the time and properly test your - patches *prior* to submission. Never rush them! If maintainers find - that your patches have not been properly tested, it is a good way to - get them grumpy. Testing patch submissions is a hard requirement! - - Note, fixes that go to bpf tree *must* have a Fixes: tag included. The - same applies to fixes that target bpf-next, where the affected commit - is in net-next (or in some cases bpf-next). The Fixes: tag is crucial - in order to identify follow-up commits and tremendously helps for people - having to do backporting, so it is a must have! - - We also don't accept patches with an empty commit message. 
Take your - time and properly write up a high quality commit message, it is - essential! - - Think about it this way: other developers looking at your code a month - from now need to understand *why* a certain change has been done that - way, and whether there have been flaws in the analysis or assumptions - that the original author did. Thus providing a proper rationale and - describing the use-case for the changes is a must. - - Patch submissions with >1 patch must have a cover letter which includes - a high level description of the series. This high level summary will - then be placed into the merge commit by the BPF maintainers such that - it is also accessible from the git log for future reference. - +patches *prior* to submission. Never rush them! If maintainers find +that your patches have not been properly tested, it is a good way to +get them grumpy. Testing patch submissions is a hard requirement! + +Note, fixes that go to bpf tree *must* have a ``Fixes:`` tag included. +The same applies to fixes that target bpf-next, where the affected +commit is in net-next (or in some cases bpf-next). The ``Fixes:`` tag is +crucial in order to identify follow-up commits and tremendously helps +for people having to do backporting, so it is a must have! + +We also don't accept patches with an empty commit message. Take your +time and properly write up a high quality commit message, it is +essential! + +Think about it this way: other developers looking at your code a month +from now need to understand *why* a certain change has been done that +way, and whether there have been flaws in the analysis or assumptions +that the original author did. Thus providing a proper rationale and +describing the use-case for the changes is a must. + +Patch submissions with >1 patch must have a cover letter which includes +a high level description of the series. This high level summary will +then be placed into the merge commit by the BPF maintainers such that +it is also accessible from the git log for future reference. + +Q: Features changing BPF JIT and/or LLVM +---------------------------------------- Q: What do I need to consider when adding a new instruction or feature - that would require BPF JIT and/or LLVM integration as well? +that would require BPF JIT and/or LLVM integration as well? A: We try hard to keep all BPF JITs up to date such that the same user - experience can be guaranteed when running BPF programs on different - architectures without having the program punt to the less efficient - interpreter in case the in-kernel BPF JIT is enabled. +experience can be guaranteed when running BPF programs on different +architectures without having the program punt to the less efficient +interpreter in case the in-kernel BPF JIT is enabled. - If you are unable to implement or test the required JIT changes for - certain architectures, please work together with the related BPF JIT - developers in order to get the feature implemented in a timely manner. - Please refer to the git log (arch/*/net/) to locate the necessary - people for helping out. +If you are unable to implement or test the required JIT changes for +certain architectures, please work together with the related BPF JIT +developers in order to get the feature implemented in a timely manner. +Please refer to the git log (``arch/*/net/``) to locate the necessary +people for helping out. - Also always make sure to add BPF test cases (e.g. 
test_bpf.c and - test_verifier.c) for new instructions, so that they can receive - broad test coverage and help run-time testing the various BPF JITs. +Also always make sure to add BPF test cases (e.g. test_bpf.c and +test_verifier.c) for new instructions, so that they can receive +broad test coverage and help run-time testing the various BPF JITs. - In case of new BPF instructions, once the changes have been accepted - into the Linux kernel, please implement support into LLVM's BPF back - end. See LLVM section below for further information. +In case of new BPF instructions, once the changes have been accepted +into the Linux kernel, please implement support into LLVM's BPF back +end. See LLVM_ section below for further information. -Stable submission: ------------------- +Stable submission +================= Q: I need a specific BPF commit in stable kernels. What should I do? - +-------------------------------------------------------------------- A: In case you need a specific fix in stable kernels, first check whether - the commit has already been applied in the related linux-*.y branches: +the commit has already been applied in the related ``linux-*.y`` branches: - https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/ + https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/ - If not the case, then drop an email to the BPF maintainers with the - netdev kernel mailing list in Cc and ask for the fix to be queued up: +If not the case, then drop an email to the BPF maintainers with the +netdev kernel mailing list in Cc and ask for the fix to be queued up: - netdev@vger.kernel.org + netdev@vger.kernel.org - The process in general is the same as on netdev itself, see also the - netdev FAQ document. +The process in general is the same as on netdev itself, see also the +`netdev FAQ`_ document. Q: Do you also backport to kernels not currently maintained as stable? - +---------------------------------------------------------------------- A: No. If you need a specific BPF commit in kernels that are currently not - maintained by the stable maintainers, then you are on your own. +maintained by the stable maintainers, then you are on your own. - The current stable and longterm stable kernels are all listed here: +The current stable and longterm stable kernels are all listed here: - https://www.kernel.org/ + https://www.kernel.org/ -Q: The BPF patch I am about to submit needs to go to stable as well. What - should I do? +Q: The BPF patch I am about to submit needs to go to stable as well +------------------------------------------------------------------- +What should I do? A: The same rules apply as with netdev patch submissions in general, see - netdev FAQ under: +`netdev FAQ`_ under: - Documentation/networking/netdev-FAQ.txt + `Documentation/networking/netdev-FAQ.txt`_ - Never add "Cc: stable@vger.kernel.org" to the patch description, but - ask the BPF maintainers to queue the patches instead. This can be done - with a note, for example, under the "---" part of the patch which does - not go into the git log. Alternatively, this can be done as a simple - request by mail instead. +Never add "``Cc: stable@vger.kernel.org``" to the patch description, but +ask the BPF maintainers to queue the patches instead. This can be done +with a note, for example, under the ``---`` part of the patch which does +not go into the git log. Alternatively, this can be done as a simple +request by mail instead. 
+Q: Queue stable patches +----------------------- Q: Where do I find currently queued BPF patches that will be submitted - to stable? +to stable? A: Once patches that fix critical bugs got applied into the bpf tree, they - are queued up for stable submission under: +are queued up for stable submission under: - http://patchwork.ozlabs.org/bundle/bpf/stable/?state=* + http://patchwork.ozlabs.org/bundle/bpf/stable/?state=* - They will be on hold there at minimum until the related commit made its - way into the mainline kernel tree. +They will be on hold there at minimum until the related commit made its +way into the mainline kernel tree. - After having been under broader exposure, the queued patches will be - submitted by the BPF maintainers to the stable maintainers. +After having been under broader exposure, the queued patches will be +submitted by the BPF maintainers to the stable maintainers. -Testing patches: ----------------- +Testing patches +=============== Q: Which BPF kernel selftests version should I run my kernel against? +--------------------------------------------------------------------- +A: If you run a kernel ``xyz``, then always run the BPF kernel selftests +from that kernel ``xyz`` as well. Do not expect that the BPF selftest +from the latest mainline tree will pass all the time. -A: If you run a kernel xyz, then always run the BPF kernel selftests from - that kernel xyz as well. Do not expect that the BPF selftest from the - latest mainline tree will pass all the time. - - In particular, test_bpf.c and test_verifier.c have a large number of - test cases and are constantly updated with new BPF test sequences, or - existing ones are adapted to verifier changes e.g. due to verifier - becoming smarter and being able to better track certain things. +In particular, test_bpf.c and test_verifier.c have a large number of +test cases and are constantly updated with new BPF test sequences, or +existing ones are adapted to verifier changes e.g. due to verifier +becoming smarter and being able to better track certain things. -LLVM: ------ +LLVM +==== Q: Where do I find LLVM with BPF support? - +----------------------------------------- A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1. - All major distributions these days ship LLVM with BPF back end enabled, - so for the majority of use-cases it is not required to compile LLVM by - hand anymore, just install the distribution provided package. +All major distributions these days ship LLVM with BPF back end enabled, +so for the majority of use-cases it is not required to compile LLVM by +hand anymore, just install the distribution provided package. - LLVM's static compiler lists the supported targets through 'llc --version', - make sure BPF targets are listed. Example: +LLVM's static compiler lists the supported targets through +``llc --version``, make sure BPF targets are listed. Example:: $ llc --version LLVM (http://llvm.org/): @@ -434,18 +456,18 @@ A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1. x86 - 32-bit X86: Pentium-Pro and above x86-64 - 64-bit X86: EM64T and AMD64 - For developers in order to utilize the latest features added to LLVM's - BPF back end, it is advisable to run the latest LLVM releases. Support - for new BPF kernel features such as additions to the BPF instruction - set are often developed together. +For developers in order to utilize the latest features added to LLVM's +BPF back end, it is advisable to run the latest LLVM releases. 
Support +for new BPF kernel features such as additions to the BPF instruction +set are often developed together. - All LLVM releases can be found at: http://releases.llvm.org/ +All LLVM releases can be found at: http://releases.llvm.org/ Q: Got it, so how do I build LLVM manually anyway? - +-------------------------------------------------- A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have - that set up, proceed with building the latest LLVM and clang version - from the git repositories: +that set up, proceed with building the latest LLVM and clang version +from the git repositories:: $ git clone http://llvm.org/git/llvm.git $ cd llvm/tools @@ -457,44 +479,51 @@ A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have -DLLVM_BUILD_RUNTIME=OFF $ make -j $(getconf _NPROCESSORS_ONLN) - The built binaries can then be found in the build/bin/ directory, where - you can point the PATH variable to. +The built binaries can then be found in the build/bin/ directory, where +you can point the PATH variable to. +Q: Reporting LLVM BPF issues +---------------------------- Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code - generation back end or about LLVM generated code that the verifier - refuses to accept? +generation back end or about LLVM generated code that the verifier +refuses to accept? + +A: Yes, please do! -A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF - infrastructure and it ties deeply into verification of programs from the - kernel side. Therefore, any issues on either side need to be investigated - and fixed whenever necessary. +LLVM's BPF back end is a key piece of the whole BPF +infrastructure and it ties deeply into verification of programs from the +kernel side. Therefore, any issues on either side need to be investigated +and fixed whenever necessary. - Therefore, please make sure to bring them up at netdev kernel mailing - list and Cc BPF maintainers for LLVM and kernel bits: +Therefore, please make sure to bring them up at netdev kernel mailing +list and Cc BPF maintainers for LLVM and kernel bits: - Yonghong Song - Alexei Starovoitov - Daniel Borkmann +* Yonghong Song +* Alexei Starovoitov +* Daniel Borkmann - LLVM also has an issue tracker where BPF related bugs can be found: +LLVM also has an issue tracker where BPF related bugs can be found: - https://bugs.llvm.org/buglist.cgi?quicksearch=bpf + https://bugs.llvm.org/buglist.cgi?quicksearch=bpf - However, it is better to reach out through mailing lists with having - maintainers in Cc. +However, it is better to reach out through mailing lists with having +maintainers in Cc. +Q: New BPF instruction for kernel and LLVM +------------------------------------------ Q: I have added a new BPF instruction to the kernel, how can I integrate - it into LLVM? +it into LLVM? -A: LLVM has a -mcpu selector for the BPF back end in order to allow the - selection of BPF instruction set extensions. By default the 'generic' - processor target is used, which is the base instruction set (v1) of BPF. +A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow +the selection of BPF instruction set extensions. By default the +``generic`` processor target is used, which is the base instruction set +(v1) of BPF. - LLVM has an option to select -mcpu=probe where it will probe the host - kernel for supported BPF instruction set extensions and selects the - optimal set automatically. 
+LLVM has an option to select ``-mcpu=probe`` where it will probe the host +kernel for supported BPF instruction set extensions and selects the +optimal set automatically. - For cross-compilation, a specific version can be select manually as well. +For cross-compilation, a specific version can be select manually as well :: $ llc -march bpf -mcpu=help Available CPUs for this target: @@ -505,66 +534,78 @@ A: LLVM has a -mcpu selector for the BPF back end in order to allow the v2 - Select the v2 processor. [...] - Newly added BPF instructions to the Linux kernel need to follow the same - scheme, bump the instruction set version and implement probing for the - extensions such that -mcpu=probe users can benefit from the optimization - transparently when upgrading their kernels. +Newly added BPF instructions to the Linux kernel need to follow the same +scheme, bump the instruction set version and implement probing for the +extensions such that ``-mcpu=probe`` users can benefit from the +optimization transparently when upgrading their kernels. - If you are unable to implement support for the newly added BPF instruction - please reach out to BPF developers for help. +If you are unable to implement support for the newly added BPF instruction +please reach out to BPF developers for help. - By the way, the BPF kernel selftests run with -mcpu=probe for better - test coverage. +By the way, the BPF kernel selftests run with ``-mcpu=probe`` for better +test coverage. -Q: In some cases clang flag "-target bpf" is used but in other cases the - default clang target, which matches the underlying architecture, is used. - What is the difference and when I should use which? +Q: clang flag for target bpf? +----------------------------- +Q: In some cases clang flag ``-target bpf`` is used but in other cases the +default clang target, which matches the underlying architecture, is used. +What is the difference and when I should use which? A: Although LLVM IR generation and optimization try to stay architecture - independent, "-target " still has some impact on generated code: - - - BPF program may recursively include header file(s) with file scope - inline assembly codes. The default target can handle this well, - while bpf target may fail if bpf backend assembler does not - understand these assembly codes, which is true in most cases. - - - When compiled without -g, additional elf sections, e.g., - .eh_frame and .rela.eh_frame, may be present in the object file - with default target, but not with bpf target. - - - The default target may turn a C switch statement into a switch table - lookup and jump operation. Since the switch table is placed - in the global readonly section, the bpf program will fail to load. - The bpf target does not support switch table optimization. - The clang option "-fno-jump-tables" can be used to disable - switch table generation. - - - For clang -target bpf, it is guaranteed that pointer or long / - unsigned long types will always have a width of 64 bit, no matter - whether underlying clang binary or default target (or kernel) is - 32 bit. However, when native clang target is used, then it will - compile these types based on the underlying architecture's conventions, - meaning in case of 32 bit architecture, pointer or long / unsigned - long types e.g. in BPF context structure will have width of 32 bit - while the BPF LLVM back end still operates in 64 bit. 
-     target is mostly needed in tracing for the case of walking pt_regs
-     or other kernel structures where CPU's register width matters.
-     Otherwise, clang -target bpf is generally recommended.
-
-   You should use default target when:
-
-   - Your program includes a header file, e.g., ptrace.h, which eventually
-     pulls in some header files containing file scope host assembly codes.
-
-   - You can add "-fno-jump-tables" to work around the switch table issue.
-
-   Otherwise, you can use bpf target. Additionally, you _must_ use bpf target
-   when:
-
-   - Your program uses data structures with pointer or long / unsigned long
-     types that interface with BPF helpers or context data structures. Access
-     into these structures is verified by the BPF verifier and may result
-     in verification failures if the native architecture is not aligned with
-     the BPF architecture, e.g. 64-bit. An example of this is
-     BPF_PROG_TYPE_SK_MSG require '-target bpf'
+independent, ``-target <arch>`` still has some impact on generated code:
+
+- BPF program may recursively include header file(s) with file scope
+  inline assembly codes. The default target can handle this well,
+  while ``bpf`` target may fail if bpf backend assembler does not
+  understand these assembly codes, which is true in most cases.
+
+- When compiled without ``-g``, additional elf sections, e.g.,
+  .eh_frame and .rela.eh_frame, may be present in the object file
+  with default target, but not with ``bpf`` target.
+
+- The default target may turn a C switch statement into a switch table
+  lookup and jump operation. Since the switch table is placed
+  in the global readonly section, the bpf program will fail to load.
+  The bpf target does not support switch table optimization.
+  The clang option ``-fno-jump-tables`` can be used to disable
+  switch table generation.
+
+- For clang ``-target bpf``, it is guaranteed that pointer or long /
+  unsigned long types will always have a width of 64 bit, no matter
+  whether underlying clang binary or default target (or kernel) is
+  32 bit. However, when native clang target is used, then it will
+  compile these types based on the underlying architecture's conventions,
+  meaning in case of 32 bit architecture, pointer or long / unsigned
+  long types e.g. in BPF context structure will have width of 32 bit
+  while the BPF LLVM back end still operates in 64 bit. The native
+  target is mostly needed in tracing for the case of walking ``pt_regs``
+  or other kernel structures where CPU's register width matters.
+  Otherwise, ``clang -target bpf`` is generally recommended.
+
+You should use default target when:
+
+- Your program includes a header file, e.g., ptrace.h, which eventually
+  pulls in some header files containing file scope host assembly codes.
+
+- You can add ``-fno-jump-tables`` to work around the switch table issue.
+
+Otherwise, you can use ``bpf`` target. Additionally, you *must* use bpf target
+when:
+
+- Your program uses data structures with pointer or long / unsigned long
+  types that interface with BPF helpers or context data structures. Access
+  into these structures is verified by the BPF verifier and may result
+  in verification failures if the native architecture is not aligned with
+  the BPF architecture, e.g. 64-bit. An example of this is
+  BPF_PROG_TYPE_SK_MSG, which requires ``-target bpf``.
+
+
+.. Links
+.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
+.. _MAINTAINERS: ../../MAINTAINERS
+.. _Documentation/networking/netdev-FAQ.txt: ../networking/netdev-FAQ.txt
+.. _netdev FAQ: ../networking/netdev-FAQ.txt
+.. _samples/bpf/: ../../samples/bpf/
+.. _selftests: ../../tools/testing/selftests/bpf/

 Happy BPF hacking!
--
cgit v1.2.3
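A quick illustration of the ``-target bpf`` point from the patch above: a
minimal sketch of a program where the target choice matters. The file name
and section name are made up for the example, and the build line assumes an
LLVM/clang with BPF support::

  /* bpf_prog.c - illustrative sketch only.
   *
   * Build for the BPF target:
   *   $ clang -O2 -target bpf -c bpf_prog.c -o bpf_prog.o
   *
   * With -target bpf, the unsigned long below is guaranteed to be 64 bit
   * even if the clang binary or the host kernel is 32 bit; with the
   * native target it would follow the host ABI instead.
   */
  #include <linux/bpf.h>

  __attribute__((section("socket"), used))
  int bpf_prog(struct __sk_buff *skb)
  {
  	unsigned long mark = skb->mark;   /* 64 bit with -target bpf */

  	return mark == 42 ? skb->len : 0; /* keep marked packets only */
  }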
From b7a27c3aafa252d1e416c223cb97c123de4ed28a Mon Sep 17 00:00:00 2001
From: Jesper Dangaard Brouer
Date: Mon, 14 May 2018 15:42:32 +0200
Subject: bpf, doc: howto use/run the BPF selftests

I always forget how to run the BPF selftests. Thus, let's add that info
to the QA document.

Documentation was based on Cilium's documentation:
http://cilium.readthedocs.io/en/latest/bpf/#verifying-the-setup

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Alexei Starovoitov
---
 Documentation/bpf/bpf_devel_QA.rst | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
index 2254bdeae990..0e7c1d946e83 100644
--- a/Documentation/bpf/bpf_devel_QA.rst
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -417,6 +417,33 @@ submitted by the BPF maintainers to the stable maintainers.

 Testing patches
 ===============

+Q: How to run BPF selftests
+---------------------------
+A: After you have booted into the newly compiled kernel, navigate to
+the BPF selftests_ suite in order to test BPF functionality (current
+working directory points to the root of the cloned git tree)::
+
+  $ cd tools/testing/selftests/bpf/
+  $ make
+
+To run the verifier tests::
+
+  $ sudo ./test_verifier
+
+The verifier tests print out all the current checks being
+performed. The summary at the end of running all tests will dump
+information of test successes and failures::
+
+  Summary: 418 PASSED, 0 FAILED
+
+In order to run through all BPF selftests, the following command is
+needed::
+
+  $ sudo make run_tests
+
+See the kernel's selftest `Documentation/dev-tools/kselftest.rst`_
+document for further details.
+
 Q: Which BPF kernel selftests version should I run my kernel against?
 ---------------------------------------------------------------------
 A: If you run a kernel ``xyz``, then always run the BPF kernel selftests
@@ -607,5 +634,7 @@ when:
 .. _netdev FAQ: ../networking/netdev-FAQ.txt
 .. _samples/bpf/: ../../samples/bpf/
 .. _selftests: ../../tools/testing/selftests/bpf/
+.. _Documentation/dev-tools/kselftest.rst:
+   https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html

 Happy BPF hacking!
--
cgit v1.2.3


From cd1436a26718b2c33a290e5db24d1507887626e6 Mon Sep 17 00:00:00 2001
From: Alexandre Belloni
Date: Mon, 14 May 2018 22:04:54 +0200
Subject: dt-bindings: net: add DT bindings for Microsemi MIIM

DT bindings for the Microsemi MII Management Controller found on
Microsemi SoCs.

Reviewed-by: Florian Fainelli
Reviewed-by: Rob Herring
Signed-off-by: Alexandre Belloni
Reviewed-by: Andrew Lunn
Signed-off-by: David S. Miller
---
 .../devicetree/bindings/net/mscc-miim.txt | 26 ++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/mscc-miim.txt

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/mscc-miim.txt b/Documentation/devicetree/bindings/net/mscc-miim.txt
new file mode 100644
index 000000000000..7104679cf59d
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mscc-miim.txt
@@ -0,0 +1,26 @@
+Microsemi MII Management Controller (MIIM) / MDIO
+=================================================
+
+Properties:
+- compatible: must be "mscc,ocelot-miim"
+- reg: The base address of the MDIO bus controller register bank. Optionally, a
+  second register bank can be defined if there is an associated reset register
+  for internal PHYs.
+- #address-cells: Must be <1>.
+- #size-cells: Must be <0>. MDIO addresses have no size component.
+- interrupts: interrupt specifier (refer to the interrupt binding)
+
+Typically an MDIO bus might have several children.
+
+Example:
+	mdio@107009c {
+		#address-cells = <1>;
+		#size-cells = <0>;
+		compatible = "mscc,ocelot-miim";
+		reg = <0x107009c 0x36>, <0x10700f0 0x8>;
+		interrupts = <14>;
+
+		phy0: ethernet-phy@0 {
+			reg = <0>;
+		};
+	};
--
cgit v1.2.3


From 44b801e0f0970cd8ce124cb0e292108a455fac30 Mon Sep 17 00:00:00 2001
From: Alexandre Belloni
Date: Mon, 14 May 2018 22:04:56 +0200
Subject: dt-bindings: net: add DT bindings for Microsemi Ocelot Switch

DT bindings for the Ethernet switch found on Microsemi Ocelot platforms.

Reviewed-by: Rob Herring
Signed-off-by: Alexandre Belloni
Reviewed-by: Andrew Lunn
Reviewed-by: Florian Fainelli
Signed-off-by: David S. Miller
---
 .../devicetree/bindings/net/mscc-ocelot.txt | 82 ++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/mscc-ocelot.txt

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/mscc-ocelot.txt b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
new file mode 100644
index 000000000000..0a84711abece
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mscc-ocelot.txt
@@ -0,0 +1,82 @@
+Microsemi Ocelot network Switch
+===============================
+
+The Microsemi Ocelot network switch can be found on Microsemi SoCs (VSC7513,
+VSC7514).
+
+Required properties:
+- compatible: Should be "mscc,vsc7514-switch"
+- reg: Must contain an (offset, length) pair of the register set for each
+  entry in reg-names.
+- reg-names: Must include the following entries:
+  - "sys"
+  - "rew"
+  - "qs"
+  - "hsio"
+  - "qsys"
+  - "ana"
+  - "portX" with X from 0 to the number of last port index available on that
+    switch
+- interrupts: Should contain the switch interrupts for frame extraction and
+  frame injection
+- interrupt-names: should contain the interrupt names: "xtr", "inj"
+- ethernet-ports: A container for child nodes representing switch ports.
+
+The ethernet-ports container has the following properties:
+
+Required properties:
+
+- #address-cells: Must be 1
+- #size-cells: Must be 0
+
+Each port node must have the following mandatory properties:
+- reg: Describes the port address in the switch
+
+Port nodes may also contain the following optional standardised
+properties, described in binding documents:
+
+- phy-handle: Phandle to a PHY on an MDIO bus. See
+  Documentation/devicetree/bindings/net/ethernet.txt for details.
+
+Example:
+
+	switch@1010000 {
+		compatible = "mscc,vsc7514-switch";
+		reg = <0x1010000 0x10000>,
+		      <0x1030000 0x10000>,
+		      <0x1080000 0x100>,
+		      <0x10d0000 0x10000>,
+		      <0x11e0000 0x100>,
+		      <0x11f0000 0x100>,
+		      <0x1200000 0x100>,
+		      <0x1210000 0x100>,
+		      <0x1220000 0x100>,
+		      <0x1230000 0x100>,
+		      <0x1240000 0x100>,
+		      <0x1250000 0x100>,
+		      <0x1260000 0x100>,
+		      <0x1270000 0x100>,
+		      <0x1280000 0x100>,
+		      <0x1800000 0x80000>,
+		      <0x1880000 0x10000>;
+		reg-names = "sys", "rew", "qs", "hsio", "port0",
+			    "port1", "port2", "port3", "port4", "port5",
+			    "port6", "port7", "port8", "port9", "port10",
+			    "qsys", "ana";
+		interrupts = <21 22>;
+		interrupt-names = "xtr", "inj";
+
+		ethernet-ports {
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			port0: port@0 {
+				reg = <0>;
+				phy-handle = <&phy0>;
+			};
+			port1: port@1 {
+				reg = <1>;
+				phy-handle = <&phy1>;
+			};
+		};
+	};
--
cgit v1.2.3


From 1386c36b30388f46a95100924bfcae75160db715 Mon Sep 17 00:00:00 2001
From: Debabrata Banerjee
Date: Mon, 14 May 2018 14:48:10 -0400
Subject: bonding: allow carrier and link status to determine link state

In a mixed environment it may be difficult to tell whether your hardware
supports carrier; if it does not, it can always report true. With a new
use_carrier option of 2, we can check both carrier and link status
sequentially, instead of one or the other.

Signed-off-by: Debabrata Banerjee
Signed-off-by: David S. Miller
---
 Documentation/networking/bonding.txt | 4 ++--
 drivers/net/bonding/bond_main.c | 12 ++++++++----
 drivers/net/bonding/bond_options.c | 7 ++++---
 3 files changed, 14 insertions(+), 9 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index c13214d073a4..86d07fbb592d 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -828,8 +828,8 @@ use_carrier
	MII / ETHTOOL ioctl method to determine the link state.

	A value of 1 enables the use of netif_carrier_ok(), a value of
-	0 will use the deprecated MII / ETHTOOL ioctls.  The default
-	value is 1.
+	0 will use the deprecated MII / ETHTOOL ioctls.  A value of 2
+	will check both.  The default value is 1.

 xmit_hash_policy

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index a4cd7f6bfd4d..e4c253dc7dfb 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
			    "in milliseconds");
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
-			      "0 for off, 1 for on (default)");
+			      "0 for off, 1 for on (default), 2 for carrier then legacy checks");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
		       "1 for active-backup, 2 for balance-xor, "
@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
	int (*ioctl)(struct net_device *, struct ifreq *, int);
	struct ifreq ifr;
	struct mii_ioctl_data *mii;
+	bool carrier = true;

	if (!reporting && !netif_running(slave_dev))
		return 0;

	if (bond->params.use_carrier)
-		return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+		carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
+
+	if (!carrier)
+		return carrier;

	/* Try to get link status using Ethtool first.
*/ if (slave_dev->ethtool_ops->get_link) @@ -4403,8 +4407,8 @@ static int bond_check_params(struct bond_params *params) downdelay = 0; } - if ((use_carrier != 0) && (use_carrier != 1)) { - pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0/1), so it was set to 1\n", + if (use_carrier < 0 || use_carrier > 2) { + pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0-2), so it was set to 1\n", use_carrier); use_carrier = 1; } diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 8a945c9341d6..dba6cef05134 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -164,9 +164,10 @@ static const struct bond_opt_value bond_primary_reselect_tbl[] = { }; static const struct bond_opt_value bond_use_carrier_tbl[] = { - { "off", 0, 0}, - { "on", 1, BOND_VALFLAG_DEFAULT}, - { NULL, -1, 0} + { "off", 0, 0}, + { "on", 1, BOND_VALFLAG_DEFAULT}, + { "both", 2, 0}, + { NULL, -1, 0} }; static const struct bond_opt_value bond_all_slaves_active_tbl[] = { -- cgit v1.2.3 From b3c898e20b1881b0876c3e811c58b039b37dd5fd Mon Sep 17 00:00:00 2001 From: Debabrata Banerjee Date: Wed, 16 May 2018 14:02:13 -0400 Subject: Revert "bonding: allow carrier and link status to determine link state" This reverts commit 1386c36b30388f46a95100924bfcae75160db715. We don't want to encourage drivers to not report carrier status correctly, therefore remove this commit. Signed-off-by: Debabrata Banerjee Signed-off-by: David S. Miller --- Documentation/networking/bonding.txt | 4 ++-- drivers/net/bonding/bond_main.c | 12 ++++-------- drivers/net/bonding/bond_options.c | 7 +++---- 3 files changed, 9 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 86d07fbb592d..c13214d073a4 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -828,8 +828,8 @@ use_carrier MII / ETHTOOL ioctl method to determine the link state. A value of 1 enables the use of netif_carrier_ok(), a value of - 0 will use the deprecated MII / ETHTOOL ioctls. A value of 2 - will check both. The default value is 1. + 0 will use the deprecated MII / ETHTOOL ioctls. The default + value is 1. xmit_hash_policy diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e4c253dc7dfb..a4cd7f6bfd4d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, " "in milliseconds"); module_param(use_carrier, int, 0); MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; " - "0 for off, 1 for on (default), 2 for carrier then legacy checks"); + "0 for off, 1 for on (default)"); module_param(mode, charp, 0); MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, " "1 for active-backup, 2 for balance-xor, " @@ -434,16 +434,12 @@ static int bond_check_dev_link(struct bonding *bond, int (*ioctl)(struct net_device *, struct ifreq *, int); struct ifreq ifr; struct mii_ioctl_data *mii; - bool carrier = true; if (!reporting && !netif_running(slave_dev)) return 0; if (bond->params.use_carrier) - carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; - - if (!carrier) - return carrier; + return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; /* Try to get link status using Ethtool first. 
*/ if (slave_dev->ethtool_ops->get_link) @@ -4407,8 +4403,8 @@ static int bond_check_params(struct bond_params *params) downdelay = 0; } - if (use_carrier < 0 || use_carrier > 2) { - pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0-2), so it was set to 1\n", + if ((use_carrier != 0) && (use_carrier != 1)) { + pr_warn("Warning: use_carrier module parameter (%d), not of valid value (0/1), so it was set to 1\n", use_carrier); use_carrier = 1; } diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index dba6cef05134..8a945c9341d6 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -164,10 +164,9 @@ static const struct bond_opt_value bond_primary_reselect_tbl[] = { }; static const struct bond_opt_value bond_use_carrier_tbl[] = { - { "off", 0, 0}, - { "on", 1, BOND_VALFLAG_DEFAULT}, - { "both", 2, 0}, - { NULL, -1, 0} + { "off", 0, 0}, + { "on", 1, BOND_VALFLAG_DEFAULT}, + { NULL, -1, 0} }; static const struct bond_opt_value bond_all_slaves_active_tbl[] = { -- cgit v1.2.3 From 20b654dfe1beaca60ab51894ff405a049248433d Mon Sep 17 00:00:00 2001 From: Yuchung Cheng Date: Wed, 16 May 2018 16:40:10 -0700 Subject: tcp: support DUPACK threshold in RACK This patch adds support for the classic DUPACK threshold rule (#DupThresh) in RACK. When the number of packets SACKed is greater or equal to the threshold, RACK sets the reordering window to zero which would immediately mark all the unsacked packets below the highest SACKed sequence lost. Since this approach is known to not work well with reordering, RACK only uses it if no reordering has been observed. The DUPACK threshold rule is a particularly useful extension to the fast recoveries triggered by RACK reordering timer. For example data-center transfers where the RTT is much smaller than a timer tick, or high RTT path where the default RTT/4 may take too long. Note that this patch differs slightly from RFC6675. RFC6675 considers a packet lost when at least #DupThresh higher-sequence packets are SACKed. With RACK, for connections that have seen reordering, RACK continues to use a dynamically-adaptive time-based reordering window to detect losses. But for connections on which we have not yet seen reordering, this patch considers a packet lost when at least one higher sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection has not seen reordering, and sends 10 packets, and packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost. There is some small risk of spurious retransmits here due to reordering. However, this is mostly limited to the first flight of a connection on which the sender receives SACKs from reordering. And RFC 6675 and FACK loss detection have a similar risk on the first flight with reordering (it's just that the risk of spurious retransmits from reordering was slightly narrower for those older algorithms due to the margin of 3*MSS). Also the minimum reordering window is reduced from 1 msec to 0 to recover quicker on short RTT transfers. Therefore RACK is more aggressive in marking packets lost during recovery to reduce the reordering window timeouts. Signed-off-by: Yuchung Cheng Signed-off-by: Neal Cardwell Reviewed-by: Eric Dumazet Reviewed-by: Soheil Hassas Yeganeh Reviewed-by: Priyaranjan Jha Signed-off-by: David S. 
Miller --- Documentation/networking/ip-sysctl.txt | 1 + include/net/tcp.h | 1 + net/ipv4/tcp_recovery.c | 40 +++++++++++++++++++++++----------- 3 files changed, 29 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 59afc9a10b4f..13bbac50dc8b 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -451,6 +451,7 @@ tcp_recovery - INTEGER RACK: 0x1 enables the RACK loss detection for fast detection of lost retransmissions and tail drops. RACK: 0x2 makes RACK's reordering window static (min_rtt/4). + RACK: 0x4 disables RACK's DUPACK threshold heuristic Default: 0x1 diff --git a/include/net/tcp.h b/include/net/tcp.h index a08eab58ef70..994f869d793c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -245,6 +245,7 @@ extern long sysctl_tcp_mem[3]; #define TCP_RACK_LOSS_DETECTION 0x1 /* Use RACK to detect losses */ #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */ +#define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */ extern atomic_long_t tcp_memory_allocated; extern struct percpu_counter tcp_sockets_allocated; diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c index 3a81720ac0c4..1c1bdf12a96f 100644 --- a/net/ipv4/tcp_recovery.c +++ b/net/ipv4/tcp_recovery.c @@ -21,6 +21,32 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2) return t1 > t2 || (t1 == t2 && after(seq1, seq2)); } +u32 tcp_rack_reo_wnd(const struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (!tp->rack.reord) { + /* If reordering has not been observed, be aggressive during + * the recovery or starting the recovery by DUPACK threshold. + */ + if (inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery) + return 0; + + if (tp->sacked_out >= tp->reordering && + !(sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_NO_DUPTHRESH)) + return 0; + } + + /* To be more reordering resilient, allow min_rtt/4 settling delay. + * Use min_rtt instead of the smoothed RTT because reordering is + * often a path property and less related to queuing or delayed ACKs. + * Upon receiving DSACKs, linearly increase the window up to the + * smoothed RTT. + */ + return min((tcp_min_rtt(tp) >> 2) * tp->rack.reo_wnd_steps, + tp->srtt_us >> 3); +} + /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01): * * Marks a packet lost, if some packet sent later has been (s)acked. @@ -44,23 +70,11 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2) static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout) { struct tcp_sock *tp = tcp_sk(sk); - u32 min_rtt = tcp_min_rtt(tp); struct sk_buff *skb, *n; u32 reo_wnd; *reo_timeout = 0; - /* To be more reordering resilient, allow min_rtt/4 settling delay - * (lower-bounded to 1000uS). We use min_rtt instead of the smoothed - * RTT because reordering is often a path property and less related - * to queuing or delayed ACKs. 
-	 */
-	reo_wnd = 1000;
-	if ((tp->rack.reord || inet_csk(sk)->icsk_ca_state < TCP_CA_Recovery) &&
-	    min_rtt != ~0U) {
-		reo_wnd = max((min_rtt >> 2) * tp->rack.reo_wnd_steps, reo_wnd);
-		reo_wnd = min(reo_wnd, tp->srtt_us >> 3);
-	}
-
+	reo_wnd = tcp_rack_reo_wnd(sk);
	list_for_each_entry_safe(skb, n, &tp->tsorted_sent_queue,
				 tcp_tsorted_anchor) {
		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
--
cgit v1.2.3


From b38a51fec1c1f693f03b1aa19d0622123634d4b7 Mon Sep 17 00:00:00 2001
From: Yuchung Cheng
Date: Wed, 16 May 2018 16:40:11 -0700
Subject: tcp: disable RFC6675 loss detection

This patch disables RFC6675 loss detection and makes the sysctl
net.ipv4.tcp_recovery = 1 control a binary choice between RACK (1) and
RFC6675 (0).

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 3 ++-
 net/ipv4/tcp_input.c | 12 ++++++++----
 2 files changed, 10 insertions(+), 5 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 13bbac50dc8b..ea304a23c8d7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -449,7 +449,8 @@ tcp_recovery - INTEGER
	features.

	RACK: 0x1 enables the RACK loss detection for fast detection of lost
-	      retransmissions and tail drops.
+	      retransmissions and tail drops.  It also subsumes and disables
+	      RFC6675 recovery for SACK connections.
	RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
	RACK: 0x4 disables RACK's DUPACK threshold heuristic

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd..ccbe04f80040 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2035,6 +2035,11 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
	return tp->sacked_out + 1;
 }

+static bool tcp_is_rack(const struct sock *sk)
+{
+	return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* Linux NewReno/SACK/ECN state machine.
  * --------------------------------------
  *
@@ -2141,7 +2146,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
		return true;

	/* Not-A-Trick#2 : Classic rule... */
-	if (tcp_dupack_heuristics(tp) > tp->reordering)
+	if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
		return true;

	return false;
@@ -2722,8 +2727,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
	struct tcp_sock *tp = tcp_sk(sk);

-	/* Use RACK to detect loss */
-	if (sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
+	if (tcp_is_rack(sk)) {
		u32 prior_retrans = tp->retrans_out;

		tcp_rack_mark_lost(sk);
@@ -2862,7 +2866,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
		fast_rexmit = 1;
	}

-	if (do_lost)
+	if (!tcp_is_rack(sk) && do_lost)
		tcp_update_scoreboard(sk, fast_rexmit);
	*rexmit = REXMIT_LOST;
 }
--
cgit v1.2.3
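The tcp_recovery bits touched by the two patches above combine
independently; a small user-space sketch of how a tool might decode a given
value (the helper name and output strings are made up for the example)::

  #include <stdio.h>

  /* bit values from include/net/tcp.h as of the patches above */
  #define TCP_RACK_LOSS_DETECTION	0x1
  #define TCP_RACK_STATIC_REO_WND	0x2
  #define TCP_RACK_NO_DUPTHRESH		0x4

  /* hypothetical helper, not a kernel API */
  static void decode_tcp_recovery(int val)
  {
  	printf("loss detection: %s\n",
  	       val & TCP_RACK_LOSS_DETECTION ? "RACK" : "RFC6675");
  	if (val & TCP_RACK_STATIC_REO_WND)
  		printf("RACK reordering window: static min_rtt/4\n");
  	if (val & TCP_RACK_NO_DUPTHRESH)
  		printf("RACK DUPACK threshold heuristic: disabled\n");
  }

  int main(void)
  {
  	decode_tcp_recovery(0x1);	/* the default */
  	return 0;
  }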
From 52b0900f6fb4dfd92bdf1db3e8e380f572549272 Mon Sep 17 00:00:00 2001
From: Thierry Escande
Date: Thu, 29 Mar 2018 21:15:23 +0200
Subject: dt-bindings: net: bluetooth: Add qualcomm-bluetooth

Add binding document for serial bluetooth chips using Qualcomm protocol.

Signed-off-by: Thierry Escande
Reviewed-by: Rob Herring
Signed-off-by: Marcel Holtmann
---
 .../devicetree/bindings/net/qualcomm-bluetooth.txt | 30 ++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt
new file mode 100644
index 000000000000..0ea18a53cc29
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/qualcomm-bluetooth.txt
@@ -0,0 +1,30 @@
+Qualcomm Bluetooth Chips
+------------------------
+
+This documents the binding structure and common properties for serial
+attached Qualcomm devices.
+
+Serial attached Qualcomm devices shall be child nodes of the host UART
+device they are attached to.
+
+Required properties:
+ - compatible: should contain one of the following:
+   * "qcom,qca6174-bt"
+
+Optional properties:
+ - enable-gpios: gpio specifier used to enable chip
+ - clocks: clock provided to the controller (SUSCLK_32KHZ)
+
+Example:
+
+serial@7570000 {
+	label = "BT-UART";
+	status = "okay";
+
+	bluetooth {
+		compatible = "qcom,qca6174-bt";
+
+		enable-gpios = <&pm8994_gpios 19 GPIO_ACTIVE_HIGH>;
+		clocks = <&divclk4>;
+	};
+};
--
cgit v1.2.3


From 6d82aa242092d73c6d2e210cfaf0ebfbe6de1ccf Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Thu, 17 May 2018 14:47:28 -0700
Subject: tcp: add tcp_comp_sack_delay_ns sysctl

This per netns sysctl allows for TCP SACK compression fine-tuning.

Its default value is 1,000,000, or 1 ms, to match the TSO autosizing
period.

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 7 +++++++
 include/net/netns/ipv4.h | 1 +
 net/ipv4/sysctl_net_ipv4.c | 7 +++++++
 net/ipv4/tcp_input.c | 4 ++--
 net/ipv4/tcp_ipv4.c | 1 +
 5 files changed, 18 insertions(+), 2 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ea304a23c8d7..7ba952959bca 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -525,6 +525,13 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
 tcp_sack - BOOLEAN
	Enable select acknowledgments (SACKS).

+tcp_comp_sack_delay_ns - LONG INTEGER
+	TCP tries to reduce the number of SACKs sent, using a timer
+	based on 5% of the SRTT, capped by this sysctl (in nanoseconds).
+	The default is 1 ms, based on the TSO autosizing period.
+
+	Default : 1,000,000 ns (1 ms)
+
 tcp_slow_start_after_idle - BOOLEAN
	If set, provide RFC2861 behavior and time out the congestion
	window after an idle period. An idle period is defined at
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8491bc9c86b1..927318243cfa 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
	int sysctl_tcp_pacing_ca_ratio;
	int sysctl_tcp_wmem[3];
	int sysctl_tcp_rmem[3];
+	unsigned long sysctl_tcp_comp_sack_delay_ns;
	struct inet_timewait_death_row tcp_death_row;
	int sysctl_max_syn_backlog;
	int sysctl_tcp_fastopen;

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4b195bac8ac0..11fbfdc1566e 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1151,6 +1151,13 @@ static struct ctl_table ipv4_net_table[] = {
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &one,
	},
+	{
+		.procname	= "tcp_comp_sack_delay_ns",
+		.data		= &init_net.ipv4.sysctl_tcp_comp_sack_delay_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
	{
		.procname	= "udp_rmem_min",
		.data		= &init_net.ipv4.sysctl_udp_rmem_min,

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc2ac5346b92..6a1dae38c955 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5113,13 +5113,13 @@ send_now:
	if (hrtimer_is_queued(&tp->compressed_ack_timer))
		return;

-	/* compress ack timer : 5 % of rtt, but no more than 1 ms */
+	/* compress ack timer : 5 % of rtt, but no more than tcp_comp_sack_delay_ns */
	rtt = tp->rcv_rtt_est.rtt_us;
	if (tp->srtt_us && tp->srtt_us < rtt)
		rtt = tp->srtt_us;

-	delay = min_t(unsigned long, NSEC_PER_MSEC,
+	delay = min_t(unsigned long, sock_net(sk)->ipv4.sysctl_tcp_comp_sack_delay_ns,
		      rtt * (NSEC_PER_USEC >> 3)/20);
	sock_hold(sk);
	hrtimer_start(&tp->compressed_ack_timer, ns_to_ktime(delay),

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index caf23de88f8a..a3f4647341db 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2572,6 +2572,7 @@ static int __net_init tcp_sk_init(struct net *net)
		       init_net.ipv4.sysctl_tcp_wmem,
		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
	}
+	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
--
cgit v1.2.3
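To make the fixed-point units in the hunk above concrete, a small
user-space sketch of the same min(5% of RTT, sysctl cap) computation
(illustrative only, not kernel code; the kernel stores srtt_us as
microseconds left-shifted by 3)::

  #include <stdio.h>

  #define NSEC_PER_USEC 1000UL

  /* srtt_us mirrors the kernel field: smoothed RTT in usec << 3 */
  static unsigned long comp_sack_delay_ns(unsigned long srtt_us,
  					  unsigned long sysctl_delay_ns)
  {
  	/* (usec << 3) * (1000 >> 3) yields nanoseconds; /20 keeps 5% */
  	unsigned long five_pct_rtt_ns = srtt_us * (NSEC_PER_USEC >> 3) / 20;

  	return five_pct_rtt_ns < sysctl_delay_ns ? five_pct_rtt_ns
  						 : sysctl_delay_ns;
  }

  int main(void)
  {
  	/* a 10 ms smoothed RTT is stored as 10000 << 3 */
  	printf("%lu ns\n", comp_sack_delay_ns(10000UL << 3, 1000000UL));
  	/* prints 500000: 5% of 10 ms, under the default 1 ms cap */
  	return 0;
  }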
From 9c21d2fc41c0c8930600c9dd83eadac2e336fcfa Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Thu, 17 May 2018 14:47:29 -0700
Subject: tcp: add tcp_comp_sack_nr sysctl

This per netns sysctl allows for TCP SACK compression fine-tuning.

This limits the number of SACKs that can be compressed.
Using 0 disables SACK compression.

Signed-off-by: Eric Dumazet
Acked-by: Neal Cardwell
Signed-off-by: David S. Miller
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h | 1 +
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
 net/ipv4/tcp_input.c | 3 ++-
 net/ipv4/tcp_ipv4.c | 1 +
 5 files changed, 20 insertions(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 7ba952959bca..924bd51327b7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -532,6 +532,12 @@ tcp_comp_sack_delay_ns - LONG INTEGER

	Default : 1,000,000 ns (1 ms)

+tcp_comp_sack_nr - INTEGER
+	Max number of SACKs that can be compressed.
+	Using 0 disables SACK compression.
+
+	Default : 44
+
 tcp_slow_start_after_idle - BOOLEAN
	If set, provide RFC2861 behavior and time out the congestion
	window after an idle period. An idle period is defined at

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 927318243cfa..661348f23ea5 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -160,6 +160,7 @@ struct netns_ipv4 {
	int sysctl_tcp_pacing_ca_ratio;
	int sysctl_tcp_wmem[3];
	int sysctl_tcp_rmem[3];
+	int sysctl_tcp_comp_sack_nr;
	unsigned long sysctl_tcp_comp_sack_delay_ns;
	struct inet_timewait_death_row tcp_death_row;
	int sysctl_max_syn_backlog;

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 11fbfdc1566e..d2eed3ddcb0a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -46,6 +46,7 @@ static int tcp_syn_retries_min = 1;
 static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
+static int comp_sack_nr_max = 255;

 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1158,6 +1159,15 @@ static struct ctl_table ipv4_net_table[] = {
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
+	{
+		.procname	= "tcp_comp_sack_nr",
+		.data		= &init_net.ipv4.sysctl_tcp_comp_sack_nr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &comp_sack_nr_max,
+	},
	{
		.procname	= "udp_rmem_min",
		.data		= &init_net.ipv4.sysctl_udp_rmem_min,

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6a1dae38c955..aebb29ab2fdf 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5106,7 +5106,8 @@ send_now:
		return;
	}

-	if (!tcp_is_sack(tp) || tp->compressed_ack >= 44)
+	if (!tcp_is_sack(tp) ||
+	    tp->compressed_ack >= sock_net(sk)->ipv4.sysctl_tcp_comp_sack_nr)
		goto send_now;

	tp->compressed_ack++;

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f4647341db..adbdb503db0c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2573,6 +2573,7 @@ static int __net_init tcp_sk_init(struct net *net)
		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
	}
	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
+	net->ipv4.sysctl_tcp_comp_sack_nr = 44;
	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
--
cgit v1.2.3


From 3eb9c2ad1db04913041b78e0b5e543527128b90b Mon Sep 17 00:00:00 2001
From: Sergei Shtylyov
Date: Fri, 18 May 2018 21:32:46 +0300
Subject: sh_eth: add R8A77980 support

Finally, add support for the DT probing of the R-Car V3H (AKA R8A77980) --
it's the only R-Car gen3 SoC having the GEther controller -- others have
only EtherAVB...

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov
Signed-off-by: Sergei Shtylyov
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller
---
 Documentation/devicetree/bindings/net/sh_eth.txt | 1 +
 drivers/net/ethernet/renesas/sh_eth.c | 44 ++++++++++++++++++++++
 2 files changed, 45 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/sh_eth.txt b/Documentation/devicetree/bindings/net/sh_eth.txt
index 5172799a7f1a..82a4cf2c145d 100644
--- a/Documentation/devicetree/bindings/net/sh_eth.txt
+++ b/Documentation/devicetree/bindings/net/sh_eth.txt
@@ -14,6 +14,7 @@ Required properties:
	      "renesas,ether-r8a7791" if the device is a part of R8A7791 SoC.
	      "renesas,ether-r8a7793" if the device is a part of R8A7793 SoC.
	      "renesas,ether-r8a7794" if the device is a part of R8A7794 SoC.
+ "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC. "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC. "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device. "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1 diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c index 800c196510eb..83148ca61317 100644 --- a/drivers/net/ethernet/renesas/sh_eth.c +++ b/drivers/net/ethernet/renesas/sh_eth.c @@ -753,6 +753,49 @@ static struct sh_eth_cpu_data rcar_gen2_data = { .rmiimode = 1, .magic = 1, }; + +/* R8A77980 */ +static struct sh_eth_cpu_data r8a77980_data = { + .soft_reset = sh_eth_soft_reset_gether, + + .set_duplex = sh_eth_set_duplex, + .set_rate = sh_eth_set_rate_gether, + + .register_type = SH_ETH_REG_GIGABIT, + + .edtrr_trns = EDTRR_TRNS_GETHER, + .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD, + .ecsipr_value = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP | + ECSIPR_MPDIP, + .eesipr_value = EESIPR_RFCOFIP | EESIPR_ECIIP | + EESIPR_FTCIP | EESIPR_TDEIP | EESIPR_TFUFIP | + EESIPR_FRIP | EESIPR_RDEIP | EESIPR_RFOFIP | + EESIPR_RMAFIP | EESIPR_RRFIP | + EESIPR_RTLFIP | EESIPR_RTSFIP | + EESIPR_PREIP | EESIPR_CERFIP, + + .tx_check = EESR_FTC | EESR_CD | EESR_RTO, + .eesr_err_check = EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT | + EESR_RFE | EESR_RDE | EESR_RFRMER | + EESR_TFE | EESR_TDE | EESR_ECI, + .fdr_value = 0x0000070f, + + .apr = 1, + .mpr = 1, + .tpauser = 1, + .bculr = 1, + .hw_swap = 1, + .nbst = 1, + .rpadir = 1, + .rpadir_value = 2 << 16, + .no_trimd = 1, + .no_ade = 1, + .xdfar_rw = 1, + .hw_checksum = 1, + .select_mii = 1, + .magic = 1, + .cexcr = 1, +}; #endif /* CONFIG_OF */ static void sh_eth_set_rate_sh7724(struct net_device *ndev) @@ -3134,6 +3177,7 @@ static const struct of_device_id sh_eth_match_table[] = { { .compatible = "renesas,ether-r8a7791", .data = &rcar_gen2_data }, { .compatible = "renesas,ether-r8a7793", .data = &rcar_gen2_data }, { .compatible = "renesas,ether-r8a7794", .data = &rcar_gen2_data }, + { .compatible = "renesas,gether-r8a77980", .data = &r8a77980_data }, { .compatible = "renesas,ether-r7s72100", .data = &r7s72100_data }, { .compatible = "renesas,rcar-gen1-ether", .data = &rcar_gen1_data }, { .compatible = "renesas,rcar-gen2-ether", .data = &rcar_gen2_data }, -- cgit v1.2.3 From 3e48439347d56cd1b3d3d07a93bed2b0c6b129fe Mon Sep 17 00:00:00 2001 From: Antoine Tenart Date: Tue, 22 May 2018 12:18:01 +0200 Subject: Documentation/bindings: net: the sfp i2c-bus property is now mandatory The i2c-bus property for sfp modules was made mandatory. Update the documentation to keep it in sync with the driver's behaviour. Signed-off-by: Antoine Tenart Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/sff,sfp.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/sff,sfp.txt b/Documentation/devicetree/bindings/net/sff,sfp.txt index 929591d52ed6..832139919f20 100644 --- a/Documentation/devicetree/bindings/net/sff,sfp.txt +++ b/Documentation/devicetree/bindings/net/sff,sfp.txt @@ -7,11 +7,11 @@ Required properties: "sff,sfp" for SFP modules "sff,sff" for soldered down SFF modules -Optional Properties: - - i2c-bus : phandle of an I2C bus controller for the SFP two wire serial interface +Optional Properties: + - mod-def0-gpios : GPIO phandle and a specifier of the MOD-DEF0 (AKA Mod_ABS) module presence input gpio signal, active (module absent) high. 
Must not be present for SFF modules
--
cgit v1.2.3


From 218bbea11a777c156eb7bcbdc72867b32ae10985 Mon Sep 17 00:00:00 2001
From: Michal Vokáč
Date: Wed, 23 May 2018 08:20:18 +0200
Subject: net: dsa: qca8k: Add QCA8334 binding documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add support for the four-port variant of the Qualcomm QCA833x switch.
The CPU port default link settings can be reconfigured using a
fixed-link sub-node.

Signed-off-by: Michal Vokáč
Reviewed-by: Rob Herring
Reviewed-by: Andrew Lunn
Reviewed-by: Florian Fainelli
Signed-off-by: David S. Miller
---
 .../devicetree/bindings/net/dsa/qca8k.txt | 23 +++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
index 9c67ee4890d7..bbcb255c3150 100644
--- a/Documentation/devicetree/bindings/net/dsa/qca8k.txt
+++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
@@ -2,7 +2,10 @@

 Required properties:

-- compatible: should be "qca,qca8337"
+- compatible: should be one of:
+    "qca,qca8334"
+    "qca,qca8337"
+
 - #size-cells: must be 0
 - #address-cells: must be 1

@@ -14,6 +17,20 @@ port and PHY id, each subnode describing a port needs to have a valid phandle
 referencing the internal PHY connected to it.
 The CPU port of this switch is always port 0.

+A CPU port node has the following optional node:
+
+- fixed-link            : Fixed-link subnode describing a link to a non-MDIO
+                          managed entity. See
+                          Documentation/devicetree/bindings/net/fixed-link.txt
+                          for details.
+
+For QCA8K the 'fixed-link' sub-node supports only the following properties:
+
+- 'speed' (integer, mandatory), to indicate the link speed. Accepted
+  values are 10, 100 and 1000
+- 'full-duplex' (boolean, optional), to indicate that full duplex is
+  used. When absent, half duplex is assumed.
+
 Example:

@@ -53,6 +70,10 @@ Example:
				label = "cpu";
				ethernet = <&gmac1>;
				phy-mode = "rgmii";
+				fixed-link {
+					speed = <1000>;
+					full-duplex;
+				};
			};

			port@1 {
--
cgit v1.2.3


From 30c8bd5aa8b2c78546c3e52337101b9c85879320 Mon Sep 17 00:00:00 2001
From: Sridhar Samudrala
Date: Thu, 24 May 2018 09:55:13 -0700
Subject: net: Introduce generic failover module

The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.

This enables paravirtual drivers to use a VF as an accelerated low
latency datapath. It also allows migration of VMs with direct attached
VFs by failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala
Signed-off-by: David S.
Miller --- Documentation/networking/failover.rst | 18 ++ MAINTAINERS | 8 + include/linux/netdevice.h | 16 ++ include/net/failover.h | 36 ++++ net/Kconfig | 13 ++ net/core/Makefile | 1 + net/core/failover.c | 315 ++++++++++++++++++++++++++++++++++ 7 files changed, 407 insertions(+) create mode 100644 Documentation/networking/failover.rst create mode 100644 include/net/failover.h create mode 100644 net/core/failover.c (limited to 'Documentation') diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst new file mode 100644 index 000000000000..f0c8483cdbf5 --- /dev/null +++ b/Documentation/networking/failover.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======== +FAILOVER +======== + +Overview +======== + +The failover module provides a generic interface for paravirtual drivers +to register a netdev and a set of ops with a failover instance. The ops +are used as event handlers that get called to handle netdev register/ +unregister/link change/name change events on slave pci ethernet devices +with the same mac address as the failover netdev. + +This enables paravirtual drivers to use a VF as an accelerated low latency +datapath. It also allows live migration of VMs with direct attached VFs by +failing over to the paravirtual datapath when the VF is unplugged. diff --git a/MAINTAINERS b/MAINTAINERS index f492431b239b..6c59bdf49a8a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5411,6 +5411,14 @@ S: Maintained F: Documentation/hwmon/f71805f F: drivers/hwmon/f71805f.c +FAILOVER MODULE +M: Sridhar Samudrala +L: netdev@vger.kernel.org +S: Supported +F: net/core/failover.c +F: include/net/failover.h +F: Documentation/networking/failover.rst + FANOTIFY M: Jan Kara R: Amir Goldstein diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 8452f72087ef..f45b1a4e37ab 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1425,6 +1425,8 @@ struct net_device_ops { * entity (i.e. the master device for bridged veth) * @IFF_MACSEC: device is a MACsec device * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook + * @IFF_FAILOVER: device is a failover master device + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device */ enum netdev_priv_flags { IFF_802_1Q_VLAN = 1<<0, @@ -1454,6 +1456,8 @@ enum netdev_priv_flags { IFF_PHONY_HEADROOM = 1<<24, IFF_MACSEC = 1<<25, IFF_NO_RX_HANDLER = 1<<26, + IFF_FAILOVER = 1<<27, + IFF_FAILOVER_SLAVE = 1<<28, }; #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN @@ -1482,6 +1486,8 @@ enum netdev_priv_flags { #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED #define IFF_MACSEC IFF_MACSEC #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER +#define IFF_FAILOVER IFF_FAILOVER +#define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE /** * struct net_device - The DEVICE structure. 
@@ -4336,6 +4342,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev) return dev->priv_flags & IFF_RXFH_CONFIGURED; } +static inline bool netif_is_failover(const struct net_device *dev) +{ + return dev->priv_flags & IFF_FAILOVER; +} + +static inline bool netif_is_failover_slave(const struct net_device *dev) +{ + return dev->priv_flags & IFF_FAILOVER_SLAVE; +} + /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */ static inline void netif_keep_dst(struct net_device *dev) { diff --git a/include/net/failover.h b/include/net/failover.h new file mode 100644 index 000000000000..bb15438f39c7 --- /dev/null +++ b/include/net/failover.h @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2018, Intel Corporation. */ + +#ifndef _FAILOVER_H +#define _FAILOVER_H + +#include + +struct failover_ops { + int (*slave_pre_register)(struct net_device *slave_dev, + struct net_device *failover_dev); + int (*slave_register)(struct net_device *slave_dev, + struct net_device *failover_dev); + int (*slave_pre_unregister)(struct net_device *slave_dev, + struct net_device *failover_dev); + int (*slave_unregister)(struct net_device *slave_dev, + struct net_device *failover_dev); + int (*slave_link_change)(struct net_device *slave_dev, + struct net_device *failover_dev); + int (*slave_name_change)(struct net_device *slave_dev, + struct net_device *failover_dev); + rx_handler_result_t (*slave_handle_frame)(struct sk_buff **pskb); +}; + +struct failover { + struct list_head list; + struct net_device __rcu *failover_dev; + struct failover_ops __rcu *ops; +}; + +struct failover *failover_register(struct net_device *dev, + struct failover_ops *ops); +void failover_unregister(struct failover *failover); +int failover_slave_unregister(struct net_device *slave_dev); + +#endif /* _FAILOVER_H */ diff --git a/net/Kconfig b/net/Kconfig index ba554cedb615..f738a6f27665 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -432,6 +432,19 @@ config MAY_USE_DEVLINK config PAGE_POOL bool +config FAILOVER + tristate "Generic failover module" + help + The failover module provides a generic interface for paravirtual + drivers to register a netdev and a set of ops with a failover + instance. The ops are used as event handlers that get called to + handle netdev register/unregister/link change/name change events + on slave pci ethernet devices with the same mac address as the + failover netdev. This enables paravirtual drivers to use a + VF as an accelerated low latency datapath. It also allows live + migration of VMs with direct attached VFs by failing over to the + paravirtual datapath when the VF is unplugged. + endif # if NET # Used by archs to tell that they support BPF JIT compiler plus which flavour. diff --git a/net/core/Makefile b/net/core/Makefile index 7080417f8bc8..80175e6a2eb8 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -31,3 +31,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o obj-$(CONFIG_HWBM) += hwbm.o obj-$(CONFIG_NET_DEVLINK) += devlink.o obj-$(CONFIG_GRO_CELLS) += gro_cells.o +obj-$(CONFIG_FAILOVER) += failover.o diff --git a/net/core/failover.c b/net/core/failover.c new file mode 100644 index 000000000000..4a92a98ccce9 --- /dev/null +++ b/net/core/failover.c @@ -0,0 +1,315 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2018, Intel Corporation. */ + +/* A common module to handle registrations and notifications for paravirtual + * drivers to enable accelerated datapath and support VF live migration. 
+ * + * The notifier and event handling code is based on netvsc driver. + */ + +#include +#include +#include +#include +#include +#include + +static LIST_HEAD(failover_list); +static DEFINE_SPINLOCK(failover_lock); + +static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops) +{ + struct net_device *failover_dev; + struct failover *failover; + + spin_lock(&failover_lock); + list_for_each_entry(failover, &failover_list, list) { + failover_dev = rtnl_dereference(failover->failover_dev); + if (ether_addr_equal(failover_dev->perm_addr, mac)) { + *ops = rtnl_dereference(failover->ops); + spin_unlock(&failover_lock); + return failover_dev; + } + } + spin_unlock(&failover_lock); + return NULL; +} + +/** + * failover_slave_register - Register a slave netdev + * + * @slave_dev: slave netdev that is being registered + * + * Registers a slave device to a failover instance. Only ethernet devices + * are supported. + */ +static int failover_slave_register(struct net_device *slave_dev) +{ + struct netdev_lag_upper_info lag_upper_info; + struct net_device *failover_dev; + struct failover_ops *fops; + int err; + + if (slave_dev->type != ARPHRD_ETHER) + goto done; + + ASSERT_RTNL(); + + failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops); + if (!failover_dev) + goto done; + + if (fops && fops->slave_pre_register && + fops->slave_pre_register(slave_dev, failover_dev)) + goto done; + + err = netdev_rx_handler_register(slave_dev, fops->slave_handle_frame, + failover_dev); + if (err) { + netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n", + err); + goto done; + } + + lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP; + err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL, + &lag_upper_info, NULL); + if (err) { + netdev_err(slave_dev, "can not set failover device %s (err = %d)\n", + failover_dev->name, err); + goto err_upper_link; + } + + slave_dev->priv_flags |= IFF_FAILOVER_SLAVE; + + if (fops && fops->slave_register && + !fops->slave_register(slave_dev, failover_dev)) + return NOTIFY_OK; + + netdev_upper_dev_unlink(slave_dev, failover_dev); + slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE; +err_upper_link: + netdev_rx_handler_unregister(slave_dev); +done: + return NOTIFY_DONE; +} + +/** + * failover_slave_unregister - Unregister a slave netdev + * + * @slave_dev: slave netdev that is being unregistered + * + * Unregisters a slave device from a failover instance. 
+ */ +int failover_slave_unregister(struct net_device *slave_dev) +{ + struct net_device *failover_dev; + struct failover_ops *fops; + + if (!netif_is_failover_slave(slave_dev)) + goto done; + + ASSERT_RTNL(); + + failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops); + if (!failover_dev) + goto done; + + if (fops && fops->slave_pre_unregister && + fops->slave_pre_unregister(slave_dev, failover_dev)) + goto done; + + netdev_rx_handler_unregister(slave_dev); + netdev_upper_dev_unlink(slave_dev, failover_dev); + slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE; + + if (fops && fops->slave_unregister && + !fops->slave_unregister(slave_dev, failover_dev)) + return NOTIFY_OK; + +done: + return NOTIFY_DONE; +} +EXPORT_SYMBOL_GPL(failover_slave_unregister); + +static int failover_slave_link_change(struct net_device *slave_dev) +{ + struct net_device *failover_dev; + struct failover_ops *fops; + + if (!netif_is_failover_slave(slave_dev)) + goto done; + + ASSERT_RTNL(); + + failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops); + if (!failover_dev) + goto done; + + if (!netif_running(failover_dev)) + goto done; + + if (fops && fops->slave_link_change && + !fops->slave_link_change(slave_dev, failover_dev)) + return NOTIFY_OK; + +done: + return NOTIFY_DONE; +} + +static int failover_slave_name_change(struct net_device *slave_dev) +{ + struct net_device *failover_dev; + struct failover_ops *fops; + + if (!netif_is_failover_slave(slave_dev)) + goto done; + + ASSERT_RTNL(); + + failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops); + if (!failover_dev) + goto done; + + if (!netif_running(failover_dev)) + goto done; + + if (fops && fops->slave_name_change && + !fops->slave_name_change(slave_dev, failover_dev)) + return NOTIFY_OK; + +done: + return NOTIFY_DONE; +} + +static int +failover_event(struct notifier_block *this, unsigned long event, void *ptr) +{ + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); + + /* Skip parent events */ + if (netif_is_failover(event_dev)) + return NOTIFY_DONE; + + switch (event) { + case NETDEV_REGISTER: + return failover_slave_register(event_dev); + case NETDEV_UNREGISTER: + return failover_slave_unregister(event_dev); + case NETDEV_UP: + case NETDEV_DOWN: + case NETDEV_CHANGE: + return failover_slave_link_change(event_dev); + case NETDEV_CHANGENAME: + return failover_slave_name_change(event_dev); + default: + return NOTIFY_DONE; + } +} + +static struct notifier_block failover_notifier = { + .notifier_call = failover_event, +}; + +static void +failover_existing_slave_register(struct net_device *failover_dev) +{ + struct net *net = dev_net(failover_dev); + struct net_device *dev; + + rtnl_lock(); + for_each_netdev(net, dev) { + if (netif_is_failover(dev)) + continue; + if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr)) + failover_slave_register(dev); + } + rtnl_unlock(); +} + +/** + * failover_register - Register a failover instance + * + * @dev: failover netdev + * @ops: failover ops + * + * Allocate and register a failover instance for a failover netdev. ops + * provides handlers for slave device register/unregister/link change/ + * name change events. 
+ *
+ * Return: pointer to failover instance
+ */
+struct failover *failover_register(struct net_device *dev,
+				   struct failover_ops *ops)
+{
+	struct failover *failover;
+
+	if (dev->type != ARPHRD_ETHER)
+		return ERR_PTR(-EINVAL);
+
+	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+	if (!failover)
+		return ERR_PTR(-ENOMEM);
+
+	rcu_assign_pointer(failover->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_FAILOVER;
+	rcu_assign_pointer(failover->failover_dev, dev);
+
+	spin_lock(&failover_lock);
+	list_add_tail(&failover->list, &failover_list);
+	spin_unlock(&failover_lock);
+
+	netdev_info(dev, "failover master:%s registered\n", dev->name);
+
+	failover_existing_slave_register(dev);
+
+	return failover;
+}
+EXPORT_SYMBOL_GPL(failover_register);
+
+/**
+ * failover_unregister - Unregister a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters and frees a failover instance.
+ */
+void failover_unregister(struct failover *failover)
+{
+	struct net_device *failover_dev;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+
+	netdev_info(failover_dev, "failover master:%s unregistered\n",
+		    failover_dev->name);
+
+	failover_dev->priv_flags &= ~IFF_FAILOVER;
+	dev_put(failover_dev);
+
+	spin_lock(&failover_lock);
+	list_del(&failover->list);
+	spin_unlock(&failover_lock);
+
+	kfree(failover);
+}
+EXPORT_SYMBOL_GPL(failover_unregister);
+
+static __init int
+failover_init(void)
+{
+	register_netdevice_notifier(&failover_notifier);
+
+	return 0;
+}
+module_init(failover_init);
+
+static __exit
+void failover_exit(void)
+{
+	unregister_netdevice_notifier(&failover_notifier);
+}
+module_exit(failover_exit);
+
+MODULE_DESCRIPTION("Generic failover infrastructure/interface");
+MODULE_LICENSE("GPL v2");
--
cgit v1.2.3
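Before the next patch builds on this interface, a minimal sketch of how a
hypothetical paravirtual driver might consume it. The myvirt_* names are
made up; note that only .slave_handle_frame is mandatory, since
failover_slave_register() hands it straight to netdev_rx_handler_register()
while the other callbacks are NULL-checked before use::

  #include <linux/err.h>
  #include <linux/netdevice.h>
  #include <net/failover.h>

  struct myvirt_priv {
  	struct net_device *dev;		/* the paravirtual master netdev */
  	struct failover *failover;
  };

  static rx_handler_result_t myvirt_handle_frame(struct sk_buff **pskb)
  {
  	struct sk_buff *skb = *pskb;
  	/* rx_handler_data was set to the failover netdev at registration */
  	struct net_device *dev = rcu_dereference(skb->dev->rx_handler_data);

  	skb->dev = dev;		/* make VF traffic appear on the master */
  	return RX_HANDLER_ANOTHER;
  }

  static struct failover_ops myvirt_failover_ops = {
  	.slave_handle_frame = myvirt_handle_frame,
  };

  /* called from the driver's probe path once the master netdev exists */
  static int myvirt_enable_failover(struct myvirt_priv *priv)
  {
  	priv->failover = failover_register(priv->dev, &myvirt_failover_ops);

  	return PTR_ERR_OR_ZERO(priv->failover);
  }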
---
 Documentation/networking/net_failover.rst |  26 +
 MAINTAINERS                               |   8 +
 drivers/net/Kconfig                       |  12 +
 drivers/net/Makefile                      |   1 +
 drivers/net/net_failover.c                | 836 ++++++++++++++++++++++++++++++
 include/net/net_failover.h                |  40 ++
 6 files changed, 923 insertions(+)
 create mode 100644 Documentation/networking/net_failover.rst
 create mode 100644 drivers/net/net_failover.c
 create mode 100644 include/net/net_failover.h

(limited to 'Documentation')

diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
new file mode 100644
index 000000000000..d4513ad31809
--- /dev/null
+++ b/Documentation/networking/net_failover.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+NET_FAILOVER
+============
+
+Overview
+========
+
+The net_failover driver provides an automated failover mechanism via APIs
+to create and destroy a failover master netdev and manages primary and
+standby slave netdevs that get registered via the generic failover
+infrastructure.
+
+The failover netdev acts as a master device and controls 2 slave devices. The
+original paravirtual interface is registered as 'standby' slave netdev and
+a passthru/vf device with the same MAC gets registered as 'primary' slave
+netdev. Both 'standby' and 'failover' netdevs are associated with the same
+'pci' device. The user accesses the network interface via 'failover' netdev.
+The 'failover' netdev chooses 'primary' netdev as default for transmits when
+it is available with link up and running.
+
+This can be used by paravirtual drivers to enable an alternate low latency
+datapath. It also enables hypervisor controlled live migration of a VM with
+direct attached VF by failing over to the paravirtual datapath when the VF
+is unplugged.
diff --git a/MAINTAINERS b/MAINTAINERS
index 6c59bdf49a8a..1831ff5863a1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9654,6 +9654,14 @@ S:	Maintained
 F:	Documentation/hwmon/nct6775
 F:	drivers/hwmon/nct6775.c
 
+NET_FAILOVER MODULE
+M:	Sridhar Samudrala
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/net_failover.c
+F:	include/net/net_failover.h
+F:	Documentation/networking/net_failover.rst
+
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif
 L:	linux-rdma@vger.kernel.org
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index a029b27fd002..2cdaff90a9ec 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -510,4 +510,16 @@ config NETDEVSIM
 	  To compile this driver as a module, choose M here: the module
 	  will be called netdevsim.
 
+config NET_FAILOVER
+	tristate "Failover driver"
+	select FAILOVER
+	help
+	  This provides an automated failover mechanism via APIs to create
+	  and destroy a failover master netdev and manages a primary and
+	  standby slave netdevs that get registered via the generic failover
+	  infrastructure. This can be used by paravirtual drivers to enable
+	  an alternate low latency datapath. It also enables live migration of
+	  a VM with direct attached VF by failing over to the paravirtual
+	  datapath when the VF is unplugged.
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 91e67e375dd4..21cde7e78621 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_FUJITSU_ES) += fjes/
 thunderbolt-net-y += thunderbolt.o
 obj-$(CONFIG_THUNDERBOLT_NET) += thunderbolt-net.o
 obj-$(CONFIG_NETDEVSIM) += netdevsim/
+obj-$(CONFIG_NET_FAILOVER) += net_failover.o
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
new file mode 100644
index 000000000000..8b508e2cf29b
--- /dev/null
+++ b/drivers/net/net_failover.c
@@ -0,0 +1,836 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* This provides a net_failover interface for paravirtual drivers to
+ * provide an alternate datapath by exporting APIs to create and
+ * destroy an upper 'net_failover' netdev. The upper dev manages the
+ * original paravirtual interface as a 'standby' netdev and uses the
+ * generic failover infrastructure to register and manage a direct
+ * attached VF as a 'primary' netdev. This enables live migration of
+ * a VM with direct attached VF by failing over to the paravirtual
+ * datapath when the VF is unplugged.
+ *
+ * Some of the netdev management routines are based on the bond/team
+ * drivers as this driver provides active-backup functionality similar
+ * to those drivers.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/net_failover.h>
+
+static bool net_failover_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int net_failover_open(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int err;
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		err = dev_open(primary_dev);
+		if (err)
+			goto err_primary_open;
+	}
+
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		err = dev_open(standby_dev);
+		if (err)
+			goto err_standby_open;
+	}
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	}
+
+	return 0;
+
+err_standby_open:
+	dev_close(primary_dev);
+err_primary_open:
+	netif_tx_disable(dev);
+	return err;
+}
+
+static int net_failover_close(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	netif_tx_disable(dev);
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	return 0;
+}
+
+static netdev_tx_t net_failover_drop_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via primary netdev followed by standby netdev */
+	xmit_dev = rcu_dereference_bh(nfo_info->primary_dev);
+	if (!xmit_dev || !net_failover_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(nfo_info->standby_dev);
+		if (!xmit_dev || !net_failover_xmit_ready(xmit_dev))
+			return net_failover_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping =
qdisc_skb_cb(skb)->slave_dev_queue_mapping; + + return dev_queue_xmit(skb); +} + +static u16 net_failover_select_queue(struct net_device *dev, + struct sk_buff *skb, void *accel_priv, + select_queue_fallback_t fallback) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *primary_dev; + u16 txq; + + primary_dev = rcu_dereference(nfo_info->primary_dev); + if (primary_dev) { + const struct net_device_ops *ops = primary_dev->netdev_ops; + + if (ops->ndo_select_queue) + txq = ops->ndo_select_queue(primary_dev, skb, + accel_priv, fallback); + else + txq = fallback(primary_dev, skb); + + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; + + return txq; + } + + txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; + + /* Save the original txq to restore before passing to the driver */ + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; + + if (unlikely(txq >= dev->real_num_tx_queues)) { + do { + txq -= dev->real_num_tx_queues; + } while (txq >= dev->real_num_tx_queues); + } + + return txq; +} + +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but + * that some drivers can provide 32bit values only. + */ +static void net_failover_fold_stats(struct rtnl_link_stats64 *_res, + const struct rtnl_link_stats64 *_new, + const struct rtnl_link_stats64 *_old) +{ + const u64 *new = (const u64 *)_new; + const u64 *old = (const u64 *)_old; + u64 *res = (u64 *)_res; + int i; + + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { + u64 nv = new[i]; + u64 ov = old[i]; + s64 delta = nv - ov; + + /* detects if this particular field is 32bit only */ + if (((nv | ov) >> 32) == 0) + delta = (s64)(s32)((u32)nv - (u32)ov); + + /* filter anomalies, some drivers reset their stats + * at down/up events. 
+ */ + if (delta > 0) + res[i] += delta; + } +} + +static void net_failover_get_stats(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + const struct rtnl_link_stats64 *new; + struct rtnl_link_stats64 temp; + struct net_device *slave_dev; + + spin_lock(&nfo_info->stats_lock); + memcpy(stats, &nfo_info->failover_stats, sizeof(*stats)); + + rcu_read_lock(); + + slave_dev = rcu_dereference(nfo_info->primary_dev); + if (slave_dev) { + new = dev_get_stats(slave_dev, &temp); + net_failover_fold_stats(stats, new, &nfo_info->primary_stats); + memcpy(&nfo_info->primary_stats, new, sizeof(*new)); + } + + slave_dev = rcu_dereference(nfo_info->standby_dev); + if (slave_dev) { + new = dev_get_stats(slave_dev, &temp); + net_failover_fold_stats(stats, new, &nfo_info->standby_stats); + memcpy(&nfo_info->standby_stats, new, sizeof(*new)); + } + + rcu_read_unlock(); + + memcpy(&nfo_info->failover_stats, stats, sizeof(*stats)); + spin_unlock(&nfo_info->stats_lock); +} + +static int net_failover_change_mtu(struct net_device *dev, int new_mtu) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *primary_dev, *standby_dev; + int ret = 0; + + primary_dev = rcu_dereference(nfo_info->primary_dev); + if (primary_dev) { + ret = dev_set_mtu(primary_dev, new_mtu); + if (ret) + return ret; + } + + standby_dev = rcu_dereference(nfo_info->standby_dev); + if (standby_dev) { + ret = dev_set_mtu(standby_dev, new_mtu); + if (ret) { + if (primary_dev) + dev_set_mtu(primary_dev, dev->mtu); + return ret; + } + } + + dev->mtu = new_mtu; + + return 0; +} + +static void net_failover_set_rx_mode(struct net_device *dev) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *slave_dev; + + rcu_read_lock(); + + slave_dev = rcu_dereference(nfo_info->primary_dev); + if (slave_dev) { + dev_uc_sync_multiple(slave_dev, dev); + dev_mc_sync_multiple(slave_dev, dev); + } + + slave_dev = rcu_dereference(nfo_info->standby_dev); + if (slave_dev) { + dev_uc_sync_multiple(slave_dev, dev); + dev_mc_sync_multiple(slave_dev, dev); + } + + rcu_read_unlock(); +} + +static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto, + u16 vid) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *primary_dev, *standby_dev; + int ret = 0; + + primary_dev = rcu_dereference(nfo_info->primary_dev); + if (primary_dev) { + ret = vlan_vid_add(primary_dev, proto, vid); + if (ret) + return ret; + } + + standby_dev = rcu_dereference(nfo_info->standby_dev); + if (standby_dev) { + ret = vlan_vid_add(standby_dev, proto, vid); + if (ret) + if (primary_dev) + vlan_vid_del(primary_dev, proto, vid); + } + + return ret; +} + +static int net_failover_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, + u16 vid) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *slave_dev; + + slave_dev = rcu_dereference(nfo_info->primary_dev); + if (slave_dev) + vlan_vid_del(slave_dev, proto, vid); + + slave_dev = rcu_dereference(nfo_info->standby_dev); + if (slave_dev) + vlan_vid_del(slave_dev, proto, vid); + + return 0; +} + +static const struct net_device_ops failover_dev_ops = { + .ndo_open = net_failover_open, + .ndo_stop = net_failover_close, + .ndo_start_xmit = net_failover_start_xmit, + .ndo_select_queue = net_failover_select_queue, + .ndo_get_stats64 = net_failover_get_stats, + .ndo_change_mtu = net_failover_change_mtu, + .ndo_set_rx_mode = net_failover_set_rx_mode, + .ndo_vlan_rx_add_vid 
= net_failover_vlan_rx_add_vid, + .ndo_vlan_rx_kill_vid = net_failover_vlan_rx_kill_vid, + .ndo_validate_addr = eth_validate_addr, + .ndo_features_check = passthru_features_check, +}; + +#define FAILOVER_NAME "net_failover" +#define FAILOVER_VERSION "0.1" + +static void nfo_ethtool_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *drvinfo) +{ + strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver)); + strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version)); +} + +static int nfo_ethtool_get_link_ksettings(struct net_device *dev, + struct ethtool_link_ksettings *cmd) +{ + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *slave_dev; + + slave_dev = rtnl_dereference(nfo_info->primary_dev); + if (!slave_dev || !net_failover_xmit_ready(slave_dev)) { + slave_dev = rtnl_dereference(nfo_info->standby_dev); + if (!slave_dev || !net_failover_xmit_ready(slave_dev)) { + cmd->base.duplex = DUPLEX_UNKNOWN; + cmd->base.port = PORT_OTHER; + cmd->base.speed = SPEED_UNKNOWN; + + return 0; + } + } + + return __ethtool_get_link_ksettings(slave_dev, cmd); +} + +static const struct ethtool_ops failover_ethtool_ops = { + .get_drvinfo = nfo_ethtool_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_link_ksettings = nfo_ethtool_get_link_ksettings, +}; + +/* Called when slave dev is injecting data into network stack. + * Change the associated network device from lower dev to failover dev. + * note: already called with rcu_read_lock + */ +static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb) +{ + struct sk_buff *skb = *pskb; + struct net_device *dev = rcu_dereference(skb->dev->rx_handler_data); + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *primary_dev, *standby_dev; + + primary_dev = rcu_dereference(nfo_info->primary_dev); + standby_dev = rcu_dereference(nfo_info->standby_dev); + + if (primary_dev && skb->dev == standby_dev) + return RX_HANDLER_EXACT; + + skb->dev = dev; + + return RX_HANDLER_ANOTHER; +} + +static void net_failover_compute_features(struct net_device *dev) +{ + u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL; + netdev_features_t enc_features = FAILOVER_ENC_FEATURES; + unsigned short max_hard_header_len = ETH_HLEN; + unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE | + IFF_XMIT_DST_RELEASE_PERM; + struct net_failover_info *nfo_info = netdev_priv(dev); + struct net_device *primary_dev, *standby_dev; + + primary_dev = rcu_dereference(nfo_info->primary_dev); + if (primary_dev) { + vlan_features = + netdev_increment_features(vlan_features, + primary_dev->vlan_features, + FAILOVER_VLAN_FEATURES); + enc_features = + netdev_increment_features(enc_features, + primary_dev->hw_enc_features, + FAILOVER_ENC_FEATURES); + + dst_release_flag &= primary_dev->priv_flags; + if (primary_dev->hard_header_len > max_hard_header_len) + max_hard_header_len = primary_dev->hard_header_len; + } + + standby_dev = rcu_dereference(nfo_info->standby_dev); + if (standby_dev) { + vlan_features = + netdev_increment_features(vlan_features, + standby_dev->vlan_features, + FAILOVER_VLAN_FEATURES); + enc_features = + netdev_increment_features(enc_features, + standby_dev->hw_enc_features, + FAILOVER_ENC_FEATURES); + + dst_release_flag &= standby_dev->priv_flags; + if (standby_dev->hard_header_len > max_hard_header_len) + max_hard_header_len = standby_dev->hard_header_len; + } + + dev->vlan_features = vlan_features; + dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL; + dev->hard_header_len = 
max_hard_header_len;
+
+	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	if (dst_release_flag == (IFF_XMIT_DST_RELEASE |
+				 IFF_XMIT_DST_RELEASE_PERM))
+		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
+
+	netdev_change_features(dev);
+}
+
+static void net_failover_lower_state_changed(struct net_device *slave_dev,
+					     struct net_device *primary_dev,
+					     struct net_device *standby_dev)
+{
+	struct netdev_lag_lower_state_info info;
+
+	if (netif_carrier_ok(slave_dev))
+		info.link_up = true;
+	else
+		info.link_up = false;
+
+	if (slave_dev == primary_dev) {
+		if (netif_running(primary_dev))
+			info.tx_enabled = true;
+		else
+			info.tx_enabled = false;
+	} else {
+		if ((primary_dev && netif_running(primary_dev)) ||
+		    (!netif_running(standby_dev)))
+			info.tx_enabled = false;
+		else
+			info.tx_enabled = true;
+	}
+
+	netdev_lower_state_changed(slave_dev, &info);
+}
+
+static int net_failover_slave_pre_register(struct net_device *slave_dev,
+					   struct net_device *failover_dev)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct net_failover_info *nfo_info;
+	bool slave_is_standby;
+
+	nfo_info = netdev_priv(failover_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+	if (slave_is_standby ? standby_dev : primary_dev) {
+		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_dev->name,
+			   slave_is_standby ? "standby" : "primary");
+		return -EINVAL;
+	}
+
+	/* We want to allow only a direct attached VF device as a primary
+	 * netdev. As there is no easy way to check for a VF device, restrict
+	 * this to a pci device.
+	 */
+	if (!slave_is_standby && (!slave_dev->dev.parent ||
+				  !dev_is_pci(slave_dev->dev.parent)))
+		return -EINVAL;
+
+	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
+	    vlan_uses_dev(failover_dev)) {
+		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+			   failover_dev->name);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int net_failover_slave_register(struct net_device *slave_dev,
+				       struct net_device *failover_dev)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct net_failover_info *nfo_info;
+	bool slave_is_standby;
+	u32 orig_mtu;
+	int err;
+
+	/* Align MTU of slave with failover dev */
+	orig_mtu = slave_dev->mtu;
+	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	if (err) {
+		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+			   slave_dev->name, failover_dev->mtu);
+		goto done;
+	}
+
+	dev_hold(slave_dev);
+
+	if (netif_running(failover_dev)) {
+		err = dev_open(slave_dev);
+		if (err && (err != -EBUSY)) {
+			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+				   slave_dev->name, err);
+			goto err_dev_open;
+		}
+	}
+
+	netif_addr_lock_bh(failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	dev_mc_sync_multiple(slave_dev, failover_dev);
+	netif_addr_unlock_bh(failover_dev);
+
+	err = vlan_vids_add_by_dev(slave_dev, failover_dev);
+	if (err) {
+		netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
+			   slave_dev->name, err);
+		goto err_vlan_add;
+	}
+
+	nfo_info = netdev_priv(failover_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+
+	if (slave_is_standby) {
+		rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
+		standby_dev = slave_dev;
+
dev_get_stats(standby_dev, &nfo_info->standby_stats); + } else { + rcu_assign_pointer(nfo_info->primary_dev, slave_dev); + primary_dev = slave_dev; + dev_get_stats(primary_dev, &nfo_info->primary_stats); + failover_dev->min_mtu = slave_dev->min_mtu; + failover_dev->max_mtu = slave_dev->max_mtu; + } + + net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev); + net_failover_compute_features(failover_dev); + + call_netdevice_notifiers(NETDEV_JOIN, slave_dev); + + netdev_info(failover_dev, "failover %s slave:%s registered\n", + slave_is_standby ? "standby" : "primary", slave_dev->name); + + return 0; + +err_vlan_add: + dev_uc_unsync(slave_dev, failover_dev); + dev_mc_unsync(slave_dev, failover_dev); + dev_close(slave_dev); +err_dev_open: + dev_put(slave_dev); + dev_set_mtu(slave_dev, orig_mtu); +done: + return err; +} + +static int net_failover_slave_pre_unregister(struct net_device *slave_dev, + struct net_device *failover_dev) +{ + struct net_device *standby_dev, *primary_dev; + struct net_failover_info *nfo_info; + + nfo_info = netdev_priv(failover_dev); + primary_dev = rtnl_dereference(nfo_info->primary_dev); + standby_dev = rtnl_dereference(nfo_info->standby_dev); + + if (slave_dev != primary_dev && slave_dev != standby_dev) + return -ENODEV; + + return 0; +} + +static int net_failover_slave_unregister(struct net_device *slave_dev, + struct net_device *failover_dev) +{ + struct net_device *standby_dev, *primary_dev; + struct net_failover_info *nfo_info; + bool slave_is_standby; + + nfo_info = netdev_priv(failover_dev); + primary_dev = rtnl_dereference(nfo_info->primary_dev); + standby_dev = rtnl_dereference(nfo_info->standby_dev); + + vlan_vids_del_by_dev(slave_dev, failover_dev); + dev_uc_unsync(slave_dev, failover_dev); + dev_mc_unsync(slave_dev, failover_dev); + dev_close(slave_dev); + + nfo_info = netdev_priv(failover_dev); + dev_get_stats(failover_dev, &nfo_info->failover_stats); + + slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent; + if (slave_is_standby) { + RCU_INIT_POINTER(nfo_info->standby_dev, NULL); + } else { + RCU_INIT_POINTER(nfo_info->primary_dev, NULL); + if (standby_dev) { + failover_dev->min_mtu = standby_dev->min_mtu; + failover_dev->max_mtu = standby_dev->max_mtu; + } + } + + dev_put(slave_dev); + + net_failover_compute_features(failover_dev); + + netdev_info(failover_dev, "failover %s slave:%s unregistered\n", + slave_is_standby ? 
"standby" : "primary", slave_dev->name); + + return 0; +} + +static int net_failover_slave_link_change(struct net_device *slave_dev, + struct net_device *failover_dev) +{ + struct net_device *primary_dev, *standby_dev; + struct net_failover_info *nfo_info; + + nfo_info = netdev_priv(failover_dev); + + primary_dev = rtnl_dereference(nfo_info->primary_dev); + standby_dev = rtnl_dereference(nfo_info->standby_dev); + + if (slave_dev != primary_dev && slave_dev != standby_dev) + return -ENODEV; + + if ((primary_dev && net_failover_xmit_ready(primary_dev)) || + (standby_dev && net_failover_xmit_ready(standby_dev))) { + netif_carrier_on(failover_dev); + netif_tx_wake_all_queues(failover_dev); + } else { + dev_get_stats(failover_dev, &nfo_info->failover_stats); + netif_carrier_off(failover_dev); + netif_tx_stop_all_queues(failover_dev); + } + + net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev); + + return 0; +} + +static int net_failover_slave_name_change(struct net_device *slave_dev, + struct net_device *failover_dev) +{ + struct net_device *primary_dev, *standby_dev; + struct net_failover_info *nfo_info; + + nfo_info = netdev_priv(failover_dev); + + primary_dev = rtnl_dereference(nfo_info->primary_dev); + standby_dev = rtnl_dereference(nfo_info->standby_dev); + + if (slave_dev != primary_dev && slave_dev != standby_dev) + return -ENODEV; + + /* We need to bring up the slave after the rename by udev in case + * open failed with EBUSY when it was registered. + */ + dev_open(slave_dev); + + return 0; +} + +static struct failover_ops net_failover_ops = { + .slave_pre_register = net_failover_slave_pre_register, + .slave_register = net_failover_slave_register, + .slave_pre_unregister = net_failover_slave_pre_unregister, + .slave_unregister = net_failover_slave_unregister, + .slave_link_change = net_failover_slave_link_change, + .slave_name_change = net_failover_slave_name_change, + .slave_handle_frame = net_failover_handle_frame, +}; + +/** + * net_failover_create - Create and register a failover instance + * + * @dev: standby netdev + * + * Creates a failover netdev and registers a failover instance for a standby + * netdev. Used by paravirtual drivers that use 3-netdev model. + * The failover netdev acts as a master device and controls 2 slave devices - + * the original standby netdev and a VF netdev with the same MAC gets + * registered as primary netdev. + * + * Return: pointer to failover instance + */ +struct failover *net_failover_create(struct net_device *standby_dev) +{ + struct device *dev = standby_dev->dev.parent; + struct net_device *failover_dev; + struct failover *failover; + int err; + + /* Alloc at least 2 queues, for now we are going with 16 assuming + * that VF devices being enslaved won't have too many queues. + */ + failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16); + if (!failover_dev) { + dev_err(dev, "Unable to allocate failover_netdev!\n"); + return ERR_PTR(-ENOMEM); + } + + dev_net_set(failover_dev, dev_net(standby_dev)); + SET_NETDEV_DEV(failover_dev, dev); + + failover_dev->netdev_ops = &failover_dev_ops; + failover_dev->ethtool_ops = &failover_ethtool_ops; + + /* Initialize the device options */ + failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE; + failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | + IFF_TX_SKB_SHARING); + + /* don't acquire failover netdev's netif_tx_lock when transmitting */ + failover_dev->features |= NETIF_F_LLTX; + + /* Don't allow failover devices to change network namespaces. 
*/
+	failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+	failover_dev->hw_features = FAILOVER_VLAN_FEATURES |
+				    NETIF_F_HW_VLAN_CTAG_TX |
+				    NETIF_F_HW_VLAN_CTAG_RX |
+				    NETIF_F_HW_VLAN_CTAG_FILTER;
+
+	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	failover_dev->features |= failover_dev->hw_features;
+
+	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+	       failover_dev->addr_len);
+
+	failover_dev->min_mtu = standby_dev->min_mtu;
+	failover_dev->max_mtu = standby_dev->max_mtu;
+
+	err = register_netdev(failover_dev);
+	if (err) {
+		dev_err(dev, "Unable to register failover_dev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(failover_dev);
+
+	failover = failover_register(failover_dev, &net_failover_ops);
+	if (IS_ERR(failover)) {
+		err = PTR_ERR(failover);
+		goto err_failover_register;
+	}
+
+	return failover;
+
+err_failover_register:
+	unregister_netdev(failover_dev);
+err_register_netdev:
+	free_netdev(failover_dev);
+
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(net_failover_create);
+
+/**
+ * net_failover_destroy - Destroy a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters any slave netdevs associated with the failover instance by
+ * calling failover_slave_unregister(), unregisters the failover instance
+ * itself, and finally frees the failover netdev. Used by paravirtual
+ * drivers that use the 3-netdev model.
+ */
+void net_failover_destroy(struct failover *failover)
+{
+	struct net_failover_info *nfo_info;
+	struct net_device *failover_dev;
+	struct net_device *slave_dev;
+
+	if (!failover)
+		return;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+	nfo_info = netdev_priv(failover_dev);
+
+	netif_device_detach(failover_dev);
+
+	rtnl_lock();
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	failover_unregister(failover);
+
+	unregister_netdevice(failover_dev);
+
+	rtnl_unlock();
+
+	free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(net_failover_destroy);
+
+static __init int
+net_failover_init(void)
+{
+	return 0;
+}
+module_init(net_failover_init);
+
+static __exit
+void net_failover_exit(void)
+{
+}
+module_exit(net_failover_exit);
+
+MODULE_DESCRIPTION("Failover driver for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
new file mode 100644
index 000000000000..b12a1c469d1c
--- /dev/null
+++ b/include/net/net_failover.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <net/failover.h>
+
+/* failover state */
+struct net_failover_info {
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+struct failover *net_failover_create(struct net_device *standby_dev);
+void net_failover_destroy(struct failover *failover);
+
+#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+				 NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+#endif /* _NET_FAILOVER_H */
--
cgit v1.2.3


From ba5e4426e80e0435358c7117c339e6a4c22c34ad Mon Sep 17 00:00:00 2001
From: Sridhar Samudrala
Date: Thu, 24 May 2018 09:55:17 -0700
Subject: virtio_net: Extend virtio to use VF datapath when available

This patch enables virtio_net to switch over to a VF datapath when the
STANDBY feature is enabled and a VF netdev is present with the same MAC
address. It allows live migration of a VM with a direct attached VF
without the need to setup a bond/team between a VF and virtio net device
in the guest.

It uses the API that is exported by the net_failover driver to create
and destroy a master failover netdev. When the STANDBY feature is enabled,
an additional netdev (failover netdev) is created that acts as a master
device and tracks the state of the 2 lower netdevs. The original
virtio_net netdev is marked as 'standby' netdev and a passthru device with
the same MAC gets registered as 'primary' netdev.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala
Signed-off-by: David S. Miller
---
 Documentation/networking/net_failover.rst | 90 +++++++++++++++++++++++++++++++
 drivers/net/Kconfig                       |  1 +
 drivers/net/virtio_net.c                  | 38 ++++++++++++-
 3 files changed, 128 insertions(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
index d4513ad31809..70ca2f5800c4 100644
--- a/Documentation/networking/net_failover.rst
+++ b/Documentation/networking/net_failover.rst
@@ -24,3 +24,93 @@ This can be used by paravirtual drivers to enable an alternate low latency
 datapath. It also enables hypervisor controlled live migration of a VM with
 direct attached VF by failing over to the paravirtual datapath when the VF
 is unplugged.
+
+virtio-net accelerated datapath: STANDBY mode
+=============================================
+
+net_failover enables hypervisor controlled accelerated datapath to virtio-net
+enabled VMs in a transparent manner with no/minimal guest userspace changes.
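A minimal sketch of the guest-side gating described here, mirroring the
virtio_net changes later in this patch (vdev, vi and the error path are
borrowed from that code; unwinding is trimmed):

	/* In virtnet_probe(): only create the failover master netdev when
	 * the hypervisor offers the STANDBY feature bit.
	 */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
		vi->failover = net_failover_create(vi->dev);
		if (IS_ERR(vi->failover))
			goto free_vqs;
	}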
+
+To support this, the hypervisor needs to enable the VIRTIO_NET_F_STANDBY
+feature on the virtio-net interface and assign the same MAC address to both
+virtio-net and VF interfaces.
+
+Here is an example XML snippet that shows such a configuration.
+
+  <interface type='network'>
+    <mac address='52:54:00:00:12:53'/>
+    <source network='enp66s0f0_br'/>
+    <target dev='tap01'/>
+    <model type='virtio'/>
+    <driver name='vhost' queues='4'/>
+    <link state='down'/>
+  </interface>
+
+  <interface type='hostdev' managed='yes'>
+    <mac address='52:54:00:00:12:53'/>
+    <source>
+      <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+    </source>
+  </interface>
+
+Booting a VM with the above configuration will result in the following 3
+netdevs created in the VM.
+
+4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
+    link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+    inet 192.168.12.53/24 brd 192.168.12.255 scope global dynamic ens10
+       valid_lft 42482sec preferred_lft 42482sec
+    inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
+       valid_lft forever preferred_lft forever
+5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000
+    link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
+    link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
+
+ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave
+'standby' and 'primary' netdevs respectively.
+
+Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
+==================================================================
+
+net_failover also enables hypervisor controlled live migration to be supported
+with VMs that have direct attached SR-IOV VF devices by automatic failover to
+the paravirtual datapath when the VF is unplugged.
+
+Here is a sample script that shows the steps to initiate live migration on
+the source hypervisor.
+
+# cat vf_xml
+<interface type='hostdev' managed='yes'>
+  <mac address='52:54:00:00:12:53'/>
+  <source>
+    <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
+  </source>
+</interface>
+
+# Source Hypervisor
+#!/bin/bash
+
+DOMAIN=fedora27-tap01
+PF=enp66s0f0
+VF_NUM=5
+TAP_IF=tap01
+VF_XML=
+
+MAC=52:54:00:00:12:53
+ZERO_MAC=00:00:00:00:00:00
+
+virsh domif-setlink $DOMAIN $TAP_IF up
+bridge fdb del $MAC dev $PF master
+virsh detach-device $DOMAIN $VF_XML
+ip link set $PF vf $VF_NUM mac $ZERO_MAC
+
+virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+# Destination Hypervisor
+#!/bin/bash
+
+virsh attach-device $DOMAIN $VF_XML
+virsh domif-setlink $DOMAIN $TAP_IF down
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 2cdaff90a9ec..d03775100f7d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -332,6 +332,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select NET_FAILOVER
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fe4534ee50a1..8f08a3e1bbaa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include <linux/cpu.h>
 #include <linux/average.h>
 #include <linux/filter.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
 #include <net/route.h>
 #include <net/xdp.h>
+#include <net/net_failover.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -210,6 +213,9 @@ struct virtnet_info {
 	u32 speed;
 
 	unsigned long guest_offloads;
+
+	/* failover when STANDBY feature enabled */
+	struct failover *failover;
 };
 
 struct padded_vnet_hdr {
@@ -1554,6 +1560,9 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
 	struct sockaddr *addr;
 	struct scatterlist sg;
 
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
 	addr = kmemdup(p, sizeof(*addr), GFP_KERNEL);
 	if (!addr)
 		return -ENOMEM;
@@ -2337,6 +2346,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+				      size_t len)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int ret;
+
+	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
+		return -EOPNOTSUPP;
+
+	ret = snprintf(buf, len, "sby");
+	if (ret >= len)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop            = virtnet_close,
@@ -2354,6 +2379,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_xdp_xmit		= virtnet_xdp_xmit,
 	.ndo_xdp_flush		= virtnet_xdp_flush,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_get_phys_port_name	= virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2907,10 +2933,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	virtnet_init_settings(dev);
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+		vi->failover = net_failover_create(vi->dev);
+		if (IS_ERR(vi->failover))
+			goto free_vqs;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
-		goto free_vqs;
+		goto free_failover;
 	}
 
 	virtio_device_ready(vdev);
@@ -2947,6 +2979,8 @@ free_unregister_netdev:
 	vi->vdev->config->reset(vdev);
 
 	unregister_netdev(dev);
+free_failover:
+	net_failover_destroy(vi->failover);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
@@ -2981,6 +3015,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	net_failover_destroy(vi->failover);
+
 	remove_vq_common(vi);
 
 	free_netdev(vi->dev);
--
cgit v1.2.3


From a98ac8bd24791afbdd092eb53e89b536d8996464 Mon Sep 17 00:00:00 2001
From: Yangbo Lu
Date: Fri, 25 May 2018 12:40:37 +0800
Subject: dt-bindings: ptp: add ptp-qoriq.txt This patch is to add a documentation for ptp_qoriq dt-bindings. The description for ptp_qoriq dt-bindings was actually moved from Documentation/devicetree/bindings/net/fsl-tsec-phy.txt, since gianfar_ptp driver was moved to ptp_qoriq driver. Signed-off-by: Yangbo Lu Signed-off-by: David S. Miller --- .../devicetree/bindings/net/fsl-tsec-phy.txt | 68 +-------------------- .../devicetree/bindings/ptp/ptp-qoriq.txt | 69 ++++++++++++++++++++++ 2 files changed, 70 insertions(+), 67 deletions(-) create mode 100644 Documentation/devicetree/bindings/ptp/ptp-qoriq.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt index 79bf352e659c..047bdf7bdd2f 100644 --- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt +++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt @@ -86,70 +86,4 @@ Example: * Gianfar PTP clock nodes -General Properties: - - - compatible Should be "fsl,etsec-ptp" - - reg Offset and length of the register set for the device - - interrupts There should be at least two interrupts. Some devices - have as many as four PTP related interrupts. - -Clock Properties: - - - fsl,cksel Timer reference clock source. - - fsl,tclk-period Timer reference clock period in nanoseconds. - - fsl,tmr-prsc Prescaler, divides the output clock. - - fsl,tmr-add Frequency compensation value. - - fsl,tmr-fiper1 Fixed interval period pulse generator. - - fsl,tmr-fiper2 Fixed interval period pulse generator. - - fsl,max-adj Maximum frequency adjustment in parts per billion. - - These properties set the operational parameters for the PTP - clock. You must choose these carefully for the clock to work right. - Here is how to figure good values: - - TimerOsc = selected reference clock MHz - tclk_period = desired clock period nanoseconds - NominalFreq = 1000 / tclk_period MHz - FreqDivRatio = TimerOsc / NominalFreq (must be greater that 1.0) - tmr_add = ceil(2^32 / FreqDivRatio) - OutputClock = NominalFreq / tmr_prsc MHz - PulseWidth = 1 / OutputClock microseconds - FiperFreq1 = desired frequency in Hz - FiperDiv1 = 1000000 * OutputClock / FiperFreq1 - tmr_fiper1 = tmr_prsc * tclk_period * FiperDiv1 - tclk_period - max_adj = 1000000000 * (FreqDivRatio - 1.0) - 1 - - The calculation for tmr_fiper2 is the same as for tmr_fiper1. The - driver expects that tmr_fiper1 will be correctly set to produce a 1 - Pulse Per Second (PPS) signal, since this will be offered to the PPS - subsystem to synchronize the Linux clock. - - Reference clock source is determined by the value, which is holded - in CKSEL bits in TMR_CTRL register. "fsl,cksel" property keeps the - value, which will be directly written in those bits, that is why, - according to reference manual, the next clock sources can be used: - - <0> - external high precision timer reference clock (TSEC_TMR_CLK - input is used for this purpose); - <1> - eTSEC system clock; - <2> - eTSEC1 transmit clock; - <3> - RTC clock input. - - When this attribute is not used, eTSEC system clock will serve as - IEEE 1588 timer reference clock. 
-
-Example:
-
-	ptp_clock@24e00 {
-		compatible = "fsl,etsec-ptp";
-		reg = <0x24E00 0xB0>;
-		interrupts = <12 0x8 13 0x8>;
-		interrupt-parent = < &ipic >;
-		fsl,cksel       = <1>;
-		fsl,tclk-period = <10>;
-		fsl,tmr-prsc    = <100>;
-		fsl,tmr-add     = <0x999999A4>;
-		fsl,tmr-fiper1  = <0x3B9AC9F6>;
-		fsl,tmr-fiper2  = <0x00018696>;
-		fsl,max-adj     = <659999998>;
-	};
+Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
new file mode 100644
index 000000000000..0f569d8e73a3
--- /dev/null
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -0,0 +1,69 @@
+* Freescale QorIQ 1588 timer based PTP clock
+
+General Properties:
+
+  - compatible   Should be "fsl,etsec-ptp"
+  - reg          Offset and length of the register set for the device
+  - interrupts   There should be at least two interrupts. Some devices
+                 have as many as four PTP related interrupts.
+
+Clock Properties:
+
+  - fsl,cksel        Timer reference clock source.
+  - fsl,tclk-period  Timer reference clock period in nanoseconds.
+  - fsl,tmr-prsc     Prescaler, divides the output clock.
+  - fsl,tmr-add      Frequency compensation value.
+  - fsl,tmr-fiper1   Fixed interval period pulse generator.
+  - fsl,tmr-fiper2   Fixed interval period pulse generator.
+  - fsl,max-adj      Maximum frequency adjustment in parts per billion.
+
+  These properties set the operational parameters for the PTP
+  clock. You must choose these carefully for the clock to work right.
+  Here is how to figure good values:
+
+  TimerOsc     = selected reference clock   MHz
+  tclk_period  = desired clock period       nanoseconds
+  NominalFreq  = 1000 / tclk_period         MHz
+  FreqDivRatio = TimerOsc / NominalFreq     (must be greater than 1.0)
+  tmr_add      = ceil(2^32 / FreqDivRatio)
+  OutputClock  = NominalFreq / tmr_prsc     MHz
+  PulseWidth   = 1 / OutputClock            microseconds
+  FiperFreq1   = desired frequency in Hz
+  FiperDiv1    = 1000000 * OutputClock / FiperFreq1
+  tmr_fiper1   = tmr_prsc * tclk_period * FiperDiv1 - tclk_period
+  max_adj      = 1000000000 * (FreqDivRatio - 1.0) - 1
+
+  The calculation for tmr_fiper2 is the same as for tmr_fiper1. The
+  driver expects that tmr_fiper1 will be correctly set to produce a 1
+  Pulse Per Second (PPS) signal, since this will be offered to the PPS
+  subsystem to synchronize the Linux clock.
+
+  Reference clock source is determined by the value, which is held
+  in the CKSEL bits in the TMR_CTRL register. The "fsl,cksel" property
+  keeps the value, which will be directly written in those bits; that is
+  why, according to the reference manual, the following clock sources
+  can be used:
+
+  <0> - external high precision timer reference clock (TSEC_TMR_CLK
+        input is used for this purpose);
+  <1> - eTSEC system clock;
+  <2> - eTSEC1 transmit clock;
+  <3> - RTC clock input.
+
+  When this attribute is not used, eTSEC system clock will serve as
+  IEEE 1588 timer reference clock.
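To make these formulas concrete, here is one self-consistent worked set of
values (hypothetical numbers, not taken from the example node below;
assumes a 200 MHz reference clock and a 1 PPS fiper):

  TimerOsc     = 200 MHz
  tclk_period  = 10 ns              -> NominalFreq = 1000 / 10 = 100 MHz
  FreqDivRatio = 200 / 100 = 2.0    (greater than 1.0, as required)
  tmr_add      = ceil(2^32 / 2.0) = 0x80000000
  tmr_prsc     = 2                  -> OutputClock = 100 / 2 = 50 MHz
  PulseWidth   = 1 / 50 = 0.02 microseconds
  FiperFreq1   = 1 Hz (for PPS)     -> FiperDiv1 = 1000000 * 50 / 1 = 50000000
  tmr_fiper1   = 2 * 10 * 50000000 - 10 = 999999990   (one second minus one tick)
  max_adj      = 1000000000 * (2.0 - 1.0) - 1 = 999999999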
+
+Example:
+
+	ptp_clock@24e00 {
+		compatible = "fsl,etsec-ptp";
+		reg = <0x24E00 0xB0>;
+		interrupts = <12 0x8 13 0x8>;
+		interrupt-parent = < &ipic >;
+		fsl,cksel       = <1>;
+		fsl,tclk-period = <10>;
+		fsl,tmr-prsc    = <100>;
+		fsl,tmr-add     = <0x999999A4>;
+		fsl,tmr-fiper1  = <0x3B9AC9F6>;
+		fsl,tmr-fiper2  = <0x00018696>;
+		fsl,max-adj     = <659999998>;
+	};
--
cgit v1.2.3


From 1f3feacbfd48a3fa740e3561c9b7cbc8cf694d1f Mon Sep 17 00:00:00 2001
From: Christophe Roullier
Date: Fri, 25 May 2018 09:46:39 +0200
Subject: dt-bindings: stm32-dwmac: add support of MPU families

Add descriptions for the Ethernet MPU family fields.

Signed-off-by: Christophe Roullier
Reviewed-by: Rob Herring
Signed-off-by: David S. Miller
---
 Documentation/devicetree/bindings/net/stm32-dwmac.txt | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/net/stm32-dwmac.txt b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
index 489dbcb66c5a..1341012722aa 100644
--- a/Documentation/devicetree/bindings/net/stm32-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
@@ -6,14 +6,28 @@ Please see stmmac.txt for the other unchanged properties.
 
 The device node has following properties.
 
 Required properties:
-- compatible: Should be "st,stm32-dwmac" to select glue, and
+- compatible: For MCU family should be "st,stm32-dwmac" to select glue, and
 	      "snps,dwmac-3.50a" to select IP version.
+	      For MPU family should be "st,stm32mp1-dwmac" to select
+	      glue, and "snps,dwmac-4.20a" to select IP version.
 - clocks: Must contain a phandle for each entry in clock-names.
 - clock-names: Should be "stmmaceth" for the host clock.
 	       Should be "mac-clk-tx" for the MAC TX clock.
 	       Should be "mac-clk-rx" for the MAC RX clock.
+	       For MPU family, also add "ethstp" for the power mode clock and
+	       "syscfg-clk" for the SYSCFG clock.
+- interrupt-names: Should contain a list of interrupt names corresponding to
+	   the interrupts in the interrupts property, if available.
+	   Should be "macirq" for the main MAC IRQ.
+	   Should be "eth_wake_irq" for the interrupt that wakes up the system.
 - st,syscon : Should be phandle/offset pair. The phandle to the syscon node which
-	      encompases the glue register, and the offset of the control register.
+	      encompasses the glue register, and the offset of the control register.
+
+Optional properties:
+- clock-names: For MPU family, "mac-clk-ck" for PHY without quartz.
+- st,int-phyclk (boolean) : valid only when the PHY does not have quartz and
+	      needs to be clocked by RCC.
+
 Example:
 
 	ethernet@40028000 {
--
cgit v1.2.3


From 1f809b47e570fe420adc93ba55712a8f8f6d5be7 Mon Sep 17 00:00:00 2001
From: Christophe Roullier
Date: Fri, 25 May 2018 09:46:41 +0200
Subject: dt-bindings: stm32: add compatible for syscon

This patch describes the syscon DT bindings.

Signed-off-by: Christophe Roullier
Reviewed-by: Rob Herring
Signed-off-by: David S.
Miller
---
 Documentation/devicetree/bindings/arm/stm32.txt        | 10 ----------
 .../devicetree/bindings/arm/stm32/stm32-syscon.txt     | 14 ++++++++++++++
 Documentation/devicetree/bindings/arm/stm32/stm32.txt  | 10 ++++++++++
 3 files changed, 24 insertions(+), 10 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/arm/stm32.txt
 create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
 create mode 100644 Documentation/devicetree/bindings/arm/stm32/stm32.txt

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/arm/stm32.txt b/Documentation/devicetree/bindings/arm/stm32.txt
deleted file mode 100644
index 6808ed9ddfd5..000000000000
--- a/Documentation/devicetree/bindings/arm/stm32.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-STMicroelectronics STM32 Platforms Device Tree Bindings
-
-Each device tree must specify which STM32 SoC it uses,
-using one of the following compatible strings:
-
-  st,stm32f429
-  st,stm32f469
-  st,stm32f746
-  st,stm32h743
-  st,stm32mp157
diff --git a/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt b/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
new file mode 100644
index 000000000000..99980aee26e5
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/stm32/stm32-syscon.txt
@@ -0,0 +1,14 @@
+STMicroelectronics STM32 Platforms System Controller
+
+Properties:
+   - compatible : should contain two values. The first value must be
+     "st,stm32mp157-syscfg" for stm32mp157 based SoCs; the second value
+     must always be "syscon".
+   - reg : offset and length of the register set.
+
+ Example:
+	syscfg: syscon@50020000 {
+		compatible = "st,stm32mp157-syscfg", "syscon";
+		reg = <0x50020000 0x400>;
+	};
+
diff --git a/Documentation/devicetree/bindings/arm/stm32/stm32.txt b/Documentation/devicetree/bindings/arm/stm32/stm32.txt
new file mode 100644
index 000000000000..6808ed9ddfd5
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/stm32/stm32.txt
@@ -0,0 +1,10 @@
+STMicroelectronics STM32 Platforms Device Tree Bindings
+
+Each device tree must specify which STM32 SoC it uses,
+using one of the following compatible strings:
+
+  st,stm32f429
+  st,stm32f469
+  st,stm32f746
+  st,stm32h743
+  st,stm32mp157
--
cgit v1.2.3


From bbff2f321a864ee07c9d3d1245af498023146951 Mon Sep 17 00:00:00 2001
From: Björn Töpel
Date: Mon, 4 Jun 2018 13:57:13 +0200
Subject: xsk: new descriptor addressing scheme
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently, AF_XDP only supports a fixed frame-size memory scheme where
each frame is referenced via an index (idx). A user passes the frame
index to the kernel, and the kernel acts upon the data. Some NICs,
however, do not have a fixed frame-size model, instead they have a
model where a memory window is passed to the hardware and multiple
frames are filled into that window (referred to as the "type-writer"
model).

By changing the descriptor format from the current frame index
addressing scheme, AF_XDP can in the future be extended to support
these kinds of NICs.

In the index-based model, an idx refers to a frame of size frame_size.
Addressing a frame in the UMEM is done by offsetting the UMEM starting
address by a global offset, idx * frame_size + offset. Communicating
via the fill- and completion-rings is done by means of idx.

In this commit, the idx is removed in favor of an address (addr),
which is a relative address ranging over the UMEM. To convert an
idx-based address to the new addr is simply: addr = idx * frame_size +
offset.
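As an editorial sketch of that conversion (the helper name is invented;
the arithmetic is exactly the formula above):

	#include <stdint.h>

	/* Convert an old-style (idx, offset) reference into the new
	 * relative UMEM address. frame_size is the fixed size used by
	 * the old scheme.
	 */
	static inline uint64_t frame_idx_to_addr(uint32_t idx,
						 uint32_t frame_size,
						 uint32_t offset)
	{
		return (uint64_t)idx * frame_size + offset;
	}

	/* e.g. idx = 3, frame_size = 2048, offset = 256 -> addr = 6400;
	 * the kernel masks fill-ring addrs down to the chunk start:
	 * 6400 & ~2047 = 6144.
	 */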
We also stop referring to the UMEM "frame" as a frame. Instead it is simply called a chunk. To transfer ownership of a chunk to the kernel, the addr of the chunk is passed in the fill-ring. Note, that the kernel will mask addr to make it chunk aligned, so there is no need for userspace to do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or 3000 to the fill-ring will refer to the same chunk. On the completion-ring, the addr will match that of the Tx descriptor, passed to the kernel. Changing the descriptor format to use chunks/addr will allow for future changes to move to a type-writer based model, where multiple frames can reside in one chunk. In this model passing one single chunk into the fill-ring, would potentially result in multiple Rx descriptors. This commit changes the uapi of AF_XDP sockets, and updates the documentation. Signed-off-by: Björn Töpel Signed-off-by: Daniel Borkmann --- Documentation/networking/af_xdp.rst | 101 +++++++++++++++++++++--------------- include/uapi/linux/if_xdp.h | 12 ++--- net/xdp/xdp_umem.c | 33 ++++++------ net/xdp/xdp_umem.h | 27 +++------- net/xdp/xdp_umem_props.h | 4 +- net/xdp/xsk.c | 30 ++++++----- net/xdp/xsk_queue.c | 2 +- net/xdp/xsk_queue.h | 43 +++++++-------- 8 files changed, 123 insertions(+), 129 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 91928d9ee4bf..ff929cfab4f4 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -12,7 +12,7 @@ packet processing. This document assumes that the reader is familiar with BPF and XDP. If not, the Cilium project has an excellent reference guide at -http://cilium.readthedocs.io/en/doc-1.0/bpf/. +http://cilium.readthedocs.io/en/latest/bpf/. Using the XDP_REDIRECT action from an XDP program, the program can redirect ingress frames to other XDP enabled netdevs, using the @@ -33,22 +33,22 @@ for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This again avoids copying data. -The UMEM consists of a number of equally size frames and each frame -has a unique frame id. A descriptor in one of the rings references a -frame by referencing its frame id. The user space allocates memory for -this UMEM using whatever means it feels is most appropriate (malloc, -mmap, huge pages, etc). This memory area is then registered with the -kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two -rings: the FILL ring and the COMPLETION ring. The fill ring is used by -the application to send down frame ids for the kernel to fill in with -RX packet data. References to these frames will then appear in the RX -ring once each packet has been received. The completion ring, on the -other hand, contains frame ids that the kernel has transmitted -completely and can now be used again by user space, for either TX or -RX. Thus, the frame ids appearing in the completion ring are ids that -were previously transmitted using the TX ring. In summary, the RX and -FILL rings are used for the RX path and the TX and COMPLETION rings -are used for the TX path. +The UMEM consists of a number of equally sized chunks. A descriptor in +one of the rings references a frame by referencing its addr. The addr +is simply an offset within the entire UMEM region. The user space +allocates memory for this UMEM using whatever means it feels is most +appropriate (malloc, mmap, huge pages, etc). 
This memory area is then +registered with the kernel using the new setsockopt XDP_UMEM_REG. The +UMEM also has two rings: the FILL ring and the COMPLETION ring. The +fill ring is used by the application to send down addr for the kernel +to fill in with RX packet data. References to these frames will then +appear in the RX ring once each packet has been received. The +completion ring, on the other hand, contains frame addr that the +kernel has transmitted completely and can now be used again by user +space, for either TX or RX. Thus, the frame addrs appearing in the +completion ring are addrs that were previously transmitted using the +TX ring. In summary, the RX and FILL rings are used for the RX path +and the TX and COMPLETION rings are used for the TX path. The socket is then finally bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is @@ -59,13 +59,13 @@ wants to do this, it simply skips the registration of the UMEM and its corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind call and submits the XSK of the process it would like to share UMEM with as well as its own newly created XSK socket. The new process will -then receive frame id references in its own RX ring that point to this -shared UMEM. Note that since the ring structures are single-consumer / -single-producer (for performance reasons), the new process has to -create its own socket with associated RX and TX rings, since it cannot -share this with the other process. This is also the reason that there -is only one set of FILL and COMPLETION rings per UMEM. It is the -responsibility of a single process to handle the UMEM. +then receive frame addr references in its own RX ring that point to +this shared UMEM. Note that since the ring structures are +single-consumer / single-producer (for performance reasons), the new +process has to create its own socket with associated RX and TX rings, +since it cannot share this with the other process. This is also the +reason that there is only one set of FILL and COMPLETION rings per +UMEM. It is the responsibility of a single process to handle the UMEM. How is then packets distributed from an XDP program to the XSKs? There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The @@ -102,10 +102,10 @@ UMEM UMEM is a region of virtual contiguous memory, divided into equal-sized frames. An UMEM is associated to a netdev and a specific -queue id of that netdev. It is created and configured (frame size, -frame headroom, start address and size) by using the XDP_UMEM_REG -setsockopt system call. A UMEM is bound to a netdev and queue id, via -the bind() system call. +queue id of that netdev. It is created and configured (chunk size, +headroom, start address and size) by using the XDP_UMEM_REG setsockopt +system call. A UMEM is bound to a netdev and queue id, via the bind() +system call. An AF_XDP is socket linked to a single UMEM, but one UMEM can have multiple AF_XDP sockets. To share an UMEM created via one socket A, @@ -147,13 +147,17 @@ UMEM Fill Ring ~~~~~~~~~~~~~~ The Fill ring is used to transfer ownership of UMEM frames from -user-space to kernel-space. The UMEM indicies are passed in the -ring. As an example, if the UMEM is 64k and each frame is 4k, then the -UMEM has 16 frames and can pass indicies between 0 and 15. +user-space to kernel-space. The UMEM addrs are passed in the ring. As +an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has +16 chunks and can pass addrs between 0 and 64k. 
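For illustration (editorial sketch, not part of the patch): seeding the
fill ring for that 64k/4k example using the naive enqueue_one() shown in
the sample code further down, with "fq" as an assumed fill-ring instance:

	__u64 addr;

	/* Hand every 4k chunk of the 64k UMEM to the kernel; addrs are
	 * already chunk aligned, so no masking is needed here.
	 */
	for (addr = 0; addr < 64 * 1024; addr += 4096)
		if (enqueue_one(&fq, &addr))
			break;	/* ring full */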
Frames passed to the kernel are used for the ingress path (RX rings). -The user application produces UMEM indicies to this ring. +The user application produces UMEM addrs to this ring. Note that the +kernel will mask the incoming addr. E.g. for a chunk size of 2k, the +log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 +and 3000 refers to the same chunk. + UMEM Completetion Ring ~~~~~~~~~~~~~~~~~~~~~~ @@ -165,16 +169,15 @@ used. Frames passed from the kernel to user-space are frames that has been sent (TX ring) and can be used by user-space again. -The user application consumes UMEM indicies from this ring. +The user application consumes UMEM addrs from this ring. RX Ring ~~~~~~~ The RX ring is the receiving side of a socket. Each entry in the ring -is a struct xdp_desc descriptor. The descriptor contains UMEM index -(idx), the length of the data (len), the offset into the frame -(offset). +is a struct xdp_desc descriptor. The descriptor contains UMEM offset +(addr) and the length of the data (len). If no frames have been passed to kernel via the Fill ring, no descriptors will (or can) appear on the RX ring. @@ -221,38 +224,50 @@ side is xdpsock_user.c and the XDP side xdpsock_kern.c. Naive ring dequeue and enqueue could look like this:: + // struct xdp_rxtx_ring { + // __u32 *producer; + // __u32 *consumer; + // struct xdp_desc *desc; + // }; + + // struct xdp_umem_ring { + // __u32 *producer; + // __u32 *consumer; + // __u64 *desc; + // }; + // typedef struct xdp_rxtx_ring RING; // typedef struct xdp_umem_ring RING; // typedef struct xdp_desc RING_TYPE; - // typedef __u32 RING_TYPE; + // typedef __u64 RING_TYPE; int dequeue_one(RING *ring, RING_TYPE *item) { - __u32 entries = ring->ptrs.producer - ring->ptrs.consumer; + __u32 entries = *ring->producer - *ring->consumer; if (entries == 0) return -1; // read-barrier! - *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)]; - ring->ptrs.consumer++; + *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; + (*ring->consumer)++; return 0; } int enqueue_one(RING *ring, const RING_TYPE *item) { - u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer); + u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); if (free_entries == 0) return -1; - ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item; + ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; // write-barrier! 
- ring->ptrs.producer++; + (*ring->producer)++; return 0; } diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h index 4737cfe222f5..e411d6f9ac65 100644 --- a/include/uapi/linux/if_xdp.h +++ b/include/uapi/linux/if_xdp.h @@ -48,8 +48,8 @@ struct xdp_mmap_offsets { struct xdp_umem_reg { __u64 addr; /* Start of packet data area */ __u64 len; /* Length of packet data area */ - __u32 frame_size; /* Frame size */ - __u32 frame_headroom; /* Frame head room */ + __u32 chunk_size; + __u32 headroom; }; struct xdp_statistics { @@ -66,13 +66,11 @@ struct xdp_statistics { /* Rx/Tx descriptor */ struct xdp_desc { - __u32 idx; + __u64 addr; __u32 len; - __u16 offset; - __u8 flags; - __u8 padding[5]; + __u32 options; }; -/* UMEM descriptor is __u32 */ +/* UMEM descriptor is __u64 */ #endif /* _LINUX_IF_XDP_H */ diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index 87998818116f..9ad791ff4739 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -14,7 +14,7 @@ #include "xdp_umem.h" -#define XDP_UMEM_MIN_FRAME_SIZE 2048 +#define XDP_UMEM_MIN_CHUNK_SIZE 2048 static void xdp_umem_unpin_pages(struct xdp_umem *umem) { @@ -151,12 +151,12 @@ static int xdp_umem_account_pages(struct xdp_umem *umem) static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) { - u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom; + u32 chunk_size = mr->chunk_size, headroom = mr->headroom; + unsigned int chunks, chunks_per_page; u64 addr = mr->addr, size = mr->len; - unsigned int nframes, nfpp; int size_chk, err; - if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) { + if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) { /* Strictly speaking we could support this, if: * - huge pages, or* * - using an IOMMU, or @@ -166,7 +166,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) return -EINVAL; } - if (!is_power_of_2(frame_size)) + if (!is_power_of_2(chunk_size)) return -EINVAL; if (!PAGE_ALIGNED(addr)) { @@ -179,33 +179,30 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) if ((addr + size) < addr) return -EINVAL; - nframes = (unsigned int)div_u64(size, frame_size); - if (nframes == 0 || nframes > UINT_MAX) + chunks = (unsigned int)div_u64(size, chunk_size); + if (chunks == 0) return -EINVAL; - nfpp = PAGE_SIZE / frame_size; - if (nframes < nfpp || nframes % nfpp) + chunks_per_page = PAGE_SIZE / chunk_size; + if (chunks < chunks_per_page || chunks % chunks_per_page) return -EINVAL; - frame_headroom = ALIGN(frame_headroom, 64); + headroom = ALIGN(headroom, 64); - size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM; + size_chk = chunk_size - headroom - XDP_PACKET_HEADROOM; if (size_chk < 0) return -EINVAL; umem->pid = get_task_pid(current, PIDTYPE_PID); - umem->size = (size_t)size; umem->address = (unsigned long)addr; - umem->props.frame_size = frame_size; - umem->props.nframes = nframes; - umem->frame_headroom = frame_headroom; + umem->props.chunk_mask = ~((u64)chunk_size - 1); + umem->props.size = size; + umem->headroom = headroom; + umem->chunk_size_nohr = chunk_size - headroom; umem->npgs = size / PAGE_SIZE; umem->pgs = NULL; umem->user = NULL; - umem->frame_size_log2 = ilog2(frame_size); - umem->nfpp_mask = nfpp - 1; - umem->nfpplog2 = ilog2(nfpp); refcount_set(&umem->users, 1); err = xdp_umem_account_pages(umem); diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h index 0881cf456230..aeadd1bcb72d 100644 --- a/net/xdp/xdp_umem.h +++ b/net/xdp/xdp_umem.h @@ -18,35 +18,20 @@ struct 
xdp_umem { struct xsk_queue *cq; struct page **pgs; struct xdp_umem_props props; - u32 npgs; - u32 frame_headroom; - u32 nfpp_mask; - u32 nfpplog2; - u32 frame_size_log2; + u32 headroom; + u32 chunk_size_nohr; struct user_struct *user; struct pid *pid; unsigned long address; - size_t size; refcount_t users; struct work_struct work; + u32 npgs; }; -static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx) -{ - u64 pg, off; - char *data; - - pg = idx >> umem->nfpplog2; - off = (idx & umem->nfpp_mask) << umem->frame_size_log2; - - data = page_address(umem->pgs[pg]); - return data + off; -} - -static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem, - u32 idx) +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr) { - return xdp_umem_get_data(umem, idx) + umem->frame_headroom; + return page_address(umem->pgs[addr >> PAGE_SHIFT]) + + (addr & (PAGE_SIZE - 1)); } bool xdp_umem_validate_queues(struct xdp_umem *umem); diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h index 2cf8ec485fd2..40eab10dfc49 100644 --- a/net/xdp/xdp_umem_props.h +++ b/net/xdp/xdp_umem_props.h @@ -7,8 +7,8 @@ #define XDP_UMEM_PROPS_H_ struct xdp_umem_props { - u32 frame_size; - u32 nframes; + u64 chunk_mask; + u64 size; }; #endif /* XDP_UMEM_PROPS_H_ */ diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 966307ce4b8e..4688c750df1d 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -41,24 +41,27 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) { - u32 id, len = xdp->data_end - xdp->data; + u32 len = xdp->data_end - xdp->data; void *buffer; + u64 addr; int err; if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) return -EINVAL; - if (!xskq_peek_id(xs->umem->fq, &id)) { + if (!xskq_peek_addr(xs->umem->fq, &addr) || + len > xs->umem->chunk_size_nohr) { xs->rx_dropped++; return -ENOSPC; } - buffer = xdp_umem_get_data_with_headroom(xs->umem, id); + addr += xs->umem->headroom; + + buffer = xdp_umem_get_data(xs->umem, addr); memcpy(buffer, xdp->data, len); - err = xskq_produce_batch_desc(xs->rx, id, len, - xs->umem->frame_headroom); + err = xskq_produce_batch_desc(xs->rx, addr, len); if (!err) - xskq_discard_id(xs->umem->fq); + xskq_discard_addr(xs->umem->fq); else xs->rx_dropped++; @@ -95,10 +98,10 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) static void xsk_destruct_skb(struct sk_buff *skb) { - u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg; + u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg; struct xdp_sock *xs = xdp_sk(skb->sk); - WARN_ON_ONCE(xskq_produce_id(xs->umem->cq, id)); + WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr)); sock_wfree(skb); } @@ -123,14 +126,15 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m, while (xskq_peek_desc(xs->tx, &desc)) { char *buffer; - u32 id, len; + u64 addr; + u32 len; if (max_batch-- == 0) { err = -EAGAIN; goto out; } - if (xskq_reserve_id(xs->umem->cq)) { + if (xskq_reserve_addr(xs->umem->cq)) { err = -EAGAIN; goto out; } @@ -153,8 +157,8 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m, } skb_put(skb, len); - id = desc.idx; - buffer = xdp_umem_get_data(xs->umem, id) + desc.offset; + addr = desc.addr; + buffer = xdp_umem_get_data(xs->umem, addr); err = skb_store_bits(skb, 0, buffer, len); if (unlikely(err)) { kfree_skb(skb); @@ -164,7 +168,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m, skb->dev = xs->dev; skb->priority = sk->sk_priority; skb->mark = 
sk->sk_mark; - skb_shinfo(skb)->destructor_arg = (void *)(long)id; + skb_shinfo(skb)->destructor_arg = (void *)(long)addr; skb->destructor = xsk_destruct_skb; err = dev_direct_xmit(skb, xs->queue_id); diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c index ebe85e59507e..6c32e92e98fc 100644 --- a/net/xdp/xsk_queue.c +++ b/net/xdp/xsk_queue.c @@ -17,7 +17,7 @@ void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props) static u32 xskq_umem_get_ring_size(struct xsk_queue *q) { - return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32); + return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u64); } static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q) diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index b5924e7aeb2b..337e5ad3b10e 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -27,7 +27,7 @@ struct xdp_rxtx_ring { /* Used for the fill and completion queues for buffers */ struct xdp_umem_ring { struct xdp_ring ptrs; - u32 desc[0] ____cacheline_aligned_in_smp; + u64 desc[0] ____cacheline_aligned_in_smp; }; struct xsk_queue { @@ -76,24 +76,25 @@ static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt) /* UMEM queue */ -static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx) +static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr) { - if (unlikely(idx >= q->umem_props.nframes)) { + if (addr >= q->umem_props.size) { q->invalid_descs++; return false; } + return true; } -static inline u32 *xskq_validate_id(struct xsk_queue *q, u32 *id) +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr) { while (q->cons_tail != q->cons_head) { struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring; unsigned int idx = q->cons_tail & q->ring_mask; - *id = READ_ONCE(ring->desc[idx]); - if (xskq_is_valid_id(q, *id)) - return id; + *addr = READ_ONCE(ring->desc[idx]) & q->umem_props.chunk_mask; + if (xskq_is_valid_addr(q, *addr)) + return addr; q->cons_tail++; } @@ -101,7 +102,7 @@ static inline u32 *xskq_validate_id(struct xsk_queue *q, u32 *id) return NULL; } -static inline u32 *xskq_peek_id(struct xsk_queue *q, u32 *id) +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr) { if (q->cons_tail == q->cons_head) { WRITE_ONCE(q->ring->consumer, q->cons_tail); @@ -111,19 +112,19 @@ static inline u32 *xskq_peek_id(struct xsk_queue *q, u32 *id) smp_rmb(); } - return xskq_validate_id(q, id); + return xskq_validate_addr(q, addr); } -static inline void xskq_discard_id(struct xsk_queue *q) +static inline void xskq_discard_addr(struct xsk_queue *q) { q->cons_tail++; } -static inline int xskq_produce_id(struct xsk_queue *q, u32 id) +static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr) { struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring; - ring->desc[q->prod_tail++ & q->ring_mask] = id; + ring->desc[q->prod_tail++ & q->ring_mask] = addr; /* Order producer and data */ smp_wmb(); @@ -132,7 +133,7 @@ static inline int xskq_produce_id(struct xsk_queue *q, u32 id) return 0; } -static inline int xskq_reserve_id(struct xsk_queue *q) +static inline int xskq_reserve_addr(struct xsk_queue *q) { if (xskq_nb_free(q, q->prod_head, 1) == 0) return -ENOSPC; @@ -145,16 +146,11 @@ static inline int xskq_reserve_id(struct xsk_queue *q) static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d) { - u32 buff_len; - - if (unlikely(d->idx >= q->umem_props.nframes)) { - q->invalid_descs++; + if (!xskq_is_valid_addr(q, d->addr)) return false; - } - buff_len = 
q->umem_props.frame_size; - if (unlikely(d->len > buff_len || d->len == 0 || - d->offset > buff_len || d->offset + d->len > buff_len)) { + if (((d->addr + d->len) & q->umem_props.chunk_mask) != + (d->addr & q->umem_props.chunk_mask)) { q->invalid_descs++; return false; } @@ -199,7 +195,7 @@ static inline void xskq_discard_desc(struct xsk_queue *q) } static inline int xskq_produce_batch_desc(struct xsk_queue *q, - u32 id, u32 len, u16 offset) + u64 addr, u32 len) { struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring; unsigned int idx; @@ -208,9 +204,8 @@ static inline int xskq_produce_batch_desc(struct xsk_queue *q, return -ENOSPC; idx = (q->prod_head++) & q->ring_mask; - ring->desc[idx].idx = id; + ring->desc[idx].addr = addr; ring->desc[idx].len = len; - ring->desc[idx].offset = offset; return 0; } -- cgit v1.2.3 From 85d63445f41125dafeddda74e5b13b7eefac9407 Mon Sep 17 00:00:00 2001 From: Jeff Kirsher Date: Thu, 10 May 2018 12:20:13 -0700 Subject: Documentation: e100: Update the Intel 10/100 driver doc Over the years, several of the links have changed or are no longer valid so update them. In addition, the default values were incorrect for a couple of parameters. Converted the text file to the reStructuredText (RST) format, since the Linux kernel documentation now uses this format for documentation. Signed-off-by: Jeff Kirsher Tested-by: Aaron Brown --- Documentation/networking/e100.rst | 177 +++++++++++++++++++++++++++++++++++ Documentation/networking/e100.txt | 183 ------------------------------------- Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 179 insertions(+), 184 deletions(-) create mode 100644 Documentation/networking/e100.rst delete mode 100644 Documentation/networking/e100.txt (limited to 'Documentation') diff --git a/Documentation/networking/e100.rst b/Documentation/networking/e100.rst new file mode 100644 index 000000000000..d4d837027925 --- /dev/null +++ b/Documentation/networking/e100.rst @@ -0,0 +1,177 @@ +Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters +============================================================== + +June 1, 2018 + +Contents +======== + +- In This Release +- Identifying Your Adapter +- Building and Installation +- Driver Configuration Parameters +- Additional Configurations +- Known Issues +- Support + + +In This Release +=============== + +This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of +Adapters. This driver includes support for Itanium(R)2-based systems. + +For questions related to hardware requirements, refer to the documentation +supplied with your Intel PRO/100 adapter. + +The following features are now available in supported kernels: + - Native VLANs + - Channel Bonding (teaming) + - SNMP + +Channel Bonding documentation can be found in the Linux kernel source: +/Documentation/networking/bonding.txt + + +Identifying Your Adapter +======================== + +For information on how to identify your adapter, and for the latest Intel +network drivers, refer to the Intel Support website: +http://www.intel.com/support + +Driver Configuration Parameters +=============================== + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +Rx Descriptors: Number of receive descriptors. A receive descriptor is a data + structure that describes a receive buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to write + data from the controller to host memory. 
In the 3.x.x driver the valid range + for this parameter is 64-256. The default value is 256. This parameter can be + changed using the command:: + + ethtool -G eth? rx n + + Where n is the number of desired Rx descriptors. + +Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data + structure that describes a transmit buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to read + data from the host memory to the controller. In the 3.x.x driver the valid + range for this parameter is 64-256. The default value is 128. This parameter + can be changed using the command:: + + ethtool -G eth? tx n + + Where n is the number of desired Tx descriptors. + +Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by + default. The ethtool utility can be used as follows to force speed/duplex.:: + + ethtool -s eth? autoneg off speed {10|100} duplex {full|half} + + NOTE: setting the speed/duplex to incorrect values will cause the link to + fail. + +Event Log Message Level: The driver uses the message level flag to log events + to syslog. The message level can be set at driver load time. It can also be + set using the command:: + + ethtool -s eth? msglvl n + + +Additional Configurations +========================= + + Configuring the Driver on Different Distributions + ------------------------------------------------- + + Configuring a network driver to load properly when the system is started is + distribution dependent. Typically, the configuration process involves adding + an alias line to /etc/modprobe.d/*.conf as well as editing other system + startup scripts and/or configuration files. Many popular Linux + distributions ship with tools to make these changes for you. To learn the + proper way to configure a network device for your system, refer to your + distribution documentation. If during this process you are asked for the + driver or module name, the name for the Linux Base Driver for the Intel + PRO/100 Family of Adapters is e100. + + As an example, if you install the e100 driver for two PRO/100 adapters + (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/ + + alias eth0 e100 + alias eth1 e100 + + Viewing Link Messages + --------------------- + In order to see link messages and other Intel driver information on your + console, you must set the dmesg level up to six. This can be done by + entering the following on the command line before loading the e100 driver:: + + dmesg -n 6 + + If you wish to see all messages issued by the driver, including debug + messages, set the dmesg level to eight. + + NOTE: This setting is not saved across reboots. + + + ethtool + ------- + + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 1.6 or later is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool/ + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is provided through the ethtool* utility. For instructions on enabling + WoL with ethtool, refer to the ethtool man page. + + WoL will be enabled on the system during the next shut down or reboot. For + this driver version, in order to enable WoL, the e100 driver must be + loaded when shutting down or rebooting the system. + + NAPI + ---- + + NAPI (Rx polling mode) is supported in the e100 driver. 
+ + See https://wiki.linuxfoundation.org/networking/napi for more information + on NAPI. + + Multiple Interfaces on Same Ethernet Broadcast Network + ------------------------------------------------------ + + Due to the default ARP behavior on Linux, it is not possible to have + one system on two IP networks in the same Ethernet broadcast domain + (non-partitioned switch) behave as expected. All Ethernet interfaces + will respond to IP traffic for any IP address assigned to the system. + This results in unbalanced receive traffic. + + If you have multiple interfaces in a server, either turn on ARP + filtering by + + (1) entering:: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + (this only works if your kernel's version is higher than 2.4.5), or + + (2) installing the interfaces in separate broadcast domains (either + in different switches or in a switch partitioned to VLANs). + + +Support +======= +For general information, go to the Intel support website at: +http://www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: +http://sourceforge.net/projects/e1000 +If an issue is identified with the released source code on a supported kernel +with a supported adapter, email the specific information related to the issue +to e1000-devel@lists.sf.net. diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt deleted file mode 100644 index 54810b82c01a..000000000000 --- a/Documentation/networking/e100.txt +++ /dev/null @@ -1,183 +0,0 @@ -Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters -============================================================== - -March 15, 2011 - -Contents -======== - -- In This Release -- Identifying Your Adapter -- Building and Installation -- Driver Configuration Parameters -- Additional Configurations -- Known Issues -- Support - - -In This Release -=============== - -This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of -Adapters. This driver includes support for Itanium(R)2-based systems. - -For questions related to hardware requirements, refer to the documentation -supplied with your Intel PRO/100 adapter. - -The following features are now available in supported kernels: - - Native VLANs - - Channel Bonding (teaming) - - SNMP - -Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt - - -Identifying Your Adapter -======================== - -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: - - http://support.intel.com/support/network/adapter/pro100/21397.htm - -For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the -networking link on the left to search for your adapter: - - http://downloadfinder.intel.com/scripts-df/support_intel.asp - -Driver Configuration Parameters -=============================== - -The default value for each parameter is generally the recommended setting, -unless otherwise noted. - -Rx Descriptors: Number of receive descriptors. A receive descriptor is a data - structure that describes a receive buffer and its attributes to the network - controller. The data in the descriptor is used by the controller to write - data from the controller to host memory. In the 3.x.x driver the valid range - for this parameter is 64-256. The default value is 64. This parameter can be - changed using the command: - - ethtool -G eth? 
rx n, where n is the number of desired rx descriptors. - -Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data - structure that describes a transmit buffer and its attributes to the network - controller. The data in the descriptor is used by the controller to read - data from the host memory to the controller. In the 3.x.x driver the valid - range for this parameter is 64-256. The default value is 64. This parameter - can be changed using the command: - - ethtool -G eth? tx n, where n is the number of desired tx descriptors. - -Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by - default. The ethtool utility can be used as follows to force speed/duplex. - - ethtool -s eth? autoneg off speed {10|100} duplex {full|half} - - NOTE: setting the speed/duplex to incorrect values will cause the link to - fail. - -Event Log Message Level: The driver uses the message level flag to log events - to syslog. The message level can be set at driver load time. It can also be - set using the command: - - ethtool -s eth? msglvl n - - -Additional Configurations -========================= - - Configuring the Driver on Different Distributions - ------------------------------------------------- - - Configuring a network driver to load properly when the system is started is - distribution dependent. Typically, the configuration process involves adding - an alias line to /etc/modprobe.d/*.conf as well as editing other system - startup scripts and/or configuration files. Many popular Linux - distributions ship with tools to make these changes for you. To learn the - proper way to configure a network device for your system, refer to your - distribution documentation. If during this process you are asked for the - driver or module name, the name for the Linux Base Driver for the Intel - PRO/100 Family of Adapters is e100. - - As an example, if you install the e100 driver for two PRO/100 adapters - (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/ - - alias eth0 e100 - alias eth1 e100 - - Viewing Link Messages - --------------------- - In order to see link messages and other Intel driver information on your - console, you must set the dmesg level up to six. This can be done by - entering the following on the command line before loading the e100 driver: - - dmesg -n 8 - - If you wish to see all messages issued by the driver, including debug - messages, set the dmesg level to eight. - - NOTE: This setting is not saved across reboots. - - - ethtool - ------- - - The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. The ethtool - version 1.6 or later is required for this functionality. - - The latest release of ethtool can be found from - https://www.kernel.org/pub/software/network/ethtool/ - - Enabling Wake on LAN* (WoL) - --------------------------- - WoL is provided through the ethtool* utility. For instructions on enabling - WoL with ethtool, refer to the ethtool man page. - - WoL will be enabled on the system during the next shut down or reboot. For - this driver version, in order to enable WoL, the e100 driver must be - loaded when shutting down or rebooting the system. - - NAPI - ---- - - NAPI (Rx polling mode) is supported in the e100 driver. - - See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. 
- - Multiple Interfaces on Same Ethernet Broadcast Network - ------------------------------------------------------ - - Due to the default ARP behavior on Linux, it is not possible to have - one system on two IP networks in the same Ethernet broadcast domain - (non-partitioned switch) behave as expected. All Ethernet interfaces - will respond to IP traffic for any IP address assigned to the system. - This results in unbalanced receive traffic. - - If you have multiple interfaces in a server, either turn on ARP - filtering by - - (1) entering: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter - (this only works if your kernel's version is higher than 2.4.5), or - - (2) installing the interfaces in separate broadcast domains (either - in different switches or in a switch partitioned to VLANs). - - -Support -======= - -For general information, go to the Intel support website at: - - http://support.intel.com - - or the Intel Wired Networking project hosted by Sourceforge at: - - http://sourceforge.net/projects/e1000 - -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related to the -issue to e1000-devel@lists.sourceforge.net. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index cbd9bdd4a79e..d11a62977edd 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -10,6 +10,7 @@ Contents: batman-adv can dpaa2/index + e100 kapi z8530book msg_zerocopy diff --git a/MAINTAINERS b/MAINTAINERS index 0ae0dbf0e15e..d68981ca9896 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7089,7 +7089,7 @@ Q: http://patchwork.ozlabs.org/project/intel-wired-lan/list/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git S: Supported -F: Documentation/networking/e100.txt +F: Documentation/networking/e100.rst F: Documentation/networking/e1000.txt F: Documentation/networking/e1000e.txt F: Documentation/networking/igb.txt -- cgit v1.2.3 From 228046e76189ce542f15321b760e380551468017 Mon Sep 17 00:00:00 2001 From: Jeff Kirsher Date: Thu, 10 May 2018 12:55:38 -0700 Subject: Documentation: e1000: Update kernel documentation Updated the e1000.txt kernel documentation with the latest information. Also convert the text file to reStructuredText (RST) format, since the Linux kernel documentation now uses this format for documentation. Signed-off-by: Jeff Kirsher Tested-by: Aaron Brown --- Documentation/networking/e1000.rst | 422 +++++++++++++++++++++++++++++++++ Documentation/networking/e1000.txt | 461 ------------------------------------- Documentation/networking/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 424 insertions(+), 462 deletions(-) create mode 100644 Documentation/networking/e1000.rst delete mode 100644 Documentation/networking/e1000.txt (limited to 'Documentation') diff --git a/Documentation/networking/e1000.rst b/Documentation/networking/e1000.rst new file mode 100644 index 000000000000..616848940e63 --- /dev/null +++ b/Documentation/networking/e1000.rst @@ -0,0 +1,422 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. 
+ +Contents +======== + +- Identifying Your Adapter +- Command Line Parameters +- Speed and Duplex Configuration +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://support.intel.com/support/go/network/adapter/home.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +NOTES: For more information about the AutoNeg, Duplex, and Speed + parameters, see the "Speed and Duplex Configuration" section in + this document. + + For more information about the InterruptThrottleRate, + RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay + parameters, see the application note at: + http://www.intel.com/design/network/applnots/ap450.htm + +AutoNeg +------- +(Supported only on adapters with copper connections) +Valid Range: 0x01-0x0F, 0x20-0x2F +Default Value: 0x2F + +This parameter is a bit-mask that specifies the speed and duplex settings +advertised by the adapter. When this parameter is used, the Speed and +Duplex parameters must not be specified. + +NOTE: Refer to the Speed and Duplex section of this readme for more + information on the AutoNeg parameter. + +Duplex +------ +(Supported only on adapters with copper connections) +Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) +Default Value: 0 + +This defines the direction in which data is allowed to flow. Can be +either one or two-directional. If both Duplex and the link partner are +set to auto-negotiate, the board auto-detects the correct duplex. If the +link partner is forced (either full or half), Duplex defaults to half- +duplex. + +FlowControl +----------- +Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) +Default Value: Reads flow control settings from the EEPROM + +This parameter controls the automatic generation(Tx) and response(Rx) +to Ethernet PAUSE frames. + +InterruptThrottleRate +--------------------- +(not supported on Intel(R) 82542, 82543 or 82544-based adapters) +Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, + 4=simplified balancing) +Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types,but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. 
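+
+To make the trade-off concrete, a back-of-the-envelope example (the
+figures below assume minimum-size Ethernet frames at gigabit line
+rate and are illustrative, not taken from the driver)::
+
+    1 s / 8000 interrupts/s                  = 125 us between interrupts
+    1 Gb/s / 672 bits per min-size frame    ~= 1.49 Mpps
+      (64-byte frame + 20 bytes preamble/interframe gap = 84 bytes)
+    1.49 Mpps * 125 us                      ~= 186 packets per interrupt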
+ +Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the +interrupt rate will drop as low as 2000 interrupts per second. If the +traffic is mostly transmit or mostly receive, the interrupt rate could +be as high as 8000. + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. + +NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and + RxAbsIntDelay parameters. In other words, minimizing the receive + and/or transmit absolute delays does not force the controller to + generate more interrupts than what the Interrupt Throttle Rate + allows. + +CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection + (controller 82547), setting InterruptThrottleRate to a value + greater than 75,000, may hang (stop transmitting) adapters + under certain network conditions. If this occurs a NETDEV + WATCHDOG message is logged in the system event log. In + addition, the controller is automatically reset, restoring + the network connection. To eliminate the potential for the + hang, ensure that InterruptThrottleRate is set no greater + than 75,000 and is not set to 0. + +NOTE: When e1000 is loaded with default settings and multiple adapters + are in use simultaneously, the CPU utilization may increase non- + linearly. In order to limit the CPU utilization without impacting + the overall throughput, we recommend that you load the driver as + follows:: + + modprobe e1000 InterruptThrottleRate=3000,3000,3000 + + This sets the InterruptThrottleRate to 3000 interrupts/sec for + the first, second, and third instances of the driver. The range + of 2000 to 3000 interrupts per second works on a majority of + systems and is a good starting point, but the optimal value will + be platform-specific. If CPU utilization is not a concern, use + RX_POLLING (NAPI) and default driver settings. 
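+
+  One quick way to sanity-check the effective interrupt rate, assuming
+  the adapter's interrupt line appears under its interface name in
+  /proc/interrupts, is to sample the counter one second apart (on SMP
+  systems, sum the per-CPU columns)::
+
+    grep eth0 /proc/interrupts; sleep 1; grep eth0 /proc/interrupts
+
+  The difference between the two samples approximates the number of
+  interrupts per second.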
+ +RxDescriptors +------------- +Valid Range: 48-256 for 82542 and 82543-based adapters + 48-4096 for all other supported adapters +Default Value: 256 + +This value specifies the number of receive buffer descriptors allocated +by the driver. Increasing this value allows the driver to buffer more +incoming packets, at the expense of increased system memory utilization. + +Each descriptor is 16 bytes. A receive buffer is also allocated for each +descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending +on the MTU setting. The maximum MTU size is 16110. + +NOTE: MTU designates the frame size. It only needs to be set for Jumbo + Frames. Depending on the available system resources, the request + for a higher number of receive descriptors may be denied. In this + case, use a lower number. + +RxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 0 + +This value delays the generation of receive interrupts in units of 1.024 +microseconds. Receive interrupt reduction can improve CPU efficiency if +properly tuned for specific network traffic. Increasing this value adds +extra latency to frame reception and can end up decreasing the throughput +of TCP traffic. If the system is reporting dropped receives, this value +may be set too high, causing the driver to run out of available receive +descriptors. + +CAUTION: When setting RxIntDelay to a value other than 0, adapters may + hang (stop transmitting) under certain network conditions. If + this occurs a NETDEV WATCHDOG message is logged in the system + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential + for the hang ensure that RxIntDelay is set to 0. + +RxAbsIntDelay +------------- +(This parameter is supported only on 82540, 82545 and later adapters.) +Valid Range: 0-65535 (0=off) +Default Value: 128 + +This value, in units of 1.024 microseconds, limits the delay in which a +receive interrupt is generated. Useful only if RxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is received within the set amount of time. Proper tuning, +along with RxIntDelay, may improve traffic throughput in specific network +conditions. + +Speed +----- +(This parameter is supported only on adapters with copper connections.) +Valid Settings: 0, 10, 100, 1000 +Default Value: 0 (auto-negotiate at all supported speeds) + +Speed forces the line speed to the specified value in megabits per second +(Mbps). If this parameter is not specified or is set to 0 and the link +partner is set to auto-negotiate, the board will auto-detect the correct +speed. Duplex should also be set when Speed is set to either 10 or 100. + +TxDescriptors +------------- +Valid Range: 48-256 for 82542 and 82543-based adapters + 48-4096 for all other supported adapters +Default Value: 256 + +This value is the number of transmit descriptors allocated by the driver. +Increasing this value allows the driver to queue more transmits. Each +descriptor is 16 bytes. + +NOTE: Depending on the available system resources, the request for a + higher number of transmit descriptors may be denied. In this case, + use a lower number. + +TxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 8 + +This value delays the generation of transmit interrupts in units of +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. 
If the +system is reporting dropped transmits, this value may be set too high +causing the driver to run out of available transmit descriptors. + +TxAbsIntDelay +------------- +(This parameter is supported only on 82540, 82545 and later adapters.) +Valid Range: 0-65535 (0=off) +Default Value: 32 + +This value, in units of 1.024 microseconds, limits the delay in which a +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is sent on the wire within the set amount of time. Proper tuning, +along with TxIntDelay, may improve traffic throughput in specific +network conditions. + +XsumRX +------ +(This parameter is NOT supported on the 82542-based adapter.) +Valid Range: 0-1 +Default Value: 1 + +A value of '1' indicates that the driver should enable IP checksum +offload for received packets (both UDP and TCP) to the adapter hardware. + +Copybreak +--------- +Valid Range: 0-xxxxxxx (0=off) +Default Value: 256 +Usage: modprobe e1000.ko copybreak=128 + +Driver copies all packets below or equaling this size to a fresh RX +buffer before handing it up the stack. + +This parameter is different than other parameters, in that it is a +single (not 1,1,1 etc.) parameter applied to all driver instances and +it is also available during runtime at +/sys/module/e1000/parameters/copybreak + +SmartPowerDownEnable +-------------------- +Valid Range: 0-1 +Default Value: 0 (disabled) + +Allows PHY to turn off in lower power states. The user can turn off +this parameter in supported chipsets. + +Speed and Duplex Configuration +============================== + +Three keywords are used to control the speed and duplex configuration. +These keywords are Speed, Duplex, and AutoNeg. + +If the board uses a fiber interface, these keywords are ignored, and the +fiber interface board only links at 1000 Mbps full-duplex. + +For copper-based boards, the keywords interact as follows: + + The default operation is auto-negotiate. The board advertises all + supported speed and duplex combinations, and it links at the highest + common speed and duplex mode IF the link partner is set to auto-negotiate. + + If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps + is advertised (The 1000BaseT spec requires auto-negotiation.) + + If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- + negotiation is disabled, and the AutoNeg parameter is ignored. Partner + SHOULD also be forced. + +The AutoNeg parameter is used when more control is required over the +auto-negotiation process. It should be used when you wish to control which +speed and duplex combinations are advertised during the auto-negotiation +process. + +The parameter may be specified as either a decimal or hexadecimal value as +determined by the bitmap below. 
+ +Bit position 7 6 5 4 3 2 1 0 +Decimal Value 128 64 32 16 8 4 2 1 +Hex value 80 40 20 10 8 4 2 1 +Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10 +Duplex Full Full Half Full Half + +Some examples of using AutoNeg: + + modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half) + modprobe e1000 AutoNeg=1 (Same as above) + modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full) + modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full) + modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half) + modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100 + Half) + modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full) + modprobe e1000 AutoNeg=32 (Same as above) + +Note that when this parameter is used, Speed and Duplex must not be specified. + +If the link partner is forced to a specific speed and duplex, then this +parameter should not be used. Instead, use the Speed and Duplex parameters +previously mentioned to force the adapter to the same speed and duplex. + +Additional Configurations +========================= + + Jumbo Frames + ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example:: + + ifconfig eth mtu 9000 up + + This setting is not saved across reboots. It can be made permanent if + you add:: + + MTU=9000 + + to the file /etc/sysconfig/network-scripts/ifcfg-eth. This example + applies to the Red Hat distributions; other distributions may store this + setting in a different location. + + Notes: + Degradation in throughput performance may be observed in some Jumbo frames + environments. If this is observed, increasing the application's socket buffer + size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. + See the specific application manual and /usr/src/linux*/Documentation/ + networking/ip-sysctl.txt for more details. + + - The maximum MTU setting for Jumbo Frames is 16110. This value coincides + with the maximum Jumbo Frames size of 16128. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + - Adapters based on the Intel(R) 82542 and 82573V/E controller do not + support Jumbo Frames. These correspond to the following product names: + Intel(R) PRO/1000 Gigabit Server Adapter + Intel(R) PRO/1000 PM Network Connection + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 1.6 or later is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool/ + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is configured through the ethtool* utility. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the e1000 driver must be + loaded when shutting down or rebooting the system. 
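+
+  For example, to request wake-on-magic-packet and verify the result
+  (assuming the interface is named eth0 and the adapter advertises the
+  'g' wake option)::
+
+    ethtool -s eth0 wol g
+    ethtool eth0 | grep Wake-on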
+ +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt deleted file mode 100644 index 1f6ed848363d..000000000000 --- a/Documentation/networking/e1000.txt +++ /dev/null @@ -1,461 +0,0 @@ -Linux* Base Driver for Intel(R) Ethernet Network Connection -=========================================================== - -Intel Gigabit Linux driver. -Copyright(c) 1999 - 2013 Intel Corporation. - -Contents -======== - -- Identifying Your Adapter -- Command Line Parameters -- Speed and Duplex Configuration -- Additional Configurations -- Support - -Identifying Your Adapter -======================== - -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: - - http://support.intel.com/support/go/network/adapter/idguide.htm - -For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the -networking link on the left to search for your adapter: - - http://support.intel.com/support/go/network/adapter/home.htm - -Command Line Parameters -======================= - -The default value for each parameter is generally the recommended setting, -unless otherwise noted. - -NOTES: For more information about the AutoNeg, Duplex, and Speed - parameters, see the "Speed and Duplex Configuration" section in - this document. - - For more information about the InterruptThrottleRate, - RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay - parameters, see the application note at: - http://www.intel.com/design/network/applnots/ap450.htm - -AutoNeg -------- -(Supported only on adapters with copper connections) -Valid Range: 0x01-0x0F, 0x20-0x2F -Default Value: 0x2F - -This parameter is a bit-mask that specifies the speed and duplex settings -advertised by the adapter. When this parameter is used, the Speed and -Duplex parameters must not be specified. - -NOTE: Refer to the Speed and Duplex section of this readme for more - information on the AutoNeg parameter. - -Duplex ------- -(Supported only on adapters with copper connections) -Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) -Default Value: 0 - -This defines the direction in which data is allowed to flow. Can be -either one or two-directional. If both Duplex and the link partner are -set to auto-negotiate, the board auto-detects the correct duplex. If the -link partner is forced (either full or half), Duplex defaults to half- -duplex. - -FlowControl ------------ -Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) -Default Value: Reads flow control settings from the EEPROM - -This parameter controls the automatic generation(Tx) and response(Rx) -to Ethernet PAUSE frames. - -InterruptThrottleRate ---------------------- -(not supported on Intel(R) 82542, 82543 or 82544-based adapters) -Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, - 4=simplified balancing) -Default Value: 3 - -The driver can limit the amount of interrupts per second that the adapter -will generate for incoming packets. 
It does this by writing a value to the -adapter that is based on the maximum amount of interrupts that the adapter -will generate per second. - -Setting InterruptThrottleRate to a value greater or equal to 100 -will program the adapter to send out a maximum of that many interrupts -per second, even if more packets have come in. This reduces interrupt -load on the system and can lower CPU utilization under heavy load, -but will increase latency as packets are not processed as quickly. - -The default behaviour of the driver previously assumed a static -InterruptThrottleRate value of 8000, providing a good fallback value for -all traffic types,but lacking in small packet performance and latency. -The hardware can handle many more small packets per second however, and -for this reason an adaptive interrupt moderation algorithm was implemented. - -Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which -it dynamically adjusts the InterruptThrottleRate value based on the traffic -that it receives. After determining the type of incoming traffic in the last -timeframe, it will adjust the InterruptThrottleRate to an appropriate value -for that traffic. - -The algorithm classifies the incoming traffic every interval into -classes. Once the class is determined, the InterruptThrottleRate value is -adjusted to suit that traffic type the best. There are three classes defined: -"Bulk traffic", for large amounts of packets of normal size; "Low latency", -for small amounts of traffic and/or a significant percentage of small -packets; and "Lowest latency", for almost completely small packets or -minimal traffic. - -In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 -for traffic that falls in class "Bulk traffic". If traffic falls in the "Low -latency" or "Lowest latency" class, the InterruptThrottleRate is increased -stepwise to 20000. This default mode is suitable for most applications. - -For situations where low latency is vital such as cluster or -grid computing, the algorithm can reduce latency even more when -InterruptThrottleRate is set to mode 1. In this mode, which operates -the same as mode 3, the InterruptThrottleRate will be increased stepwise to -70000 for traffic in class "Lowest latency". - -In simplified mode the interrupt rate is based on the ratio of TX and -RX traffic. If the bytes per second rate is approximately equal, the -interrupt rate will drop as low as 2000 interrupts per second. If the -traffic is mostly transmit or mostly receive, the interrupt rate could -be as high as 8000. - -Setting InterruptThrottleRate to 0 turns off any interrupt moderation -and may improve small packet latency, but is generally not suitable -for bulk throughput traffic. - -NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and - RxAbsIntDelay parameters. In other words, minimizing the receive - and/or transmit absolute delays does not force the controller to - generate more interrupts than what the Interrupt Throttle Rate - allows. - -CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection - (controller 82547), setting InterruptThrottleRate to a value - greater than 75,000, may hang (stop transmitting) adapters - under certain network conditions. If this occurs a NETDEV - WATCHDOG message is logged in the system event log. In - addition, the controller is automatically reset, restoring - the network connection. 
To eliminate the potential for the - hang, ensure that InterruptThrottleRate is set no greater - than 75,000 and is not set to 0. - -NOTE: When e1000 is loaded with default settings and multiple adapters - are in use simultaneously, the CPU utilization may increase non- - linearly. In order to limit the CPU utilization without impacting - the overall throughput, we recommend that you load the driver as - follows: - - modprobe e1000 InterruptThrottleRate=3000,3000,3000 - - This sets the InterruptThrottleRate to 3000 interrupts/sec for - the first, second, and third instances of the driver. The range - of 2000 to 3000 interrupts per second works on a majority of - systems and is a good starting point, but the optimal value will - be platform-specific. If CPU utilization is not a concern, use - RX_POLLING (NAPI) and default driver settings. - -RxDescriptors -------------- -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters -Default Value: 256 - -This value specifies the number of receive buffer descriptors allocated -by the driver. Increasing this value allows the driver to buffer more -incoming packets, at the expense of increased system memory utilization. - -Each descriptor is 16 bytes. A receive buffer is also allocated for each -descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending -on the MTU setting. The maximum MTU size is 16110. - -NOTE: MTU designates the frame size. It only needs to be set for Jumbo - Frames. Depending on the available system resources, the request - for a higher number of receive descriptors may be denied. In this - case, use a lower number. - -RxIntDelay ----------- -Valid Range: 0-65535 (0=off) -Default Value: 0 - -This value delays the generation of receive interrupts in units of 1.024 -microseconds. Receive interrupt reduction can improve CPU efficiency if -properly tuned for specific network traffic. Increasing this value adds -extra latency to frame reception and can end up decreasing the throughput -of TCP traffic. If the system is reporting dropped receives, this value -may be set too high, causing the driver to run out of available receive -descriptors. - -CAUTION: When setting RxIntDelay to a value other than 0, adapters may - hang (stop transmitting) under certain network conditions. If - this occurs a NETDEV WATCHDOG message is logged in the system - event log. In addition, the controller is automatically reset, - restoring the network connection. To eliminate the potential - for the hang ensure that RxIntDelay is set to 0. - -RxAbsIntDelay -------------- -(This parameter is supported only on 82540, 82545 and later adapters.) -Valid Range: 0-65535 (0=off) -Default Value: 128 - -This value, in units of 1.024 microseconds, limits the delay in which a -receive interrupt is generated. Useful only if RxIntDelay is non-zero, -this value ensures that an interrupt is generated after the initial -packet is received within the set amount of time. Proper tuning, -along with RxIntDelay, may improve traffic throughput in specific network -conditions. - -Speed ------ -(This parameter is supported only on adapters with copper connections.) -Valid Settings: 0, 10, 100, 1000 -Default Value: 0 (auto-negotiate at all supported speeds) - -Speed forces the line speed to the specified value in megabits per second -(Mbps). If this parameter is not specified or is set to 0 and the link -partner is set to auto-negotiate, the board will auto-detect the correct -speed. 
Duplex should also be set when Speed is set to either 10 or 100. - -TxDescriptors -------------- -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters -Default Value: 256 - -This value is the number of transmit descriptors allocated by the driver. -Increasing this value allows the driver to queue more transmits. Each -descriptor is 16 bytes. - -NOTE: Depending on the available system resources, the request for a - higher number of transmit descriptors may be denied. In this case, - use a lower number. - -TxDescriptorStep ----------------- -Valid Range: 1 (use every Tx Descriptor) - 4 (use every 4th Tx Descriptor) - -Default Value: 1 (use every Tx Descriptor) - -On certain non-Intel architectures, it has been observed that intense TX -traffic bursts of short packets may result in an improper descriptor -writeback. If this occurs, the driver will report a "TX Timeout" and reset -the adapter, after which the transmit flow will restart, though data may -have stalled for as much as 10 seconds before it resumes. - -The improper writeback does not occur on the first descriptor in a system -memory cache-line, which is typically 32 bytes, or 4 descriptors long. - -Setting TxDescriptorStep to a value of 4 will ensure that all TX descriptors -are aligned to the start of a system memory cache line, and so this problem -will not occur. - -NOTES: Setting TxDescriptorStep to 4 effectively reduces the number of - TxDescriptors available for transmits to 1/4 of the normal allocation. - This has a possible negative performance impact, which may be - compensated for by allocating more descriptors using the TxDescriptors - module parameter. - - There are other conditions which may result in "TX Timeout", which will - not be resolved by the use of the TxDescriptorStep parameter. As the - issue addressed by this parameter has never been observed on Intel - Architecture platforms, it should not be used on Intel platforms. - -TxIntDelay ----------- -Valid Range: 0-65535 (0=off) -Default Value: 64 - -This value delays the generation of transmit interrupts in units of -1.024 microseconds. Transmit interrupt reduction can improve CPU -efficiency if properly tuned for specific network traffic. If the -system is reporting dropped transmits, this value may be set too high -causing the driver to run out of available transmit descriptors. - -TxAbsIntDelay -------------- -(This parameter is supported only on 82540, 82545 and later adapters.) -Valid Range: 0-65535 (0=off) -Default Value: 64 - -This value, in units of 1.024 microseconds, limits the delay in which a -transmit interrupt is generated. Useful only if TxIntDelay is non-zero, -this value ensures that an interrupt is generated after the initial -packet is sent on the wire within the set amount of time. Proper tuning, -along with TxIntDelay, may improve traffic throughput in specific -network conditions. - -XsumRX ------- -(This parameter is NOT supported on the 82542-based adapter.) -Valid Range: 0-1 -Default Value: 1 - -A value of '1' indicates that the driver should enable IP checksum -offload for received packets (both UDP and TCP) to the adapter hardware. - -Copybreak ---------- -Valid Range: 0-xxxxxxx (0=off) -Default Value: 256 -Usage: insmod e1000.ko copybreak=128 - -Driver copies all packets below or equaling this size to a fresh RX -buffer before handing it up the stack. - -This parameter is different than other parameters, in that it is a -single (not 1,1,1 etc.) 
parameter applied to all driver instances and -it is also available during runtime at -/sys/module/e1000/parameters/copybreak - -SmartPowerDownEnable --------------------- -Valid Range: 0-1 -Default Value: 0 (disabled) - -Allows PHY to turn off in lower power states. The user can turn off -this parameter in supported chipsets. - -KumeranLockLoss ---------------- -Valid Range: 0-1 -Default Value: 1 (enabled) - -This workaround skips resetting the PHY at shutdown for the initial -silicon releases of ICH8 systems. - -Speed and Duplex Configuration -============================== - -Three keywords are used to control the speed and duplex configuration. -These keywords are Speed, Duplex, and AutoNeg. - -If the board uses a fiber interface, these keywords are ignored, and the -fiber interface board only links at 1000 Mbps full-duplex. - -For copper-based boards, the keywords interact as follows: - - The default operation is auto-negotiate. The board advertises all - supported speed and duplex combinations, and it links at the highest - common speed and duplex mode IF the link partner is set to auto-negotiate. - - If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps - is advertised (The 1000BaseT spec requires auto-negotiation.) - - If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- - negotiation is disabled, and the AutoNeg parameter is ignored. Partner - SHOULD also be forced. - -The AutoNeg parameter is used when more control is required over the -auto-negotiation process. It should be used when you wish to control which -speed and duplex combinations are advertised during the auto-negotiation -process. - -The parameter may be specified as either a decimal or hexadecimal value as -determined by the bitmap below. - -Bit position 7 6 5 4 3 2 1 0 -Decimal Value 128 64 32 16 8 4 2 1 -Hex value 80 40 20 10 8 4 2 1 -Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10 -Duplex Full Full Half Full Half - -Some examples of using AutoNeg: - - modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half) - modprobe e1000 AutoNeg=1 (Same as above) - modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full) - modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full) - modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half) - modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100 - Half) - modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full) - modprobe e1000 AutoNeg=32 (Same as above) - -Note that when this parameter is used, Speed and Duplex must not be specified. - -If the link partner is forced to a specific speed and duplex, then this -parameter should not be used. Instead, use the Speed and Duplex parameters -previously mentioned to force the adapter to the same speed and duplex. - -Additional Configurations -========================= - - Jumbo Frames - ------------ - Jumbo Frames support is enabled by changing the MTU to a value larger than - the default of 1500. Use the ifconfig command to increase the MTU size. - For example: - - ifconfig eth mtu 9000 up - - This setting is not saved across reboots. It can be made permanent if - you add: - - MTU=9000 - - to the file /etc/sysconfig/network-scripts/ifcfg-eth. This example - applies to the Red Hat distributions; other distributions may store this - setting in a different location. - - Notes: - Degradation in throughput performance may be observed in some Jumbo frames - environments. 
If this is observed, increasing the application's socket buffer - size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. - See the specific application manual and /usr/src/linux*/Documentation/ - networking/ip-sysctl.txt for more details. - - - The maximum MTU setting for Jumbo Frames is 16110. This value coincides - with the maximum Jumbo Frames size of 16128. - - - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in - poor performance or loss of link. - - - Adapters based on the Intel(R) 82542 and 82573V/E controller do not - support Jumbo Frames. These correspond to the following product names: - Intel(R) PRO/1000 Gigabit Server Adapter - Intel(R) PRO/1000 PM Network Connection - - ethtool - ------- - The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. The ethtool - version 1.6 or later is required for this functionality. - - The latest release of ethtool can be found from - https://www.kernel.org/pub/software/network/ethtool/ - - Enabling Wake on LAN* (WoL) - --------------------------- - WoL is configured through the ethtool* utility. - - WoL will be enabled on the system during the next shut down or reboot. - For this driver version, in order to enable WoL, the e1000 driver must be - loaded when shutting down or rebooting the system. - -Support -======= - -For general information, go to the Intel support website at: - - http://support.intel.com - -or the Intel Wired Networking project hosted by Sourceforge at: - - http://sourceforge.net/projects/e1000 - -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related -to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d11a62977edd..fec8588a588e 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -11,6 +11,7 @@ Contents: can dpaa2/index e100 + e1000 kapi z8530book msg_zerocopy diff --git a/MAINTAINERS b/MAINTAINERS index d68981ca9896..32472fbf4d6e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7090,7 +7090,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git S: Supported F: Documentation/networking/e100.rst -F: Documentation/networking/e1000.txt +F: Documentation/networking/e1000.rst F: Documentation/networking/e1000e.txt F: Documentation/networking/igb.txt F: Documentation/networking/igbvf.txt -- cgit v1.2.3 From 79e9fed460385a3d8ba0b5782e9e74405cb199b1 Mon Sep 17 00:00:00 2001 From: Maciej Żenczykowski Date: Sun, 3 Jun 2018 10:41:17 -0700 Subject: net-tcp: extend tcp_tw_reuse sysctl to enable loopback only optimization MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This changes the /proc/sys/net/ipv4/tcp_tw_reuse from a boolean to an integer. It now takes the values 0, 1 and 2, where 0 and 1 behave as before, while 2 enables timewait socket reuse only for sockets that we can prove are loopback connections: ie. bound to 'lo' interface or where one of source or destination IPs is 127.0.0.0/8, ::ffff:127.0.0.0/104 or ::1. This enables quicker reuse of ephemeral ports for loopback connections - where tcp_tw_reuse is 100% safe from a protocol perspective (this assumes no artificially induced packet loss on 'lo'). 
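For example, the new loopback-only mode would be selected through the
usual sysctl interface (illustrative usage only, not part of this patch;
the sysctl.d file name is an arbitrary example):

    # runtime: reuse TIME-WAIT sockets only for provably-loopback connections
    sysctl -w net.ipv4.tcp_tw_reuse=2

    # persistent across reboots, e.g. in /etc/sysctl.d/90-tcp-tw-reuse.conf:
    #   net.ipv4.tcp_tw_reuse = 2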
This also makes establishing many loopback connections *much* faster (allocating ports out of the first half of the ephemeral port range is significantly faster than allocating from the second half). Without this change, in a 32K ephemeral port space, my sample program (it just establishes and closes [::1]:ephemeral -> [::1]:server_port connections in a tight loop) fails after 32765 connections in 24 seconds. With it enabled, 50000 connections take only 4.7 seconds. This is particularly problematic for IPv6, where we have only one local address and cannot play tricks by varying the source IP from the 127.0.0.0/8 pool. Signed-off-by: Maciej Żenczykowski Cc: Neal Cardwell Cc: Yuchung Cheng Cc: Wei Wang Change-Id: I0377961749979d0301b7b62871a32a4b34b654e1 Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 10 +++++++--- net/ipv4/sysctl_net_ipv4.c | 5 ++++- net/ipv4/tcp_ipv4.c | 35 +++++++++++++++++++++++++++++++--- 3 files changed, 43 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 924bd51327b7..6841c74eac00 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -667,11 +667,15 @@ tcp_tso_win_divisor - INTEGER building larger TSO frames. Default: 3 -tcp_tw_reuse - BOOLEAN - Allow to reuse TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. Default value is 0. +tcp_tw_reuse - INTEGER + Enable reuse of TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. + 0 - disable + 1 - global enable + 2 - enable for loopback traffic only It should not be changed without advice/request of technical experts. + Default: 2 tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index d2eed3ddcb0a..d06247ba08b2 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -30,6 +30,7 @@ static int zero; static int one = 1; +static int two = 2; static int four = 4; static int thousand = 1000; static int gso_max_segs = GSO_MAX_SEGS; @@ -845,7 +846,9 @@ static struct ctl_table ipv4_net_table[] = { .data = &init_net.ipv4.sysctl_tcp_tw_reuse, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &two, }, { .procname = "tcp_max_tw_buckets", diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 749b0ef9f405..633963e228bc 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -110,8 +110,38 @@ static u32 tcp_v4_init_ts_off(const struct net *net, const struct sk_buff *skb) int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) { + const struct inet_timewait_sock *tw = inet_twsk(sktw); const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw); struct tcp_sock *tp = tcp_sk(sk); + int reuse = sock_net(sk)->ipv4.sysctl_tcp_tw_reuse; + + if (reuse == 2) { + /* Still does not detect *everything* that goes through + * lo, since we require a loopback src or dst address + * or direct binding to 'lo' interface.
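+	 * (e.g. a socket bound to 'lo' with SO_BINDTODEVICE is caught
+	 * by the tw_bound_dev_if == LOOPBACK_IFINDEX test below)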
+ */ + bool loopback = false; + if (tw->tw_bound_dev_if == LOOPBACK_IFINDEX) + loopback = true; +#if IS_ENABLED(CONFIG_IPV6) + if (tw->tw_family == AF_INET6) { + if (ipv6_addr_loopback(&tw->tw_v6_daddr) || + (ipv6_addr_v4mapped(&tw->tw_v6_daddr) && + (tw->tw_v6_daddr.s6_addr[12] == 127)) || + ipv6_addr_loopback(&tw->tw_v6_rcv_saddr) || + (ipv6_addr_v4mapped(&tw->tw_v6_rcv_saddr) && + (tw->tw_v6_rcv_saddr.s6_addr[12] == 127))) + loopback = true; + } else +#endif + { + if (ipv4_is_loopback(tw->tw_daddr) || + ipv4_is_loopback(tw->tw_rcv_saddr)) + loopback = true; + } + if (!loopback) + reuse = 0; + } /* With PAWS, it is safe from the viewpoint of data integrity. Even without PAWS it is safe provided sequence @@ -125,8 +155,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) and use initial timestamp retrieved from peer table. */ if (tcptw->tw_ts_recent_stamp && - (!twp || (sock_net(sk)->ipv4.sysctl_tcp_tw_reuse && - get_seconds() - tcptw->tw_ts_recent_stamp > 1))) { + (!twp || (reuse && get_seconds() - tcptw->tw_ts_recent_stamp > 1))) { tp->write_seq = tcptw->tw_snd_nxt + 65535 + 2; if (tp->write_seq == 0) tp->write_seq = 1; @@ -2529,7 +2558,7 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.sysctl_tcp_orphan_retries = 0; net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT; net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX; - net->ipv4.sysctl_tcp_tw_reuse = 0; + net->ipv4.sysctl_tcp_tw_reuse = 2; cnt = tcp_hashinfo.ehash_mask + 1; net->ipv4.tcp_death_row.sysctl_max_tw_buckets = (cnt + 1) / 2; -- cgit v1.2.3 From bb38ccce887c29a3ca78b8bd105c168c811766d9 Mon Sep 17 00:00:00 2001 From: Olivier Gayot Date: Mon, 4 Jun 2018 12:07:37 +0200 Subject: docs: networking: fix minor typos in various documentation files This patch fixes some typos/misspelling errors in the Documentation/networking files. Signed-off-by: Olivier Gayot Signed-off-by: David S. Miller --- Documentation/networking/6lowpan.txt | 4 ++-- Documentation/networking/gtp.txt | 4 ++-- Documentation/networking/ila.txt | 2 +- Documentation/networking/ip-sysctl.txt | 2 +- Documentation/networking/ipsec.txt | 4 ++-- Documentation/networking/ipvlan.txt | 4 ++-- Documentation/networking/kcm.txt | 10 +++++----- Documentation/networking/nf_conntrack-sysctl.txt | 2 +- 8 files changed, 16 insertions(+), 16 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.txt index a7dc7e939c7a..2e5a939d7e6f 100644 --- a/Documentation/networking/6lowpan.txt +++ b/Documentation/networking/6lowpan.txt @@ -24,10 +24,10 @@ enum lowpan_lltypes. 
Example to evaluate the private usually you can do: -static inline sturct lowpan_priv_foobar * +static inline struct lowpan_priv_foobar * lowpan_foobar_priv(struct net_device *dev) { - return (sturct lowpan_priv_foobar *)lowpan_priv(dev)->priv; + return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv; } switch (dev->type) { diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.txt index 0d9c18f05ec6..6966bbec1ecb 100644 --- a/Documentation/networking/gtp.txt +++ b/Documentation/networking/gtp.txt @@ -67,7 +67,7 @@ Don't be confused by terminology: The GTP User Plane goes through kernel accelerated path, while the GTP Control Plane goes to Userspace :) -The official homepge of the module is at +The official homepage of the module is at https://osmocom.org/projects/linux-kernel-gtp-u/wiki == Userspace Programs with Linux Kernel GTP-U support == @@ -120,7 +120,7 @@ If yo have questions regarding how to use the Kernel GTP module from your own software, or want to contribute to the code, please use the osmocom-net-grps mailing list for related discussion. The list can be reached at osmocom-net-gprs@lists.osmocom.org and the mailman -interface for managign your subscription is at +interface for managing your subscription is at https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs == Issue Tracker == diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.txt index 78df879abd26..a17dac9dc915 100644 --- a/Documentation/networking/ila.txt +++ b/Documentation/networking/ila.txt @@ -121,7 +121,7 @@ three options to deal with this: - checksum neutral mapping When an address is translated the difference can be offset - elsewhere in a part of the packet that is covered by the + elsewhere in a part of the packet that is covered by the checksum. The low order sixteen bits of the identifier are used. This method is preferred since it doesn't require parsing a packet beyond the IP header and in most cases the diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6841c74eac00..ce8fbf5aa63c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -26,7 +26,7 @@ ip_no_pmtu_disc - INTEGER discarded. Outgoing frames are handled the same as in mode 1, implicitly setting IP_PMTUDISC_DONT on every created socket. - Mode 3 is a hardend pmtu discover mode. The kernel will only + Mode 3 is a hardened pmtu discover mode. The kernel will only accept fragmentation-needed errors if the underlying protocol can verify them besides a plain socket lookup. Current protocols for which pmtu events will be honored are TCP, SCTP diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt index 8dbc08b7e431..ba794b7e51be 100644 --- a/Documentation/networking/ipsec.txt +++ b/Documentation/networking/ipsec.txt @@ -25,8 +25,8 @@ Quote from RFC3173: is implementation dependent. Current IPComp implementation is indeed by the book, while as in practice -when sending non-compressed packet to the peer(whether or not packet len -is smaller than the threshold or the compressed len is large than original +when sending non-compressed packet to the peer (whether or not packet len +is smaller than the threshold or the compressed len is larger than original packet len), the packet is dropped when checking the policy as this packet matches the selector but not coming from any XFRM layer, i.e., with no security path. Such naked packet will not eventually make it to upper layer. 
diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 812ef003e0a8..27a38e50c287 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -73,11 +73,11 @@ mode to make conn-tracking work. This is the default option. To configure the IPvlan port in this mode, user can choose to either add this option on the command-line or don't specify anything. This is the traditional mode where slaves can cross-talk among -themseleves apart from talking through the master device. +themselves apart from talking through the master device. 5.2 private: If this option is added to the command-line, the port is set in private -mode. i.e. port wont allow cross communication between slaves. +mode. i.e. port won't allow cross communication between slaves. 5.3 vepa: If this is added to the command-line, the port is set in VEPA mode. diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt index 9a513295b07c..b773a5278ac4 100644 --- a/Documentation/networking/kcm.txt +++ b/Documentation/networking/kcm.txt @@ -1,4 +1,4 @@ -Kernel Connection Mulitplexor +Kernel Connection Multiplexor ----------------------------- Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based @@ -31,7 +31,7 @@ KCM implements an NxM multiplexor in the kernel as diagrammed below: KCM sockets ----------- -The KCM sockets provide the user interface to the muliplexor. All the KCM sockets +The KCM sockets provide the user interface to the multiplexor. All the KCM sockets bound to a multiplexor are considered to have equivalent function, and I/O operations in different sockets may be done in parallel without the need for synchronization between threads in userspace. @@ -199,7 +199,7 @@ while. Example use: BFP programs for message delineation ------------------------------------ -BPF programs can be compiled using the BPF LLVM backend. For exmple, +BPF programs can be compiled using the BPF LLVM backend. For example, the BPF program for parsing Thrift is: #include "bpf.h" /* for __sk_buff */ @@ -222,7 +222,7 @@ messages. The kernel provides necessary assurances that messages are sent and received atomically. This relieves much of the burden applications have in mapping a message based protocol onto the TCP stream. KCM also make application layer messages a unit of work in the kernel for the purposes of -steerng and scheduling, which in turn allows a simpler networking model in +steering and scheduling, which in turn allows a simpler networking model in multithreaded applications. Configurations @@ -272,7 +272,7 @@ on the socket thus waking up the application thread. When the application sees the error (which may just be a disconnect) it should unattach the socket from KCM and then close it. It is assumed that once an error is posted on the TCP socket the data stream is unrecoverable (i.e. an error -may have occurred in the middle of receiving a messssge). +may have occurred in the middle of receiving a message). 
TCP connection monitoring ------------------------- diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt index 433b6724797a..1669dc2419fd 100644 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -156,7 +156,7 @@ nf_conntrack_timestamp - BOOLEAN nf_conntrack_udp_timeout - INTEGER (seconds) default 30 -nf_conntrack_udp_timeout_stream2 - INTEGER (seconds) +nf_conntrack_udp_timeout_stream - INTEGER (seconds) default 180 This extended timeout will be used in case there is an UDP stream -- cgit v1.2.3 From 75d4e704fa8d2cf33ff295e5b441317603d7f9fd Mon Sep 17 00:00:00 2001 From: Cong Wang Date: Tue, 5 Jun 2018 09:48:13 -0700 Subject: netdev-FAQ: clarify DaveM's position for stable backports Per discussion with David at netconf 2018, let's clarify DaveM's position on handling stable backports in the netdev-FAQ. This is important for people relying on upstream -stable releases. Cc: Greg Kroah-Hartman Signed-off-by: Cong Wang Signed-off-by: David S. Miller --- Documentation/networking/netdev-FAQ.txt | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt index 2a3278d5cf35..fa951b820b25 100644 --- a/Documentation/networking/netdev-FAQ.txt +++ b/Documentation/networking/netdev-FAQ.txt @@ -179,6 +179,15 @@ A: No. See above answer. In short, if you think it really belongs in dash marker line as described in Documentation/process/submitting-patches.rst to temporarily embed that information into the patch that you send. +Q: Are all networking bug fixes backported to all stable releases? + +A: Due to capacity, Dave could only take care of the backports for the last + 2 stable releases. For earlier stable releases, each stable branch maintainer + is supposed to take care of them. If you find any patch is missing from an + earlier stable branch, please notify stable@vger.kernel.org with either a + commit ID or a formal patch backported, and CC Dave and other relevant + networking developers. + Q: Someone said that the comment style and coding convention is different for the networking content. Is this true? -- cgit v1.2.3
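The FAQ entry above asks reporters to supply a commit ID when a fix is
missing from an earlier stable branch. A minimal sketch of how one might
check that locally, assuming a clone with the stable tags fetched (the
v4.9.108 tag and the grep pattern are only examples):

    # Stable backports are cherry-picks with new commit IDs, so search
    # the stable range by subject line rather than by the mainline SHA:
    git log --oneline v4.9..v4.9.108 --grep='net-tcp: extend tcp_tw_reuse'

    # The mainline SHA itself is an ancestor of the stable tag only if it
    # was already in history when the stable branch forked:
    git merge-base --is-ancestor 79e9fed460385a3d8ba0b5782e9e74405cb199b1 v4.9.108 && echo present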