path: root/net/sched
Age | Commit message | Author | Files | Lines
2019-03-02 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 4 | -8/+11
2019-02-28 | net: sched: pie: avoid slow division in drop probability decay | Leslie Monis | 1 | -1/+2
As per RFC 8033, it is sufficient for the drop probability decay factor to have a value of (1 - 1/64) instead of 98%. This avoids the need to do slow division. Suggested-by: David Laight <David.Laight@aculab.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
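The arithmetic behind this change can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: the function names are hypothetical, and `prob` stands for a fixed-point drop probability. Multiplying by (1 - 1/64) is the same as subtracting `prob / 64`, and dividing by the constant 64 reduces to a right shift by 6.

```c
#include <stdint.h>

/* Hypothetical helper names; prob is a fixed-point drop probability. */

/* Decay by 98%: needs a multiplication and a 64-bit division by 100. */
static uint64_t decay_98_percent(uint64_t prob)
{
    return prob * 98 / 100;
}

/*
 * Decay by (1 - 1/64) = 0.984375: subtract prob / 64.
 * Dividing by the constant 64 is a right shift by 6, so no
 * division instruction or helper is required.
 */
static uint64_t decay_one_sixty_fourth(uint64_t prob)
{
    return prob - (prob >> 6);
}
```

For `prob = 6400` the two variants give 6272 and 6300 respectively, so the shift-based factor decays slightly more slowly than 98%.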
2019-02-28 | net: netem: fix skb length BUG_ON in __skb_to_sgvec | Sheng Lan | 1 | -3/+7
It can be reproduced by following steps: 1. virtio_net NIC is configured with gso/tso on 2. configure nginx as http server with an index file bigger than 1M bytes 3. use tc netem to produce duplicate packets and delay: tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90% 4. continually curl the nginx http server to get index file on client 5. BUG_ON is seen quickly [10258690.371129] kernel BUG at net/core/skbuff.c:4028! [10258690.371748] invalid opcode: 0000 [#1] SMP PTI [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2 [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202 [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002 [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028 [10258690.372094] FS: 0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000 [10258690.372094] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0 [10258690.372094] Call Trace: [10258690.372094] <IRQ> [10258690.372094] skb_to_sgvec+0x11/0x40 [10258690.372094] start_xmit+0x38c/0x520 [virtio_net] [10258690.372094] dev_hard_start_xmit+0x9b/0x200 [10258690.372094] sch_direct_xmit+0xff/0x260 [10258690.372094] __qdisc_run+0x15e/0x4e0 [10258690.372094] net_tx_action+0x137/0x210 [10258690.372094] __do_softirq+0xd6/0x2a9 [10258690.372094] irq_exit+0xde/0xf0 [10258690.372094] smp_apic_timer_interrupt+0x74/0x140 [10258690.372094] apic_timer_interrupt+0xf/0x20 [10258690.372094] </IRQ> In __skb_to_sgvec(), the skb->len is not equal to the sum of the skb's linear data size and nonlinear data size, thus BUG_ON triggered. Because the skb is cloned and a part of nonlinear data is split off. Duplicate packet is cloned in netem_enqueue() and may be delayed some time in qdisc. 
When the qdisc length reaches the limit and netem returns NET_XMIT_DROP, the skb is retransmitted later from the write queue. The skb is then fragmented by tso_fragment(): as the size limit, which depends on cwnd and mss, decreases, the skb's nonlinear data is split off. The length of the skb cloned by netem is not updated. When a virtio_net NIC is used and skb_to_sgvec() is invoked, the BUG_ON triggers. To fix it, netem returns NET_XMIT_SUCCESS to the upper stack when it clones a duplicate packet. Fixes: 35d889d1 ("sch_netem: fix skb leak in netem_enqueue()") Signed-off-by: Sheng Lan <lansheng@huawei.com> Reported-by: Qin Ji <jiqin.ji@huawei.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27 | net: sched: act_csum: Fix csum calc for tagged packets | Eli Britstein | 1 | -2/+29
The csum calculation is different for IPv4/6. For VLAN packets, tc_skb_protocol returns the VLAN protocol rather than the packet's one (e.g. IPv4/6), so csum is not calculated. Furthermore, VLAN may not be stripped so csum is not calculated in this case too. Calculate the csum for those cases. Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
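The idea of looking past VLAN tags to find the real L3 protocol can be sketched on a raw Ethernet frame. This is a userspace illustration under stated assumptions, not the act_csum code: `inner_ethertype()` and `read_be16()` are hypothetical helpers, and the sketch only handles tags that are still present in the packet data.

```c
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_VLAN_8021Q  0x8100u
#define ETHERTYPE_VLAN_8021AD 0x88A8u

/* Read a big-endian 16-bit value. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: return the EtherType of the payload carried by an
 * Ethernet frame, skipping any VLAN tags still present in the data.
 * Returns 0 if the frame is too short to decide.
 */
static uint16_t inner_ethertype(const uint8_t *frame, size_t len)
{
    size_t off = 12; /* skip destination and source MAC addresses */

    while (off + 2 <= len) {
        uint16_t proto = read_be16(frame + off);

        if (proto != ETHERTYPE_VLAN_8021Q && proto != ETHERTYPE_VLAN_8021AD)
            return proto; /* e.g. 0x0800 for IPv4, 0x86DD for IPv6 */
        off += 4; /* 2 bytes TPID + 2 bytes TCI */
    }
    return 0;
}
```

A checksum routine would then branch on the returned value (IPv4 versus IPv6) instead of on the outer VLAN protocol.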
2019-02-27 | net: sched: act_tunnel_key: fix metadata handling | Vlad Buslov | 1 | -9/+9
Tunnel key action params->tcft_enc_metadata is only set when the action is TCA_TUNNEL_KEY_ACT_SET. However, the metadata pointer is incorrectly dereferenced during tunnel key init and release without verifying that the action is of the correct type, which causes a NULL pointer dereference. The metadata tunnel dst_cache is also leaked on action overwrite. Fix metadata handling: - Verify that the metadata pointer is not NULL before dereferencing it in the tunnel_key_init error handling code. - Move the dst_cache destroy code into the tunnel_key_release_params() function, which is called in both the action overwrite and release cases (fixes the resource leak) and verifies that the action has the correct type before dereferencing the metadata pointer (fixes the NULL pointer dereference). Oops with KASAN enabled during tdc tests execution: [ 261.080482] ================================================================== [ 261.088049] BUG: KASAN: null-ptr-deref in dst_cache_destroy+0x21/0xa0 [ 261.094613] Read of size 8 at addr 00000000000000b0 by task tc/2976 [ 261.102524] CPU: 14 PID: 2976 Comm: tc Not tainted 5.0.0-rc7+ #157 [ 261.108844] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 261.116726] Call Trace: [ 261.119234] dump_stack+0x9a/0xeb [ 261.122625] ? dst_cache_destroy+0x21/0xa0 [ 261.126818] ? dst_cache_destroy+0x21/0xa0 [ 261.131004] kasan_report+0x176/0x192 [ 261.134752] ? idr_get_next+0xd0/0x120 [ 261.138578] ? dst_cache_destroy+0x21/0xa0 [ 261.142768] dst_cache_destroy+0x21/0xa0 [ 261.146799] tunnel_key_release+0x3a/0x50 [act_tunnel_key] [ 261.152392] tcf_action_cleanup+0x2c/0xc0 [ 261.156490] tcf_generic_walker+0x4c2/0x5c0 [ 261.160794] ? tcf_action_dump_1+0x390/0x390 [ 261.165163] ? tunnel_key_walker+0x5/0x1a0 [act_tunnel_key] [ 261.170865] ? tunnel_key_walker+0xe9/0x1a0 [act_tunnel_key] [ 261.176641] tca_action_gd+0x600/0xa40 [ 261.180482] ? tca_get_fill.constprop.17+0x200/0x200 [ 261.185548] ? __lock_acquire+0x588/0x1d20 [ 261.189741] ? __lock_acquire+0x588/0x1d20 [ 261.193922] ? 
mark_held_locks+0x90/0x90 [ 261.197944] ? mark_held_locks+0x90/0x90 [ 261.202018] ? __nla_parse+0xfe/0x190 [ 261.205774] tc_ctl_action+0x218/0x230 [ 261.209614] ? tcf_action_add+0x230/0x230 [ 261.213726] rtnetlink_rcv_msg+0x3a5/0x600 [ 261.217910] ? lock_downgrade+0x2d0/0x2d0 [ 261.222006] ? validate_linkmsg+0x400/0x400 [ 261.226278] ? find_held_lock+0x6d/0xd0 [ 261.230200] ? match_held_lock+0x1b/0x210 [ 261.234296] ? validate_linkmsg+0x400/0x400 [ 261.238567] netlink_rcv_skb+0xc7/0x1f0 [ 261.242489] ? netlink_ack+0x470/0x470 [ 261.246319] ? netlink_deliver_tap+0x1f3/0x5a0 [ 261.250874] netlink_unicast+0x2ae/0x350 [ 261.254884] ? netlink_attachskb+0x340/0x340 [ 261.261647] ? _copy_from_iter_full+0xdd/0x380 [ 261.268576] ? __virt_addr_valid+0xb6/0xf0 [ 261.275227] ? __check_object_size+0x159/0x240 [ 261.282184] netlink_sendmsg+0x4d3/0x630 [ 261.288572] ? netlink_unicast+0x350/0x350 [ 261.295132] ? netlink_unicast+0x350/0x350 [ 261.301608] sock_sendmsg+0x6d/0x80 [ 261.307467] ___sys_sendmsg+0x48e/0x540 [ 261.313633] ? copy_msghdr_from_user+0x210/0x210 [ 261.320545] ? save_stack+0x89/0xb0 [ 261.326289] ? __lock_acquire+0x588/0x1d20 [ 261.332605] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 261.340063] ? mark_held_locks+0x90/0x90 [ 261.346162] ? do_filp_open+0x138/0x1d0 [ 261.352108] ? may_open_dev+0x50/0x50 [ 261.357897] ? match_held_lock+0x1b/0x210 [ 261.364016] ? __fget_light+0xa6/0xe0 [ 261.369840] ? __sys_sendmsg+0xd2/0x150 [ 261.375814] __sys_sendmsg+0xd2/0x150 [ 261.381610] ? __ia32_sys_shutdown+0x30/0x30 [ 261.388026] ? lock_downgrade+0x2d0/0x2d0 [ 261.394182] ? mark_held_locks+0x1c/0x90 [ 261.400230] ? 
do_syscall_64+0x1e/0x280 [ 261.406172] do_syscall_64+0x78/0x280 [ 261.411932] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 261.419103] RIP: 0033:0x7f28e91a8b87 [ 261.424791] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53 48 89 f3 48 [ 261.448226] RSP: 002b:00007ffdc5c4e2d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 261.458183] RAX: ffffffffffffffda RBX: 000000005c73c202 RCX: 00007f28e91a8b87 [ 261.467728] RDX: 0000000000000000 RSI: 00007ffdc5c4e340 RDI: 0000000000000003 [ 261.477342] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000000c [ 261.486970] R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001 [ 261.496599] R13: 000000000067b4e0 R14: 00007ffdc5c5248c R15: 00007ffdc5c52480 [ 261.506281] ================================================================== [ 261.516076] Disabling lock debugging due to kernel taint [ 261.523979] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 [ 261.534413] #PF error: [normal kernel read fault] [ 261.541730] PGD 8000000317400067 P4D 8000000317400067 PUD 316878067 PMD 0 [ 261.551294] Oops: 0000 [#1] SMP KASAN PTI [ 261.557985] CPU: 14 PID: 2976 Comm: tc Tainted: G B 5.0.0-rc7+ #157 [ 261.568306] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 261.578874] RIP: 0010:dst_cache_destroy+0x21/0xa0 [ 261.586413] Code: f4 ff ff ff eb f6 0f 1f 00 0f 1f 44 00 00 41 56 41 55 49 c7 c6 60 fe 35 af 41 54 55 49 89 fc 53 bd ff ff ff ff e8 ef 98 73 ff <49> 83 3c 24 00 75 35 eb 6c 4c 63 ed e8 de 98 73 ff 4a 8d 3c ed 40 [ 261.611247] RSP: 0018:ffff888316447160 EFLAGS: 00010282 [ 261.619564] RAX: 0000000000000000 RBX: ffff88835b3e2f00 RCX: ffffffffad1c5071 [ 261.629862] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: 0000000000000297 [ 261.640149] RBP: 00000000ffffffff R08: fffffbfff5dd4e89 R09: fffffbfff5dd4e89 [ 261.650467] R10: 
0000000000000001 R11: fffffbfff5dd4e88 R12: 00000000000000b0 [ 261.660785] R13: ffff8883267a10c0 R14: ffffffffaf35fe60 R15: 0000000000000001 [ 261.671110] FS: 00007f28ea3e6400(0000) GS:ffff888364200000(0000) knlGS:0000000000000000 [ 261.682447] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 261.691491] CR2: 00000000000000b0 CR3: 00000003178ae004 CR4: 00000000001606e0 [ 261.701283] Call Trace: [ 261.706374] tunnel_key_release+0x3a/0x50 [act_tunnel_key] [ 261.714522] tcf_action_cleanup+0x2c/0xc0 [ 261.721208] tcf_generic_walker+0x4c2/0x5c0 [ 261.728074] ? tcf_action_dump_1+0x390/0x390 [ 261.734996] ? tunnel_key_walker+0x5/0x1a0 [act_tunnel_key] [ 261.743247] ? tunnel_key_walker+0xe9/0x1a0 [act_tunnel_key] [ 261.751557] tca_action_gd+0x600/0xa40 [ 261.757991] ? tca_get_fill.constprop.17+0x200/0x200 [ 261.765644] ? __lock_acquire+0x588/0x1d20 [ 261.772461] ? __lock_acquire+0x588/0x1d20 [ 261.779266] ? mark_held_locks+0x90/0x90 [ 261.785880] ? mark_held_locks+0x90/0x90 [ 261.792470] ? __nla_parse+0xfe/0x190 [ 261.798738] tc_ctl_action+0x218/0x230 [ 261.805145] ? tcf_action_add+0x230/0x230 [ 261.811760] rtnetlink_rcv_msg+0x3a5/0x600 [ 261.818564] ? lock_downgrade+0x2d0/0x2d0 [ 261.825433] ? validate_linkmsg+0x400/0x400 [ 261.832256] ? find_held_lock+0x6d/0xd0 [ 261.838624] ? match_held_lock+0x1b/0x210 [ 261.845142] ? validate_linkmsg+0x400/0x400 [ 261.851729] netlink_rcv_skb+0xc7/0x1f0 [ 261.857976] ? netlink_ack+0x470/0x470 [ 261.864132] ? netlink_deliver_tap+0x1f3/0x5a0 [ 261.870969] netlink_unicast+0x2ae/0x350 [ 261.877294] ? netlink_attachskb+0x340/0x340 [ 261.883962] ? _copy_from_iter_full+0xdd/0x380 [ 261.890750] ? __virt_addr_valid+0xb6/0xf0 [ 261.897188] ? __check_object_size+0x159/0x240 [ 261.903928] netlink_sendmsg+0x4d3/0x630 [ 261.910112] ? netlink_unicast+0x350/0x350 [ 261.916410] ? netlink_unicast+0x350/0x350 [ 261.922656] sock_sendmsg+0x6d/0x80 [ 261.928257] ___sys_sendmsg+0x48e/0x540 [ 261.934183] ? copy_msghdr_from_user+0x210/0x210 [ 261.940865] ? 
save_stack+0x89/0xb0 [ 261.946355] ? __lock_acquire+0x588/0x1d20 [ 261.952358] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 261.959468] ? mark_held_locks+0x90/0x90 [ 261.965248] ? do_filp_open+0x138/0x1d0 [ 261.970910] ? may_open_dev+0x50/0x50 [ 261.976386] ? match_held_lock+0x1b/0x210 [ 261.982210] ? __fget_light+0xa6/0xe0 [ 261.987648] ? __sys_sendmsg+0xd2/0x150 [ 261.993263] __sys_sendmsg+0xd2/0x150 [ 261.998613] ? __ia32_sys_shutdown+0x30/0x30 [ 262.004555] ? lock_downgrade+0x2d0/0x2d0 [ 262.010236] ? mark_held_locks+0x1c/0x90 [ 262.015758] ? do_syscall_64+0x1e/0x280 [ 262.021234] do_syscall_64+0x78/0x280 [ 262.026500] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 262.033207] RIP: 0033:0x7f28e91a8b87 [ 262.038421] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53 48 89 f3 48 [ 262.060708] RSP: 002b:00007ffdc5c4e2d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 262.070112] RAX: ffffffffffffffda RBX: 000000005c73c202 RCX: 00007f28e91a8b87 [ 262.079087] RDX: 0000000000000000 RSI: 00007ffdc5c4e340 RDI: 0000000000000003 [ 262.088122] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000000c [ 262.097157] R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001 [ 262.106207] R13: 000000000067b4e0 R14: 00007ffdc5c5248c R15: 00007ffdc5c52480 [ 262.115271] Modules linked in: act_tunnel_key act_skbmod act_simple act_connmark nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 act_csum libcrc32c act_meta_skbtcindex act_meta_skbprio act_meta_mark act_ife ife act_police act_sample psample act_gact veth nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc intel_rapl sb_edac mlx5_ib x86_pkg_temp_thermal sunrpc intel_powerclamp coretemp ib_uverbs kvm_intel ib_core kvm irqbypass mlx5_core crct10dif_pclmul crc32_pclmul crc32c_intel igb ghash_clmulni_intel intel_cstate mlxfw iTCO_wdt devlink intel_uncore iTCO_vendor_support ipmi_ssif ptp 
mei_me intel_rapl_perf ioatdma joydev pps_core ses mei i2c_i801 pcspkr enclosure lpc_ich dca wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pcc_cpufreq ast i2c_algo_bit drm_kms_helper ttm drm mpt3sas raid_class scsi_transport_sas [ 262.204393] CR2: 00000000000000b0 [ 262.210390] ---[ end trace 2e41d786f2c7901a ]--- [ 262.226790] RIP: 0010:dst_cache_destroy+0x21/0xa0 [ 262.234083] Code: f4 ff ff ff eb f6 0f 1f 00 0f 1f 44 00 00 41 56 41 55 49 c7 c6 60 fe 35 af 41 54 55 49 89 fc 53 bd ff ff ff ff e8 ef 98 73 ff <49> 83 3c 24 00 75 35 eb 6c 4c 63 ed e8 de 98 73 ff 4a 8d 3c ed 40 [ 262.258311] RSP: 0018:ffff888316447160 EFLAGS: 00010282 [ 262.266304] RAX: 0000000000000000 RBX: ffff88835b3e2f00 RCX: ffffffffad1c5071 [ 262.276251] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: 0000000000000297 [ 262.286208] RBP: 00000000ffffffff R08: fffffbfff5dd4e89 R09: fffffbfff5dd4e89 [ 262.296183] R10: 0000000000000001 R11: fffffbfff5dd4e88 R12: 00000000000000b0 [ 262.306157] R13: ffff8883267a10c0 R14: ffffffffaf35fe60 R15: 0000000000000001 [ 262.316139] FS: 00007f28ea3e6400(0000) GS:ffff888364200000(0000) knlGS:0000000000000000 [ 262.327146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 262.335815] CR2: 00000000000000b0 CR3: 00000003178ae004 CR4: 00000000001606e0 Fixes: 41411e2fd6b8 ("net/sched: act_tunnel_key: Add dst_cache support") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
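The guard described in this commit message (only touch the metadata when the action type actually carries it) can be sketched in userspace C. All names here are hypothetical stand-ins, not the act_tunnel_key structures; the counter exists only so the behaviour can be observed.

```c
#include <stdlib.h>

/* Hypothetical action types standing in for the set/release actions. */
enum { SKETCH_ACT_SET = 1, SKETCH_ACT_RELEASE = 2 };

struct metadata_sketch {
    int cache_in_use;
};

struct params_sketch {
    int action;
    struct metadata_sketch *metadata; /* only allocated for SKETCH_ACT_SET */
};

static int caches_destroyed; /* instrumentation for the sketch */

static void cache_destroy(struct metadata_sketch *md)
{
    md->cache_in_use = 0;
    caches_destroyed++;
}

/*
 * Single release path shared by overwrite and teardown: the cache is
 * destroyed only when the action type owns metadata and the pointer
 * is set, so a "release" action never dereferences NULL.
 */
static void params_release(struct params_sketch *p)
{
    if (!p)
        return;
    if (p->action == SKETCH_ACT_SET && p->metadata) {
        cache_destroy(p->metadata);
        free(p->metadata);
    }
    free(p);
}
```

Routing both the overwrite and the teardown case through one function is what closes the leak: there is no second path that can forget to destroy the cache.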
2019-02-27 | Revert "net: sched: fw: don't set arg->stop in fw_walk() when empty" | Vlad Buslov | 1 | -1/+4
This reverts commit 31a998487641 ("net: sched: fw: don't set arg->stop in fw_walk() when empty") Cls API function tcf_proto_is_empty() was changed in commit 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") to no longer depend on arg->stop to determine that classifier instance is empty. Instead, it adds dedicated arg->nonempty field, which makes the fix in fw classifier no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 | net: sched: pie: fix 64-bit division | Leslie Monis | 1 | -1/+1
Use div_u64() to resolve build failures on 32-bit platforms. Fixes: 3f7ae5f3dc52 ("net: sched: pie: add more cases to auto-tune alpha and beta") Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 | net: Use RCU_POINTER_INITIALIZER() to init static variable | Li RongQing | 1 | -1/+1
This pointer is RCU protected, so proper primitives should be used. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 | net: sched: fix typo in walker_check_empty() | Vlad Buslov | 1 | -2/+2
Function walker_check_empty() incorrectly verifies that the tp pointer is not NULL, instead of the actual filter pointer. Fix the conditional to check the right pointer. Rename the filter pointer to match the other cls API functions. Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 | net: sched: pie: fix mistake in reference link | Leslie Monis | 1 | -1/+1
Fix the incorrect reference link to RFC 8033 Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 | net: sched: pie: update references | Mohit P. Tahiliani | 1 | -3/+1
RFC 8033 replaces the IETF draft for PIE Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 | net: sched: pie: add derandomization mechanism | Mohit P. Tahiliani | 1 | -1/+27
Random dropping of packets to achieve latency control may introduce outlier situations where packets are dropped too close to each other or too far from each other. This can cause the real drop percentage to temporarily deviate from the intended drop probability. In certain scenarios, such as a small number of simultaneous TCP flows, these deviations can cause significant deviations in link utilization and queuing latency. RFC 8033 suggests using a derandomization mechanism to avoid these deviations. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
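A derandomized drop decision can be sketched as follows. This is a floating-point userspace illustration, not the sch_pie code: the struct and function names are hypothetical, and the 0.85 and 8.5 thresholds are taken from my reading of the RFC 8033 pseudocode, so they should be checked against the RFC. The caller supplies the random number `u` so the sketch stays deterministic.

```c
#include <stdbool.h>

/* Hypothetical state; names loosely follow the RFC 8033 pseudocode. */
struct pie_sketch {
    double drop_prob; /* current drop probability, 0.0 .. 1.0 */
    double accu_prob; /* probability accumulated since the last drop */
};

/*
 * Decide whether to drop a packet. `u` is a uniform random number in
 * [0, 1) supplied by the caller.
 */
static bool pie_should_drop(struct pie_sketch *q, double u)
{
    bool drop;

    q->accu_prob += q->drop_prob;

    if (q->accu_prob < 0.85)
        drop = false;              /* too soon after the last drop */
    else if (q->accu_prob >= 8.5)
        drop = true;               /* too long without a drop */
    else
        drop = (u < q->drop_prob); /* ordinary random drop */

    if (drop)
        q->accu_prob = 0.0;        /* restart accumulation */
    return drop;
}
```

The lower bound prevents drops that are too close together, and the upper bound forces a drop when the random draws have been lucky for too long, which keeps the observed drop rate near `drop_prob`.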
2019-02-25 | net: sched: pie: add more cases to auto-tune alpha and beta | Mohit P. Tahiliani | 1 | -33/+32
The current implementation scales the local alpha and beta variables in the calculate_probability function by the same amount for all values of drop probability below 1%. RFC 8033 suggests using additional cases for auto-tuning alpha and beta when the drop probability is less than 1%. In order to add more auto-tuning cases, MAX_PROB must be scaled by u64 instead of u32 to prevent underflow when scaling the local alpha and beta variables in the calculate_probability function. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
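The stepped auto-tuning can be sketched as a lookup from the current drop probability to a divisor applied to the alpha/beta adjustment. This is a floating-point illustration with hypothetical function names; the thresholds and divisors below are my recollection of the RFC 8033 pseudocode and are an assumption to verify against the RFC, not a copy of the kernel's fixed-point code.

```c
/*
 * Hypothetical helper: how much to scale down the probability
 * adjustment for a given drop probability. Smaller probabilities
 * get smaller adjustments so the controller does not overshoot.
 * Thresholds are assumed from RFC 8033; verify before relying on them.
 */
static unsigned int pie_autotune_divisor(double drop_prob)
{
    if (drop_prob < 0.000001)
        return 2048;
    if (drop_prob < 0.00001)
        return 512;
    if (drop_prob < 0.0001)
        return 128;
    if (drop_prob < 0.001)
        return 32;
    if (drop_prob < 0.01)
        return 8;
    if (drop_prob < 0.1)
        return 2;
    return 1;
}

/* Apply the divisor to a raw adjustment computed from alpha and beta. */
static double pie_scaled_delta(double raw_delta, double drop_prob)
{
    return raw_delta / pie_autotune_divisor(drop_prob);
}
```

In a fixed-point implementation the smallest thresholds are tiny fractions of the maximum probability value, which is why the commit message moves the scaling to 64-bit arithmetic.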
2019-02-25 | net: sched: pie: change initial value of pie_vars->burst_time | Mohit P. Tahiliani | 1 | -2/+2
RFC 8033 suggests an initial value of 150 milliseconds for the maximum time allowed for a burst of packets. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 | net: sched: pie: change default value of pie_params->tupdate | Mohit P. Tahiliani | 1 | -1/+1
RFC 8033 suggests a default value of 15 milliseconds for the update interval. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 | net: sched: pie: change default value of pie_params->target | Mohit P. Tahiliani | 1 | -1/+1
RFC 8033 suggests a default value of 15 milliseconds for the target queue delay. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 | net: sched: pie: change value of QUEUE_THRESHOLD | Mohit P. Tahiliani | 1 | -1/+1
RFC 8033 recommends a value of 16384 bytes for the queue threshold. Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in> Signed-off-by: Dhaval Khandla <dhavaljkhandla26@gmail.com> Signed-off-by: Hrishikesh Hiraskar <hrishihiraskar@gmail.com> Signed-off-by: Manish Kumar B <bmanish15597@gmail.com> Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com> Signed-off-by: Leslie Monis <lesliemonis@gmail.com> Acked-by: Dave Taht <dave.taht@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
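The RFC 8033 values named in this group of PIE commits can be collected in one place. The struct and field names below are hypothetical and use milliseconds and bytes directly; the kernel stores these in its own fields and time units, which this sketch does not reproduce.

```c
/* Hypothetical container for the defaults quoted in the commit messages. */
struct pie_defaults_sketch {
    unsigned int target_ms;             /* target queue delay */
    unsigned int tupdate_ms;            /* probability update interval */
    unsigned int burst_time_ms;         /* maximum time allowed for a burst */
    unsigned int queue_threshold_bytes; /* queue threshold */
};

static const struct pie_defaults_sketch pie_rfc8033_defaults = {
    .target_ms             = 15,
    .tupdate_ms            = 15,
    .burst_time_ms         = 150,
    .queue_threshold_bytes = 16384,
};
```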
2019-02-25 | net: sched: don't release block->lock when dumping chains | Vlad Buslov | 1 | -9/+7
Function tc_dump_chain() obtains and releases block->lock on each iteration of its inner loop, which dumps all chains on the block. Outputting chain template info is a fast operation, so locking and unlocking the mutex multiple times is pure overhead when the lock is highly contended. Modify tc_dump_chain() to obtain block->lock only once and dump all chains without releasing it. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
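The difference between the two locking patterns can be sketched with a stand-in lock that only counts acquisitions. Everything here is hypothetical (the struct, the helpers, the fixed-size chain array); it illustrates the shape of the change, not the cls_api code.

```c
#include <stddef.h>

#define MAX_CHAINS 4

/* Hypothetical stand-in for a tc block; the "lock" only counts acquisitions. */
struct block_sketch {
    int chains[MAX_CHAINS];
    size_t nchains;
    unsigned int lock_acquisitions;
};

static void block_lock(struct block_sketch *b)   { b->lock_acquisitions++; }
static void block_unlock(struct block_sketch *b) { (void)b; }

/* Before: take and drop the lock around every chain. */
static size_t dump_lock_per_chain(struct block_sketch *b, int *out)
{
    size_t n = 0;

    for (size_t i = 0; i < b->nchains; i++) {
        block_lock(b);
        out[n++] = b->chains[i];
        block_unlock(b);
    }
    return n;
}

/* After: take the lock once and dump every chain under it. */
static size_t dump_lock_once(struct block_sketch *b, int *out)
{
    size_t n = 0;

    block_lock(b);
    for (size_t i = 0; i < b->nchains; i++)
        out[n++] = b->chains[i];
    block_unlock(b);
    return n;
}
```

Both variants produce the same output; the second trades a slightly longer hold time for one acquisition instead of one per chain, which is the right trade when each iteration is cheap.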
2019-02-25 | net: sched: set dedicated tcf_walker flag when tp is empty | Vlad Buslov | 1 | -4/+9
Using tcf_walker->stop flag to determine when tcf_walker->fn() was called at least once is unreliable. Some classifiers set 'stop' flag on error before calling walker callback, other classifiers used to call it with NULL filter pointer when empty. In order to prevent further regressions, extend tcf_walker structure with dedicated 'nonempty' flag. Set this flag in tcf_walker->fn() implementation that is used to check if classifier has filters configured. Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
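Why a dedicated flag is more reliable than reusing 'stop' can be sketched with a toy walker. The types and functions are hypothetical simplifications, not the tcf_walker API: the point is that 'stop' may be set for reasons unrelated to filters, while 'nonempty' is set only inside the callback and only for a non-NULL filter.

```c
#include <stdbool.h>
#include <stddef.h>

struct walker_sketch {
    bool stop;     /* may be set by a classifier for unrelated reasons */
    bool nonempty; /* set only by the callback, only for a real filter */
    int (*fn)(void *filter, struct walker_sketch *arg);
};

/* Callback used to test for emptiness: checks the filter pointer itself. */
static int check_empty_cb(void *filter, struct walker_sketch *arg)
{
    if (filter) {
        arg->nonempty = true;
        return -1; /* one filter is enough, stop walking */
    }
    return 0;
}

/*
 * Toy classifier walk. With fail_early set it behaves like a classifier
 * that sets 'stop' on error before ever calling the callback.
 */
static void walk_classifier(void **filters, size_t n, bool fail_early,
                            struct walker_sketch *arg)
{
    if (fail_early) {
        arg->stop = true;
        return;
    }
    for (size_t i = 0; i < n; i++) {
        if (arg->fn(filters[i], arg) < 0) {
            arg->stop = true;
            break;
        }
    }
}

static bool classifier_is_empty(void **filters, size_t n, bool fail_early)
{
    struct walker_sketch arg = { .fn = check_empty_cb };

    walk_classifier(filters, n, fail_early, &arg);
    return !arg.nonempty; /* 'stop' is deliberately ignored */
}
```

A classifier that sets 'stop' early, or that invokes the callback with a NULL filter, is still reported as empty, which is the behaviour the 'stop'-based check could not guarantee.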
2019-02-25 | net: sched: act_tunnel_key: fix NULL pointer dereference during init | Vlad Buslov | 1 | -1/+2
Metadata pointer is only initialized for action TCA_TUNNEL_KEY_ACT_SET, but it is unconditionally dereferenced in tunnel_key_init() error handler. Verify that metadata pointer is not NULL before dereferencing it in tunnel_key_init error handling code. Fixes: ee28bb56ac5b ("net/sched: fix memory leak in act_tunnel_key_init()") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 | net/sched: act_tunnel_key: Add dst_cache support | wenxu | 1 | -4/+21
The metadata_dst does not initialize the dst_cache, so ip_md_tunnel_xmit() cannot use the dst_cache and has to look up the route table for every packet. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 | net/sched: act_skbedit: fix refcount leak when replace fails | Davide Caratti | 1 | -2/+1
when act_skbedit was converted to use RCU in the data plane, we added an error path, but we forgot to drop the action refcount in case of failure during a 'replace' operation: # tc actions add action skbedit ptype otherhost pass index 100 # tc action show action skbedit total acts 1 action order 0: skbedit ptype otherhost pass index 100 ref 1 bind 0 # tc actions replace action skbedit ptype otherhost drop index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action show action skbedit total acts 1 action order 0: skbedit ptype otherhost pass index 100 ref 2 bind 0 Ensure we call tcf_idr_release(), in case 'params_new' allocation failed, also when the action is being replaced. Fixes: c749cdda9089 ("net/sched: act_skbedit: don't use spinlock in the data path") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 | net/sched: act_ipt: fix refcount leak when replace fails | Davide Caratti | 1 | -2/+1
After commit 4e8ddd7f1758 ("net: sched: don't release reference on action overwrite"), the error path of all actions was converted to drop refcount also when the action was being overwritten. But we forgot act_ipt_init(), in case allocation of 'tname' was not successful: # tc action add action xt -j LOG --log-prefix hello index 100 tablename: mangle hook: NF_IP_POST_ROUTING target: LOG level warning prefix "hello" index 100 # tc action show action xt total acts 1 action order 0: tablename: mangle hook: NF_IP_POST_ROUTING target LOG level warning prefix "hello" index 100 ref 1 bind 0 # tc action replace action xt -j LOG --log-prefix world index 100 tablename: mangle hook: NF_IP_POST_ROUTING target: LOG level warning prefix "world" index 100 RTNETLINK answers: Cannot allocate memory We have an error talking to the kernel # tc action show action xt total acts 1 action order 0: tablename: mangle hook: NF_IP_POST_ROUTING target LOG level warning prefix "hello" index 100 ref 2 bind 0 Ensure we call tcf_idr_release(), in case 'tname' allocation failed, also when the action is being replaced. Fixes: 4e8ddd7f1758 ("net: sched: don't release reference on action overwrite") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
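The refcount pattern behind these 'replace' fixes can be sketched in a few lines. The types and helpers are hypothetical and do not model tcf_idr_release() itself; they only show how an early return on allocation failure leaves an extra reference, matching the "ref 1" to "ref 2" change in the transcript above.

```c
#include <errno.h>
#include <stdbool.h>

struct action_sketch {
    int refcnt;
};

static void action_get(struct action_sketch *a) { a->refcnt++; }
static void action_put(struct action_sketch *a) { a->refcnt--; }

/* Buggy shape: the failure path returns without dropping the reference. */
static int replace_leaky(struct action_sketch *a, bool alloc_fails)
{
    action_get(a);          /* reference taken while looking up the action */
    if (alloc_fails)
        return -ENOMEM;     /* leak: the reference is never dropped */
    action_put(a);
    return 0;
}

/* Fixed shape: every exit path drops the reference taken at the top. */
static int replace_fixed(struct action_sketch *a, bool alloc_fails)
{
    int err = 0;

    action_get(a);
    if (alloc_fails)
        err = -ENOMEM;
    action_put(a);
    return err;
}
```

Starting from a refcount of 1, a failed `replace_leaky()` leaves it at 2, while a failed `replace_fixed()` leaves it at 1.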
2019-02-22 | net_sched: initialize net pointer inside tcf_exts_init() | Cong Wang | 12 | -27/+27
For tcindex filter, it is too late to initialize the net pointer in tcf_exts_validate(), as tcf_exts_get_net() requires a non-NULL net pointer. We can just move its initialization into tcf_exts_init(), which just requires an additional parameter. This makes the code in tcindex_alloc_perfect_hash() prettier. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 | net: sched: potential NULL dereference in tcf_block_find() | Dan Carpenter | 1 | -1/+3
The error code isn't set on this path so it would result in returning ERR_PTR(0) and a NULL dereference in the caller. Fixes: 18d3eefb17cf ("net: sched: refactor tcf_block_find() into standalone functions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
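Why ERR_PTR(0) is dangerous can be shown with userspace stand-ins for the kernel's error-pointer helpers. The lowercase `err_ptr()` and `is_err()` below are simplified re-implementations written for this sketch, not the kernel macros, though they follow the same convention of encoding small negative errno values in the top of the address range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True only for pointers in the top MAX_ERRNO bytes of the address space. */
static bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

With these helpers, `err_ptr(-22)` is recognised as an error, but `err_ptr(0)` is simply NULL and passes an `is_err()` check. A caller that only tests for an error pointer will then dereference NULL, which is the failure mode the commit closes by setting the error code on that path.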
2019-02-20 | net_sched: fix a memory leak in cls_tcindex | Cong Wang | 1 | -13/+24
(cherry picked from commit 033b228e7f26b29ae37f8bfa1bc6b209a5365e9f) When tcindex_destroy() destroys all the filter results in the perfect hash table, it invokes the walker to delete each of them. However, results with class==0 are skipped in either tcindex_walk() or tcindex_delete(), which causes a memory leak reported by kmemleak. This patch fixes it by skipping the walker and directly deleting these filter results so we don't miss any filter result. As a result of this change, we have to initialize exts->net properly in tcindex_alloc_perfect_hash(). For net-next, we need to consider whether we should initialize ->net in tcf_exts_init() instead, before that just directly test CONFIG_NET_CLS_ACT=y. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-20 | net_sched: fix a race condition in tcindex_destroy() | Cong Wang | 1 | -7/+11
(cherry picked from commit 8015d93ebd27484418d4952284fd02172fa4b0b2) tcindex_destroy() invokes tcindex_destroy_element() via a walker to delete each filter result in its perfect hash table, and tcindex_destroy_element() calls tcindex_delete() which schedules tcf RCU works to do the final deletion work. Unfortunately this races with the RCU callback __tcindex_destroy(), which could lead to use-after-free as reported by Adrian. Fix this by migrating this RCU callback to tcf RCU work too, as that workqueue is ordered, we will not have use-after-free. Note, we don't need to hold netns refcnt because we don't call tcf_exts_destroy() here. Fixes: 27ce4f05e2ab ("net_sched: use tcf_queue_work() in tcindex filter") Reported-by: Adrian <bugs@abtelecom.ro> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-18 | net: sched: using kfree_rcu() to simplify the code | Wei Yongjun | 1 | -6/+1
The callback function of call_rcu() just calls a kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 | net: sched: sch_api: set an error msg when qdisc_alloc_handle() fails | Ivan Vecera | 1 | -2/+4
This patch sets an error message in extack when the number of qdisc handles exceeds the maximum. Also the error-code ENOSPC is more appropriate than ENOMEM in this situation. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reported-by: Li Shuang <shuali@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 | net: sched: cgroup: verify that filter is not NULL during walk | Vlad Buslov | 1 | -0/+2
Check that filter is not NULL before passing it to tcf_walker->fn() callback in cls_cgroup_walk(). This can happen when cls_cgroup_change() failed to set first filter. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 | net: sched: matchall: verify that filter is not NULL in mall_walk() | Vlad Buslov | 1 | -0/+3
Check that filter is not NULL before passing it to tcf_walker->fn() callback. This can happen when mall_change() failed to offload filter to hardware. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Reported-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: route: don't set arg->stop in route4_walk() when emptyVlad Buslov1-4/+1
Some classifiers set arg->stop in their implementation of tp->walk() API when empty. Most classifiers do not adhere to that convention. Do not set arg->stop in route4_walk() to unify tp->walk() behavior among classifier implementations. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17net: sched: fw: don't set arg->stop in fw_walk() when emptyVlad Buslov1-4/+1
Some classifiers set arg->stop in their implementation of tp->walk() API when empty. Most classifiers do not adhere to that convention. Do not set arg->stop in fw_walk() to unify tp->walk() behavior among classifier implementations. Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-13/+14
The netfilter conflicts were rather simple overlapping changes. However, the cls_tcindex.c stuff was a bit more complex. On the 'net' side, Cong is fixing several races and memory leaks. Whilst on the 'net-next' side we have Vlad adding the rtnl-ness support. What I've decided to do, in order to resolve this, is revert the conversion over to using a workqueue that Cong did, bringing us back to pure RCU. I did it this way because I believe that either Cong's races don't apply with how Vlad did things, or Cong will have to implement the race fix slightly differently. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: sched: remove duplicated include from cls_api.cYueHaibing1-1/+0
Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13net: sched: flower: only return error from hw offload if skip_swVlad Buslov1-2/+10
Recently introduced tc_setup_flow_action() can fail when parsing tcf_exts on some unsupported action commands. However, this should not affect the case when user did not explicitly request hw offload by setting skip_sw flag. Modify tc_setup_flow_action() callers to only propagate the error if skip_sw flag is set for filter that is being offloaded, and set extack error message in that case. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Fixes: 3a7b68617de7 ("cls_api: add translator to flow_action representation") Signed-off-by: David S. Miller <davem@davemloft.net>
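The error-propagation rule in this commit can be modelled in a few lines of userspace C. This is a hedged sketch: `setup_hw_actions()`, `offload_filter()`, the error code chosen, and the message string are hypothetical stand-ins, not the kernel's functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for translating actions for hardware. */
static int setup_hw_actions(bool actions_supported)
{
	return actions_supported ? 0 : -EOPNOTSUPP;
}

/*
 * Try to offload a filter. A translation failure is fatal only when
 * the user asked for hardware-only operation (skip_sw); otherwise the
 * filter simply stays in software and the call succeeds.
 */
static int offload_filter(bool skip_sw, bool actions_supported,
			  const char **errmsg)
{
	int err = setup_hw_actions(actions_supported);

	assert(errmsg != NULL);
	if (err && skip_sw) {
		*errmsg = "Failed to set up flow action";
		return err;
	}
	return 0;	/* offloaded, or fell back to software processing */
}
```

Only the skip_sw caller sees the failure and the message; a filter without skip_sw is still accepted.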
2019-02-12net_sched: fix two more memory leaks in cls_tcindexCong Wang1-9/+7
struct tcindex_filter_result contains two parts: struct tcf_exts and struct tcf_result. For the local variable 'cr', its exts part is never used but initialized without being released properly on success path. So just completely remove the exts part to fix this leak. For the local variable 'new_filter_result', it is never properly released if not used by 'r' on success path. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net_sched: fix a memory leak in cls_tcindexCong Wang1-16/+30
When tcindex_destroy() destroys all the filter results in the perfect hash table, it invokes the walker to delete each of them. However, results with class==0 are skipped in either tcindex_walk() or tcindex_delete(), which causes a memory leak reported by kmemleak. This patch fixes it by skipping the walker and directly deleting these filter results so we don't miss any filter result. As a result of this change, we have to initialize exts->net properly in tcindex_alloc_perfect_hash(). For net-next, we need to consider whether we should initialize ->net in tcf_exts_init() instead, before that just directly test CONFIG_NET_CLS_ACT=y. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net_sched: fix a race condition in tcindex_destroy()Cong Wang1-7/+11
tcindex_destroy() invokes tcindex_destroy_element() via a walker to delete each filter result in its perfect hash table, and tcindex_destroy_element() calls tcindex_delete() which schedules tcf RCU works to do the final deletion work. Unfortunately this races with the RCU callback __tcindex_destroy(), which could lead to use-after-free as reported by Adrian. Fix this by migrating this RCU callback to tcf RCU work too, as that workqueue is ordered, we will not have use-after-free. Note, we don't need to hold netns refcnt because we don't call tcf_exts_destroy() here. Fixes: 27ce4f05e2ab ("net_sched: use tcf_queue_work() in tcindex filter") Reported-by: Adrian <bugs@abtelecom.ro> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: unlock rules update APIVlad Buslov1-17/+114
Register netlink protocol handlers for message types RTM_NEWTFILTER, RTM_DELTFILTER, RTM_GETTFILTER as unlocked. Set rtnl_held variable that tracks rtnl mutex state to be false by default. Introduce tcf_proto_is_unlocked() helper that is used to check tcf_proto_ops->flag to determine if ops can be called without taking rtnl lock. Manually lookup Qdisc, class and block in rule update handlers. Verify that both Qdisc ops and proto ops are unlocked before using any of their callbacks, and obtain rtnl lock otherwise. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: refactor tcf_block_find() into standalone functionsVlad Buslov1-92/+149
Refactor tcf_block_find() code into three standalone functions: - __tcf_qdisc_find() to lookup Qdisc and increment its reference counter. - __tcf_qdisc_cl_find() to lookup class. - __tcf_block_find() to lookup block and increment its reference counter. This change is necessary to allow netlink tc rule update handlers to call these functions directly in order to conditionally take rtnl lock according to Qdisc class ops flags before calling any of class ops functions. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: extend proto ops to support unlocked classifiersVlad Buslov13-135/+178
Add 'rtnl_held' flag to tcf proto change, delete, destroy, dump, walk functions to track rtnl lock status. Extend users of these functions in cls API to propagate rtnl lock status to them. This allows classifiers to obtain rtnl lock when necessary and to pass rtnl lock status to extensions and driver offload callbacks. Add flags field to tcf proto ops. Add flag value to indicate that classifier doesn't require rtnl lock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: extend proto ops with 'put' callbackVlad Buslov1-1/+11
Add optional tp->ops->put() API to be implemented for filter reference counting. This new function is called by cls API to release filter reference for filters returned by tp->ops->change() or tp->ops->get() functions. Implement tfilter_put() helper to call tp->ops->put() only for classifiers that implement it. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
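The optional-callback idea can be illustrated with a small userspace model. `struct cls_ops`, `filter_put()` and `counting_put()` are hypothetical names with simplified signatures; the sketch only shows that the helper calls put() when the classifier provides one and there is a filter to release.

```c
#include <assert.h>
#include <stddef.h>

struct cls_ops {
	void *(*get)(unsigned int handle);
	void (*put)(void *filter);	/* optional: may be NULL */
};

/* Release a filter reference only if the classifier implements put(). */
static void filter_put(const struct cls_ops *ops, void *filter)
{
	assert(ops != NULL);
	if (ops->put && filter)
		ops->put(filter);
}

/* Example put() implementation that just counts how often it ran. */
static int puts_seen;

static void counting_put(void *filter)
{
	(void)filter;
	puts_seen++;
}
```

Classifiers that do not reference-count their filters leave `put` as NULL, and callers need no special casing.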
2019-02-12net: sched: track rtnl lock status when validating extensionsVlad Buslov12-15/+20
Actions API is already updated to not rely on rtnl lock for synchronization. However, it needs to be provided with rtnl status when called from classifiers API in order to be able to correctly release the lock when loading kernel module. Extend extension validation function with 'rtnl_held' flag which is passed to actions API. Add new 'rtnl_held' parameter to tcf_exts_validate() in cls API. No classifier is currently updated to support unlocked execution, so pass hardcoded 'true' flag parameter value. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: prevent insertion of new classifiers during chain flushVlad Buslov1-6/+29
Extend tcf_chain with 'flushing' flag. Use the flag to prevent insertion of new classifier instances when chain flushing is in progress in order to prevent resource leak when tcf_proto is created by unlocked users concurrently. Return EAGAIN error from tcf_chain_tp_insert_unique() to restart tc_new_tfilter() and lookup the chain/proto again. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
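A single-threaded userspace sketch of insertion that is refused while a flush is in progress. The types and `chain_insert_unique()` are hypothetical. The per-priority uniqueness check is included for completeness, and returning -EEXIST for a duplicate is a simplification chosen for this model rather than the kernel's behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct proto_node {
	unsigned int prio;
	struct proto_node *next;
};

struct chain_model {
	struct proto_node *head;	/* list sorted by ascending priority */
	bool flushing;			/* set while the chain is torn down */
};

/*
 * Insert tp unless the chain is being flushed or a classifier with the
 * same priority already exists. In concurrent code both checks and the
 * list update would have to happen under the chain's lock.
 */
static int chain_insert_unique(struct chain_model *c, struct proto_node *tp)
{
	struct proto_node **pp;

	assert(c != NULL && tp != NULL);
	if (c->flushing)
		return -EAGAIN;	/* caller should look the chain up again */
	for (pp = &c->head; *pp && (*pp)->prio <= tp->prio; pp = &(*pp)->next)
		if ((*pp)->prio == tp->prio)
			return -EEXIST;
	tp->next = *pp;
	*pp = tp;
	return 0;
}
```

A caller that receives -EAGAIN drops its references and restarts the lookup, so nothing is attached to a chain that is about to be emptied.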
2019-02-12net: sched: refactor tp insert/delete for concurrent executionVlad Buslov1-25/+152
Implement unique insertion function to atomically attach tcf_proto to chain after verifying that no other tcf proto with specified priority exists. Implement delete function that verifies that tp is actually empty before deleting it. Use these functions to refactor cls API to account for concurrent tp and rule update instead of relying on rtnl lock. Add new 'deleting' flag to tcf proto. Use it to restart search when iterating over tp's on chain to prevent accessing potentially invalid tp->next pointer. Extend tcf proto with spinlock that is intended to be used to protect its data from concurrent modification instead of relying on rtnl mutex. Use it to protect 'deleting' flag. Add lockdep macros to validate that lock is held when accessing protected fields. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: traverse classifiers in chain with tcf_get_next_proto()Vlad Buslov2-12/+62
All users of chain->filter_chain rely on rtnl lock and assume that no new classifier instances are added when traversing the list. Use tcf_get_next_proto() to traverse filters list without relying on rtnl mutex. This function iterates over classifiers by taking reference to current iterator classifier only and doesn't assume external synchronization of filters list. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: introduce reference counting for tcf_protoVlad Buslov1-10/+43
In order to remove dependency on rtnl lock and allow concurrent tcf_proto modification, extend tcf_proto with reference counter. Implement helper get/put functions for tcf proto and use them to modify cls API to always take reference to tcf_proto while using it. Only release reference to parent chain after releasing last reference to tp. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
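The get/put scheme, including the rule that the parent chain is released only after the last reference to the classifier is gone, can be sketched in userspace C with C11 atomics. `struct chain_ref`, `struct proto_ref` and the helper names are hypothetical; the kernel uses its own refcount primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct chain_ref {
	atomic_int refcnt;
};

struct proto_ref {
	atomic_int refcnt;
	struct chain_ref *chain;	/* parent, pinned while proto lives */
};

static struct proto_ref *proto_create(struct chain_ref *chain)
{
	struct proto_ref *tp;

	assert(chain != NULL);
	tp = malloc(sizeof(*tp));
	if (!tp)
		return NULL;
	atomic_init(&tp->refcnt, 1);		/* creator's reference */
	atomic_fetch_add(&chain->refcnt, 1);	/* proto pins its chain */
	tp->chain = chain;
	return tp;
}

static void proto_get(struct proto_ref *tp)
{
	atomic_fetch_add(&tp->refcnt, 1);
}

/* Returns true when this call dropped the last reference and freed tp. */
static bool proto_put(struct proto_ref *tp)
{
	struct chain_ref *chain = tp->chain;

	if (atomic_fetch_sub(&tp->refcnt, 1) != 1)
		return false;
	free(tp);
	atomic_fetch_sub(&chain->refcnt, 1);	/* only now release parent */
	return true;
}
```

Every user takes its own reference with `proto_get()` before touching the object, so a concurrent delete cannot free it while it is still in use.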
2019-02-12net: sched: protect filter_chain list with filter_chain_lock mutexVlad Buslov2-33/+84
Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain when accessing filter_chain list, instead of relying on rtnl lock. Dereference filter_chain with tcf_chain_dereference() lockdep macro to verify that all users of chain_list have the lock taken. Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute all necessary code while holding chain lock in order to prevent invalidation of chain_info structure by potential concurrent change. This also serializes calls to tcf_chain0_head_change(), which allows head change callbacks to rely on filter_chain_lock for synchronization instead of rtnl mutex. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12net: sched: protect chain template accesses with block lockVlad Buslov1-16/+57
When cls API is called without protection of rtnl lock, parallel modification of chain is possible, which means that chain template can be changed concurrently in certain circumstances. For example, when chain is 'deleted' by new user-space chain API, the chain might continue to be used if it is referenced by actions, and can be 're-created' again by user. In such case same chain structure is reused and its template is changed. To protect from described scenario, cache chain template while holding block lock. Introduce standalone tc_chain_notify_delete() function that works with cached template values, instead of chains themselves. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>