linux - Linux Kernel (branches are rebased on master from time to time)

Age	Commit message (Collapse)	Author	Files	Lines
2015-10-16	cfg80211: reg: fix antenna gain in chan_reg_rule_print_dbg()	Johannes Berg	1	-2/+2
	Printing "N/A mBi" is strange - print just "N/A" instead. Also add a missing opening parenthesis. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-16	cfg80211: reg: centralize freeing ignored requests	Johannes Berg	1	-29/+35
	Instead of having a lot of places that free ignored requests and then return REG_REQ_OK, make reg_process_hint() process REG_REQ_IGNORE by freeing the request, and let functions it calls return that instead of freeing. This also fixes a leak when a second (different) country IE hint was ignored. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-16	cfg80211: reg: clarify 'treatment' handling in reg_process_hint()	Johannes Berg	1	-7/+9
	This function can only deal with treatment values OK and ALREADY_SET so make the callees not return anything else and warn if they do. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-16	cfg80211: reg: rename reg_regdb_query() to reg_query_builtin()	Johannes Berg	1	-3/+3
	The new name better reflects the functionality. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-16	cfg80211: reg: make CRDA support optional	Johannes Berg	3	-73/+114
	If there's a built-in regulatory database, there may be little point in also calling out to CRDA and failing if the system is configured that way. Allow removing CRDA support to save ~1K kernel size. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: reg: remove useless reg_timeout scheduling	Johannes Berg	1	-8/+2
	When the functions reg_set_rd_driver() and reg_set_rd_country_ie() return with an error, the calling function already restores data by calling restore_regulatory_settings(), so there's no need to also schedule a timeout (which would lead to other side effects such as indicating CRDA failed, which clearly isn't true.) Remove the scheduling. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: reg: search built-in database directly	Johannes Berg	1	-44/+58
	Instead of searching the built-in database only in the worker, search it directly and return an error if the entry cannot be found (or memory cannot be allocated.) This means that builtin database queries no longer rely on the timeout. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: reg: rename reg_call_crda to reg_query_database	Johannes Berg	1	-5/+5
	The new name is more appropriate since in the case of a built-in database it may not really rely on CRDA. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: reg: fix reg_call_crda() return value bug	Johannes Berg	1	-31/+30
	The function reg_call_crda() can't actually validly return REG_REQ_IGNORE as it does now when calling CRDA fails since that return value isn't handled properly. Fix that. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: reg: remove useless non-NULL check	Johannes Berg	1	-3/+0
	There's no way that the alpha2 pointer can be NULL, so no point in checking that it isn't. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-15	cfg80211: fix gHz to GHz	Johannes Berg	2	-2/+2
	There's no "g" prefix, only "G" (1e9) that was clearly intended here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-14	mac80211: remove event.c	Johannes Berg	4	-34/+6
	That file contains just a single function, which itself is just a single statement to call a different function. Remove it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-14	mac80211: remove cfg.h	Johannes Berg	4	-11/+2
	The file contains just a single declaration that can easily move to another file - remove it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-14	mac80211: move sta_set_rate_info_rx() and make it static	Johannes Berg	3	-41/+39
	There's only a single caller of this function, so it can be moved to the same file and made static. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-14	mac80211: clean up ieee80211_rx_h_check_dup code	Johannes Berg	1	-10/+10
	Reduce indentation a bit to make the condition more readable. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-14	mac80211: remove PM-QoS listener	Johannes Berg	10	-87/+21
	As this API has never really seen any use and most drivers don't ever use the value derived from it, remove it. Change the only driver using it (rt2x00) to simply use the DTIM period instead of the "max sleep" time. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	mac80211: use new cfg80211_inform_bss_frame_data() API	Johannes Berg	2	-15/+15
	The new API is more easily extensible with a metadata struct passed to it, use it in mac80211. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	mac80211: Do not restart scheduled scan if multiple scan plans are set	Avraham Stern	1	-2/+6
	If multiple scan plans were set for scheduled scan, do not restart scheduled scan on reconfig because it is possible that some scan plans were already completed and there is no need to run them all over again. Instead, notify userspace that scheduled scan stopped so it can configure new scan plans for scheduled scan. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	cfg80211: Add multiple scan plans for scheduled scan	Avraham Stern	10	-25/+262
	Add the option to configure multiple 'scan plans' for scheduled scan. Each 'scan plan' defines the number of scan cycles and the interval between scans. The scan plans are executed in the order they were configured. The last scan plan will always run infinitely and thus defines only the interval between scans. The maximum number of scan plans supported by the device and the maximum number of iterations in a single scan plan are advertised to userspace so it can configure the scan plans appropriately. When scheduled scan results are received there is no way to know which scan plan is being currently executed, so there is no way to know when the next scan iteration will start. This is not a problem, however. The scan start timestamp is only used for flushing old scan results, and there is no difference between flushing all results received until the end of the previous iteration or the start of the current one, since no results will be received in between. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	wireless: add WNM action frame categories	Johannes Berg	1	-0/+3
	Add the WNM and unprotected WNM categories and mark the latter as not robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	wireless: update robust action frame list	Johannes Berg	1	-0/+2
	Unprotected DMG and VHT action frames are not protected, reflect that in the list. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	nl80211: allow BSS data to include CLOCK_BOOTTIME timestamp	Dmitry Shmidt	7	-60/+130
	For location and connectivity services, userspace would often like to know the time when the BSS was last seen. The current "last seen" value is calculated in a way that makes it less useful, especially if the system suspended in the meantime. Add the ability for the driver to report a real CLOCK_BOOTTIME stamp that can then be reported to userspace (if present). Drivers wishing to use this must be converted to the new API to call cfg80211_inform_bss_data() or cfg80211_inform_bss_frame_data(). They need to ensure the reported value is accurate enough even when the frame might have been buffered in the device (e.g. firmware.) Signed-off-by: Dmitry Shmidt <dimitrysh@google.com> [modified to use struct, inlines] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	Revert "mac80211: remove exposing 'mfp' to drivers"	Tamizh chelvam	3	-1/+8
	This reverts commit 5c48f1201744233d4f235c7dd916d5196ed20716. Some device drivers (ath10k) offload part of aggregation including AddBA/DelBA negotiations to firmware. In such scenario, the PMF configuration of the station needs to be provided to driver to enable encryption of AddBA/DelBA action frames. Signed-off-by: Tamizh chelvam <c_traja@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	Merge remote-tracking branch 'net-next/master' into mac80211-next	Johannes Berg	9728	-249779/+529550
	Merge net-next to get some driver changes that patches depend on (in order to avoid conflicts). Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-12	bridge: vlan: enforce no pvid flag in vlan ranges	Nikolay Aleksandrov	1	-0/+3
	Currently it's possible for someone to send a vlan range to the kernel with the pvid flag set which will result in the pvid bouncing from a vlan to vlan and isn't correct, it also introduces problems for hardware where it doesn't make sense having more than 1 pvid. iproute2 already enforces this, so let's enforce it on kernel-side as well. Reported-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	atm: iphase: fix misleading indention	Tillmann Heidsieck	1	-1/+1
	Fix a smatch warning: drivers/atm/iphase.c:1178 rx_pkt() warn: curly braces intended? The code is correct, the indention is misleading. In case the allocation of skb fails, we want to skip to the end. Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	atm: iphase: return -ENOMEM instead of -1 in case of failed kmalloc()	Tillmann Heidsieck	1	-1/+2
	Smatch complains about returning hard coded error codes, silence this warning. drivers/atm/iphase.c:115 ia_enque_rtn_q() warn: returning -1 instead of -ENOMEM is sloppy Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv6 route: use err pointers instead of returning pointer by reference	Roopa Prabhu	1	-15/+17
	This patch makes ip6_route_info_create return err pointer instead of returning the rt pointer by reference as suggested by Dave Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: hns: fix the unknown phy_nterface_t type error	huangdaode	1	-0/+1
	This patch fix the building error reported by Jiri Pirko <jiri@resnulli.us> drivers/net/ethernet/hisilicon/hns/hnae.h:465:2: error: unknown type name 'phy_interface_t' phy_interface_t phy_if; ^ the full build log is on https://lists.01.org/pipermail/kbuild-all. Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: yankejian <yankejian@huawei.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	tun: use sk_fullsock() before reading sk->sk_tsflags	Eric Dumazet	1	-1/+1
	timewait or request sockets are small and do not contain sk->sk_tsflags Without this fix, we might read garbage, and crash later in __skb_complete_tx_timestamp() -> sock_queue_err_skb() (These pseudo sockets do not have an error queue either) Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'netns-defrag'	David S. Miller	11	-26/+27
	Eric W. Biederman says: ==================== net: Pass net into defragmentation This is the next installment of my work to pass struct net through the output path so the code does not need to guess how to figure out which network namespace it is in, and ultimately routes can have output devices in another network namespace. In netfilter and af_packet we defragment packets in the output path, and there is the usual amount of confusion about how to compute which net we are processing the packets in. This patchset clears that confusion up by explicitly passing in struct net in ip_defrag, ip_check_defrag, and nf_ct_frag6_gather. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv6: Pass struct net into nf_ct_frag6_gather	Eric W. Biederman	4	-6/+5
	The function nf_ct_frag6_gather is called on both the input and the output paths of the networking stack. In particular ipv6_defrag which calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain on input and the LOCAL_OUT chain on output. The addition of a net parameter makes it explicit which network namespace the packets are being reassembled in, and removes the need for nf_ct_frag6_gather to guess. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4: Pass struct net into ip_defrag and ip_check_defrag	Eric W. Biederman	8	-19/+20
	The function ip_defrag is called on both the input and the output paths of the networking stack. In particular conntrack when it is tracking outbound packets from the local machine calls ip_defrag. So add a struct net parameter and stop making ip_defrag guess which network namespace it needs to defragment packets in. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4: Only compute net once in ip_call_ra_chain	Eric W. Biederman	1	-1/+2
	ip_call_ra_chain is called early in the forwarding chain from ip_forward and ip_mr_input, which makes skb->dev the correct expression to get the input network device and dev_net(skb->dev) a correct expression for the network namespace the packet is being processed in. Compute the network namespace and store it in a variable to make the code clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	packet: fix match_fanout_group()	Eric Dumazet	1	-3/+3
	Recent TCP listener patches exposed a prior af_packet bug : match_fanout_group() blindly assumes it is always safe to cast sk to a packet socket to compare fanout with af_packet_priv But SYNACK packets can be sent while attached to request_sock, which are smaller than a "struct sock". We can read non existent memory and crash. Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group") Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of ↵	David S. Miller	117	-1410/+2677
	git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== Major changes: iwlwifi * some debugfs improvements * fix signedness in beacon statistics * deinline some functions to reduce size when device tracing is enabled * filter beacons out in AP mode when no stations are associated * deprecate firmwares version -12 * fix a runtime PM vs. legacy suspend race * one-liner fix for a ToF bug * clean-ups in the rx code * small debugging improvement * fix WoWLAN with new firmware versions * more clean-ups towards multiple RX queues; * some rate scaling fixes and improvements; * some time-of-flight fixes; * other generic improvements and clean-ups; brcmfmac * rework code dealing with multiple interfaces * allow logging firmware console using debug level * support for BCM4350, BCM4365, and BCM4366 PCIE devices * fixed for legacy P2P and P2P device handling * correct set and get tx-power ath9k * add support for Outside Context of a BSS (OCB) mode mwifiex * add USB multichannel feature ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4/icmp: redirect messages can use the ingress daddr as source	Paolo Abeni	4	-3/+33
	This patch allows configuring how the source address of ICMP redirect messages is selected; by default the old behaviour is retained, while setting icmp_redirects_use_orig_daddr force the usage of the destination address of the packet that caused the redirect. The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the following scenario: Two machines are set up with VRRP to act as routers out of a subnet, they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to x.x.x.254/24. If a host in said subnet needs to get an ICMP redirect from the VRRP router, i.e. to reach a destination behind a different gateway, the source IP in the ICMP redirect is chosen as the primary IP on the interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2. The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2, and will continue to use the wrong next-op. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bridge: try switchdev op first in __vlan_vid_add/del	Jiri Pirko	1	-36/+22
	Some drivers need to implement both switchdev vlan ops and vid_add/kill ndos. For that to work in bridge code, we need to try switchdev op first when adding/deleting vlan id. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	BNX2: free temp_stats_blk on error path	wangweidong	1	-0/+2
	In bnx2_init_board, missing free temp_stats_blk on error path when some operations do failed. Just add the 'kfree' operation. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'setsockopt_incoming_cpu'	David S. Miller	16	-42/+72
	Eric Dumazet says: ==================== tcp: better smp listener behavior As promised in last patch series, we implement a better SO_REUSEPORT strategy, based on cpu hints if given by the application. We also moved sk_refcnt out of the cache line containing the lookup keys, as it was considerably slowing down smp operations because of false sharing. This was simpler than converting listen sockets to conventional RCU (to avoid sk_refcnt dirtying) Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	tcp: shrink tcp_timewait_sock by 8 bytes	Eric Dumazet	2	-2/+4
	Reducing tcp_timewait_sock from 280 bytes to 272 bytes allows SLAB to pack 15 objects per page instead of 14 (on x86) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: shrink struct sock and request_sock by 8 bytes	Eric Dumazet	9	-25/+28
	One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: align sk_refcnt on 128 bytes boundary	Eric Dumazet	3	-5/+16
	sk->sk_refcnt is dirtied for every TCP/UDP incoming packet. This is a performance issue if multiple cpus hit a common socket, or multiple sockets are chained due to SO_REUSEPORT. By moving sk_refcnt 8 bytes further, first 128 bytes of sockets are mostly read. As they contain the lookup keys, this has a considerable performance impact, as cpus can cache them. These 8 bytes are not wasted, we use them as a place holder for various fields, depending on the socket type. Tested: SYN flood hitting a 16 RX queues NIC. TCP listener using 16 sockets and SO_REUSEPORT and SO_INCOMING_CPU for proper siloing. Could process 6.0 Mpps SYN instead of 4.2 Mpps Kernel profile looked like : 11.68% [kernel] [k] sha_transform 6.51% [kernel] [k] __inet_lookup_listener 5.07% [kernel] [k] __inet_lookup_established 4.15% [kernel] [k] memcpy_erms 3.46% [kernel] [k] ipt_do_table 2.74% [kernel] [k] fib_table_lookup 2.54% [kernel] [k] tcp_make_synack 2.34% [kernel] [k] tcp_conn_request 2.05% [kernel] [k] __netif_receive_skb_core 2.03% [kernel] [k] kmem_cache_alloc Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: SO_INCOMING_CPU setsockopt() support	Eric Dumazet	6	-11/+25
	SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	packet: support per-packet fwmark for af_packet sendmsg	Edward Jee	1	-1/+9
	Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	sock: support per-packet fwmark	Edward Jee	2	-0/+33
	It's useful to allow users to set fwmark for an individual packet, without changing the socket state. The function this patch adds in sock layer can be used by the protocols that need such a feature. Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'bpf-unprivileged'	David S. Miller	10	-27/+547
	Alexei Starovoitov says: ==================== bpf: unprivileged v1-v2: - this set logically depends on cb patch "bpf: fix cb access in socket filter programs": http://patchwork.ozlabs.org/patch/527391/ which is must have to allow unprivileged programs. Thanks Daniel for finding that issue. - refactored sysctl to be similar to 'modules_disabled' - dropped bpf_trace_printk - split tests into separate patch and added more tests based on discussion v1 cover letter: I think it is time to liberate eBPF from CAP_SYS_ADMIN. As was discussed when eBPF was first introduced two years ago the only piece missing in eBPF verifier is 'pointer leak detection' to make it available to non-root users. Patch 1 adds this pointer analysis. The eBPF programs, obviously, need to see and operate on kernel addresses, but with these extra checks they won't be able to pass these addresses to user space. Patch 2 adds accounting of kernel memory used by programs and maps. It changes behavoir for existing root users, but I think it needs to be done consistently for both root and non-root, since today programs and maps are only limited by number of open FDs (RLIMIT_NOFILE). Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK. Unprivileged eBPF is only meaningful for 'socket filter'-like programs. eBPF programs for tracing and TC classifiers/actions will stay root only. In parallel the bpf fuzzing effort is ongoing and so far we've found only one verifier bug and that was already fixed. The 'constant blinding' pass also being worked on. It will obfuscate constant-like values that are part of eBPF ISA to make jit spraying attacks even harder. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: add unprivileged bpf tests	Alexei Starovoitov	2	-10/+355
	Add new tests samples/bpf/test_verifier: unpriv: return pointer checks that pointer cannot be returned from the eBPF program unpriv: add const to pointer unpriv: add pointer to pointer unpriv: neg pointer checks that pointer arithmetic is disallowed unpriv: cmp pointer with const unpriv: cmp pointer with pointer checks that comparison of pointers is disallowed Only one case allowed 'void value = bpf_map_lookup_elem(..); if (value == 0) ...' unpriv: check that printk is disallowed since bpf_trace_printk is not available to unprivileged unpriv: pass pointer to helper function checks that pointers cannot be passed to functions that expect integers If function expects a pointer the verifier allows only that type of pointer. Like 1st argument of bpf_map_lookup_elem() must be pointer to map. (applies to non-root as well) unpriv: indirectly pass pointer on stack to helper function checks that pointer stored into stack cannot be used as part of key passed into bpf_map_lookup_elem() unpriv: mangle pointer on stack 1 unpriv: mangle pointer on stack 2 checks that writing into stack slot that already contains a pointer is disallowed unpriv: read pointer from stack in small chunks checks that < 8 byte read from stack slot that contains a pointer is disallowed unpriv: write pointer into ctx checks that storing pointers into skb->fields is disallowed unpriv: write pointer into map elem value checks that storing pointers into element values is disallowed For example: int bpf_prog(struct __sk_buff skb) { u32 key = 0; u64 value = bpf_map_lookup_elem(&map, &key); if (value) value = (u64) skb; } will be rejected. unpriv: partial copy of pointer checks that doing 32-bit register mov from register containing a pointer is disallowed unpriv: pass pointer to tail_call checks that passing pointer as an index into bpf_tail_call is disallowed unpriv: cmp map pointer with zero checks that comparing map pointer with constant is disallowed unpriv: write into frame pointer checks that frame pointer is read-only (applies to root too) unpriv: cmp of frame pointer checks that R10 cannot be using in comparison unpriv: cmp of stack pointer checks that Rx = R10 - imm is ok, but comparing Rx is not unpriv: obfuscate stack pointer checks that Rx = R10 - imm is ok, but Rx -= imm is not Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: charge user for creation of BPF maps and programs	Alexei Starovoitov	5	-2/+72
	since eBPF programs and maps use kernel memory consider it 'locked' memory from user accounting point of view and charge it against RLIMIT_MEMLOCK limit. This limit is typically set to 64Kbytes by distros, so almost all bpf+tracing programs would need to increase it, since they use maps, but kernel charges maximum map size upfront. For example the hash map of 1024 elements will be charged as 64Kbyte. It's inconvenient for current users and changes current behavior for root, but probably worth doing to be consistent root vs non-root. Similar accounting logic is done by mmap of perf_event. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: enable non-root eBPF programs	Alexei Starovoitov	5	-15/+120
	In order to let unprivileged users load and execute eBPF programs teach verifier to prevent pointer leaks. Verifier will prevent - any arithmetic on pointers (except R10+Imm which is used to compute stack addresses) - comparison of pointers (except if (map_value_ptr == 0) ... ) - passing pointers to helper functions - indirectly passing pointers in stack to helper functions - returning pointer from bpf program - storing pointers into ctx or maps Spill/fill of pointers into stack is allowed, but mangling of pointers stored in the stack or reading them byte by byte is not. Within bpf programs the pointers do exist, since programs need to be able to access maps, pass skb pointer to LD_ABS insns, etc but programs cannot pass such pointer values to the outside or obfuscate them. Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs, so that socket filters (tcpdump), af_packet (quic acceleration) and future kcm can use it. tracing and tc cls/act program types still require root permissions, since tracing actually needs to be able to see all kernel pointers and tc is for root only. For example, the following unprivileged socket filter program is allowed: int bpf_prog1(struct __sk_buff skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 value = bpf_map_lookup_elem(&my_map, &index); if (value) value += skb->len; return 0; } but the following program is not: int bpf_prog1(struct __sk_buff skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 value = bpf_map_lookup_elem(&my_map, &index); if (value) value += (u64) skb; return 0; } since it would leak the kernel address into the map. Unprivileged socket filter bpf programs have access to the following helper functions: - map lookup/update/delete (but they cannot store kernel pointers into them) - get_random (it's already exposed to unprivileged user space) - get_smp_processor_id - tail_call into another socket filter program - ktime_get_ns The feature is controlled by sysctl kernel.unprivileged_bpf_disabled. This toggle defaults to off (0), but can be set true (1). Once true, bpf programs and maps cannot be accessed from unprivileged process, and the toggle cannot be set back to false. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>