summaryrefslogtreecommitdiffstats
path: root/kernel/bpf/verifier.c
AgeCommit message (Collapse)AuthorFilesLines
2018-08-17bpf: fix redirect to map under tail callsDaniel Borkmann1-21/+0
Commits 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") and 7c3001313396 ("bpf: fix ri->map_owner pointer on bpf_prog_realloc") tried to mitigate that buggy programs using bpf_redirect_map() helper call do not leave stale maps behind. Idea was to add a map_owner cookie into the per CPU struct redirect_info which was set to prog->aux by the prog making the helper call as a proof that the map is not stale since the prog is implicitly holding a reference to it. This owner cookie could later on get compared with the program calling into BPF whether they match and therefore the redirect could proceed with processing the map safely. In (obvious) hindsight, this approach breaks down when tail calls are involved since the original caller's prog->aux pointer does not have to match the one from one of the progs out of the tail call chain, and therefore the xdp buffer will be dropped instead of redirected. A way around that would be to fix the issue differently (which also allows to remove related work in fast path at the same time): once the life-time of a redirect map has come to its end we use it's map free callback where we need to wait on synchronize_rcu() for current outstanding xdp buffers and remove such a map pointer from the redirect info if found to be present. At that time no program is using this map anymore so we simply invalidate the map pointers to NULL iff they previously pointed to that instance while making sure that the redirect path only reads out the map once. Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine") Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") Reported-by: Sebastiano Miano <sebastiano.miano@polito.it> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-11bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORTMartin KaFai Lau1-0/+9
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN. BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern" to store the bpf context instead of using the skb->cb[48]. At the SO_REUSEPORT sk lookup time, it is in the middle of transiting from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this point, it is not always clear where the bpf context can be appended in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not clear if the lower layer is only ipv4 and ipv6 in the future and will it not touch the cb[] again before transiting to the upper layer. For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB instead of IP[6]CB and it may still modify the cb[] after calling the udp[46]_lib_lookup_skb(). Because of the above reason, if sk->cb is used for the bpf ctx, saving-and-restoring is needed and likely the whole 48 bytes cb[] has to be saved and restored. Instead of saving, setting and restoring the cb[], this patch opts to create a new "struct sk_reuseport_kern" and setting the needed values in there. The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)" will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol specific usage at this point and it is also inline with the current sock_reuseport.c implementation (i.e. no protocol specific requirement). In "struct sk_reuseport_md", this patch exposes data/data_end/len with semantic similar to other existing usages. Together with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()", the bpf prog can peek anywhere in the skb. The "bind_inany" tells the bpf prog that the reuseport group is bind-ed to a local INANY address which cannot be learned from skb. The new "bind_inany" is added to "struct sock_reuseport" which will be used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order to avoid repeating the "bind INANY" test on "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can only be properly initialized when a "sk->sk_reuseport" enabled sk is adding to a hashtable (i.e. during "reuseport_alloc()" and "reuseport_add_sock()"). The new "sk_select_reuseport()" is the main helper that the bpf prog will use to select a SO_REUSEPORT sk. It is the only function that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in the earlier patch, the validity of a selected sk is checked in run time in "sk_select_reuseport()". Doing the check in verification time is difficult and inflexible (consider the map-in-map use case). The runtime check is to compare the selected sk's reuseport_id with the reuseport_id that we want. This helper will return -EXXX if the selected sk cannot serve the incoming request (e.g. reuseport_id not match). The bpf prog can decide if it wants to do SK_DROP as its discretion. When the bpf prog returns SK_PASS, the kernel will check if a valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL"). If it does , it will use the selected sk. If not, the kernel will select one from "reuse->socks[]" (as before this patch). The SK_DROP and SK_PASS handling logic will be in the next patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03bpf: introduce the bpf_get_local_storage() helper functionRoman Gushchin1-0/+18
The bpf_get_local_storage() helper function is used to get a pointer to the bpf local storage from a bpf program. It takes a pointer to a storage map and flags as arguments. Right now it accepts only cgroup storage maps, and flags argument has to be 0. Further it can be extended to support other types of local storage: e.g. thread local storage etc. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03bpf/verifier: introduce BPF_PTR_TO_MAP_VALUERoman Gushchin1-2/+6
BPF_MAP_TYPE_CGROUP_STORAGE maps are special in a way that the access from the bpf program side is lookup-free. That means the result is guaranteed to be a valid pointer to the cgroup storage; no NULL-check is required. This patch introduces BPF_PTR_TO_MAP_VALUE return type, which is required to cause the verifier accept programs, which are not checking the map value pointer for being NULL. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03bpf: introduce cgroup storage mapsRoman Gushchin1-0/+12
This commit introduces BPF_MAP_TYPE_CGROUP_STORAGE maps: a special type of maps which are implementing the cgroup storage. >From the userspace point of view it's almost a generic hash map with the (cgroup inode id, attachment type) pair used as a key. The only difference is that some operations are restricted: 1) a user can't create new entries, 2) a user can't remove existing entries. The lookup from userspace is o(log(n)). Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-31bpf: verifier: MOV64 don't mark dst reg unboundedArthur Fabre1-2/+4
When check_alu_op() handles a BPF_MOV64 between two registers, it calls check_reg_arg(DST_OP) on the dst register, marking it as unbounded. If the src and dst register are the same, this marks the src as unbounded, which can lead to unexpected errors for further checks that rely on bounds info. For example: BPF_MOV64_IMM(BPF_REG_2, 0), BPF_MOV64_REG(BPF_REG_2, BPF_REG_2), BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_2), BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), Results in: "math between ctx pointer and register with unbounded min value is not allowed" check_alu_op() now uses check_reg_arg(DST_OP_NO_MARK), and MOVs that need to mark the dst register (MOVIMM, MOV32) do so. Added a test case for MOV64 dst == src, and dst != src. Signed-off-by: Arthur Fabre <afabre@cloudflare.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-1/+1
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-07-20 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add sharing of BPF objects within one ASIC: this allows for reuse of the same program on multiple ports of a device, and therefore gains better code store utilization. On top of that, this now also enables sharing of maps between programs attached to different ports of a device, from Jakub. 2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature detections and unused variable exports, also from Jakub. 3) First batch of RCU annotation fixes in prog array handling, i.e. there are several __rcu markers which are not correct as well as some of the RCU handling, from Roman. 4) Two fixes in BPF sample files related to checking of the prog_cnt upper limit from sample loader, from Dan. 5) Minor cleanup in sockmap to remove a set but not used variable, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-18bpf: offload: rename bpf_offload_dev_match() to bpf_offload_prog_map_match()Jakub Kicinski1-1/+1
A set of new API functions exported for the drivers will soon use 'bpf_offload_dev_' as a prefix. Rename the bpf_offload_dev_match() which is internal to the core (used by the verifier) to avoid any confusion. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-07-12bpf: don't leave partial mangled prog in jit_subprogs error pathDaniel Borkmann1-2/+9
syzkaller managed to trigger the following bug through fault injection: [...] [ 141.043668] verifier bug. No program starts at insn 3 [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.044648] WARNING: CPU: 3 PID: 4072 at kernel/bpf/verifier.c:1613 bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.047355] CPU: 3 PID: 4072 Comm: a.out Not tainted 4.18.0-rc4+ #51 [ 141.048446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS 1.10.2-1 04/01/2014 [ 141.049877] Call Trace: [ 141.050324] __dump_stack lib/dump_stack.c:77 [inline] [ 141.050324] dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 [ 141.050950] ? dump_stack_print_info.cold.2+0x52/0x52 lib/dump_stack.c:60 [ 141.051837] panic+0x238/0x4e7 kernel/panic.c:184 [ 141.052386] ? add_taint.cold.5+0x16/0x16 kernel/panic.c:385 [ 141.053101] ? __warn.cold.8+0x148/0x1ba kernel/panic.c:537 [ 141.053814] ? __warn.cold.8+0x117/0x1ba kernel/panic.c:530 [ 141.054506] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.054506] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.054506] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [ 141.055163] __warn.cold.8+0x163/0x1ba kernel/panic.c:538 [ 141.055820] ? get_callee_stack_depth kernel/bpf/verifier.c:1612 [inline] [ 141.055820] ? fixup_call_args kernel/bpf/verifier.c:5587 [inline] [ 141.055820] ? bpf_check+0x525e/0x5e60 kernel/bpf/verifier.c:5952 [...] What happens in jit_subprogs() is that kcalloc() for the subprog func buffer is failing with NULL where we then bail out. Latter is a plain return -ENOMEM, and this is definitely not okay since earlier in the loop we are walking all subprogs and temporarily rewrite insn->off to remember the subprog id as well as insn->imm to temporarily point the call to __bpf_call_base + 1 for the initial JIT pass. Thus, bailing out in such state and handing this over to the interpreter is troublesome since later/subsequent e.g. find_subprog() lookups are based on wrong insn->imm. Therefore, once we hit this point, we need to jump to out_free path where we undo all changes from earlier loop, so that interpreter can work on unmodified insn->{off,imm}. Another point is that should find_subprog() fail in jit_subprogs() due to a verifier bug, then we also should not simply defer the program to the interpreter since also here we did partial modifications. Instead we should just bail out entirely and return an error to the user who is trying to load the program. Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Reported-by: syzbot+7d427828b2ea6e592804@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-12treewide: Use array_size() in vzalloc()Kees Cook1-3/+5
The vzalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vzalloc(a * b) with: vzalloc(array_size(a, b)) as well as handling cases of: vzalloc(a * b * c) with: vzalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vzalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | vzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(char) * COUNT + COUNT , ...) | vzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vzalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vzalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vzalloc(C1 * C2 * C3, ...) | vzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vzalloc(C1 * C2, ...) | vzalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook1-1/+1
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-07bpf: reject passing modified ctx to helper functionsDaniel Borkmann1-17/+31
As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on context pointer") already describes, f1174f77b50c ("bpf/verifier: rework value tracking") removed the specific white-listed cases we had previously where we would allow for pointer arithmetic in order to further generalize it, and allow e.g. context access via modified registers. While the dereferencing of modified context pointers had been forbidden through 28e33f9d78ee, syzkaller did recently manage to trigger several KASAN splats for slab out of bounds access and use after frees by simply passing a modified context pointer to a helper function which would then do the bad access since verifier allowed it in adjust_ptr_min_max_vals(). Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() generally could break existing programs as there's a valid use case in tracing in combination with passing the ctx to helpers as bpf_probe_read(), where the register then becomes unknown at verification time due to adding a non-constant offset to it. An access sequence may look like the following: offset = args->filename; /* field __data_loc filename */ bpf_probe_read(&dst, len, (char *)args + offset); // args is ctx There are two options: i) we could special case the ctx and as soon as we add a constant or bounded offset to it (hence ctx type wouldn't change) we could turn the ctx into an unknown scalar, or ii) we generalize the sanity test for ctx member access into a small helper and assert it on the ctx register that was passed as a function argument. Fwiw, latter is more obvious and less complex at the same time, and one case that may potentially be legitimate in future for ctx member access at least would be for ctx to carry a const offset. Therefore, fix follows approach from ii) and adds test cases to BPF kselftests. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Reported-by: syzbot+3d0b2441dbb71751615e@syzkaller.appspotmail.com Reported-by: syzbot+c8504affd4fdd0c1b626@syzkaller.appspotmail.com Reported-by: syzbot+e5190cb881d8660fb1a3@syzkaller.appspotmail.com Reported-by: syzbot+efae31b384d5badbd620@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: fix context access in tracing progs on 32 bit archsDaniel Borkmann1-1/+2
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: avoid retpoline for lookup/update/delete calls on mapsDaniel Borkmann1-19/+49
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-03bpf: fixup error message from gpl helpers on license mismatchDaniel Borkmann1-1/+1
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-25/+120
Lots of easy overlapping changes in the confict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-22/+64
Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
2018-05-24bpf: properly enforce index mask to prevent out-of-bounds speculationDaniel Borkmann1-22/+64
While reviewing the verifier code, I recently noticed that the following two program variants in relation to tail calls can be loaded. Variant 1: # bpftool p d x i 15 0: (15) if r1 == 0x0 goto pc+3 1: (18) r2 = map[id:5] 3: (05) goto pc+2 4: (18) r2 = map[id:6] 6: (b7) r3 = 7 7: (35) if r3 >= 0xa0 goto pc+2 8: (54) (u32) r3 &= (u32) 255 9: (85) call bpf_tail_call#12 10: (b7) r0 = 1 11: (95) exit # bpftool m s i 5 5: prog_array flags 0x0 key 4B value 4B max_entries 4 memlock 4096B # bpftool m s i 6 6: prog_array flags 0x0 key 4B value 4B max_entries 160 memlock 4096B Variant 2: # bpftool p d x i 20 0: (15) if r1 == 0x0 goto pc+3 1: (18) r2 = map[id:8] 3: (05) goto pc+2 4: (18) r2 = map[id:7] 6: (b7) r3 = 7 7: (35) if r3 >= 0x4 goto pc+2 8: (54) (u32) r3 &= (u32) 3 9: (85) call bpf_tail_call#12 10: (b7) r0 = 1 11: (95) exit # bpftool m s i 8 8: prog_array flags 0x0 key 4B value 4B max_entries 160 memlock 4096B # bpftool m s i 7 7: prog_array flags 0x0 key 4B value 4B max_entries 4 memlock 4096B In both cases the index masking inserted by the verifier in order to control out of bounds speculation from a CPU via b2157399cc98 ("bpf: prevent out-of-bounds speculation") seems to be incorrect in what it is enforcing. In the 1st variant, the mask is applied from the map with the significantly larger number of entries where we would allow to a certain degree out of bounds speculation for the smaller map, and in the 2nd variant where the mask is applied from the map with the smaller number of entries, we get buggy behavior since we truncate the index of the larger map. The original intent from commit b2157399cc98 is to reject such occasions where two or more different tail call maps are used in the same tail call helper invocation. However, the check on the BPF_MAP_PTR_POISON is never hit since we never poisoned the saved pointer in the first place! We do this explicitly for map lookups but in case of tail calls we basically used the tail call map in insn_aux_data that was processed in the most recent path which the verifier walked. Thus any prior path that stored a pointer in insn_aux_data at the helper location was always overridden. Fix it by moving the map pointer poison logic into a small helper that covers both BPF helpers with the same logic. After that in fixup_bpf_calls() the poison check is then hit for tail calls and the program rejected. Latter only happens in unprivileged case since this is the *only* occasion where a rewrite needs to happen, and where such rewrite is specific to the map (max_entries, index_mask). In the privileged case the rewrite is generic for the insn->imm / insn->code update so multiple maps from different paths can be handled just fine since all the remaining logic happens in the instruction processing itself. This is similar to the case of map lookups: in case there is a collision of maps in fixup_bpf_calls() we must skip the inlined rewrite since this will turn the generic instruction sequence into a non- generic one. Thus the patch_call_imm will simply update the insn->imm location where the bpf_map_lookup_elem() will later take care of the dispatch. Given we need this 'poison' state as a check, the information of whether a map is an unpriv_array gets lost, so enforcing it prior to that needs an additional state. In general this check is needed since there are some complex and tail call intensive BPF programs out there where LLVM tends to generate such code occasionally. We therefore convert the map_ptr rather into map_state to store all this w/o extra memory overhead, and the bit whether one of the maps involved in the collision was from an unpriv_array thus needs to be retained as well there. Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24ipv6: sr: Add seg6local action End.BPFMathieu Xhonneux1-0/+1
This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment, the BPF program does not have to take care of this. Since the BPF program may not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track if the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to directly write into the packet, and only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes. Performances profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed as invalid, the packet is dropped. This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action. The BPF program may return 3 types of return codes: - BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop. - BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed. - BPF_DROP : the packet will be dropped. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: get kernel symbol addresses via syscallSandipan Das1-6/+1
This adds new two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of kernel symbol addresses for all functions in a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command. When bpf_jit_kallsyms is enabled, we can get the address of the corresponding kernel symbol for a callee function and resolve the symbol's name. The address is determined by adding the value of the call instruction's imm field to __bpf_call_base. This offset gets assigned to the imm field by the verifier. For some architectures, such as powerpc64, the imm field is not large enough to hold this offset. We resolve this by: [1] Assigning the subprog id to the imm field of a call instruction in the verifier instead of the offset of the callee's symbol's address from __bpf_call_base. [2] Determining the address of a callee's corresponding symbol by using the imm field as an index for the list of kernel symbol addresses now available from the program info. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: support 64-bit offsets for bpf function callsSandipan Das1-1/+14
The imm field of a bpf instruction is a signed 32-bit integer. For JITed bpf-to-bpf function calls, it holds the offset of the start address of the callee's JITed image from __bpf_call_base. For some architectures, such as powerpc64, this offset may be as large as 64 bits and cannot be accomodated in the imm field without truncation. We resolve this by: [1] Additionally using the auxiliary data of each function to keep a list of start addresses of the JITed images for all functions determined by the verifier. [2] Retaining the subprog id inside the off field of the call instructions and using it to index into the list mentioned above and lookup the callee's address. To make sure that the existing JIT compilers continue to work without requiring changes, we keep the imm field as it is. Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-19bpf: Prevent memory disambiguation attackAlexei Starovoitov1-3/+56
Detect code patterns where malicious 'speculative store bypass' can be used and sanitize such patterns. 39: (bf) r3 = r10 40: (07) r3 += -216 41: (79) r8 = *(u64 *)(r7 +0) // slow read 42: (7a) *(u64 *)(r10 -72) = 0 // verifier inserts this instruction 43: (7b) *(u64 *)(r8 +0) = r3 // this store becomes slow due to r8 44: (79) r1 = *(u64 *)(r6 +0) // cpu speculatively executes this load 45: (71) r2 = *(u8 *)(r1 +0) // speculatively arbitrary 'load byte' // is now sanitized Above code after x86 JIT becomes: e5: mov %rbp,%rdx e8: add $0xffffffffffffff28,%rdx ef: mov 0x0(%r13),%r14 f3: movq $0x0,-0x48(%rbp) fb: mov %rdx,0x0(%r14) ff: mov 0x0(%rbx),%rdi 103: movzbq 0x0(%rdi),%rsi Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-15bpf: sockmap, add hash map supportJohn Fastabend1-2/+12
Sockmap is currently backed by an array and enforces keys to be four bytes. This works well for many use cases and was originally modeled after devmap which also uses four bytes keys. However, this has become limiting in larger use cases where a hash would be more appropriate. For example users may want to use the 5-tuple of the socket as the lookup key. To support this add hash support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09bpf: xdp: allow offloads to store into rx_queue_indexJakub Kicinski1-1/+1
It's fairly easy for offloaded XDP programs to select the RX queue packets go to. We need a way of expressing this in the software. Allow write to the rx_queue_index field of struct xdp_md for device-bound programs. Skip convert_ctx_access callback entirely for offloads. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: fix references to free_bpf_prog_info() in commentsJakub Kicinski1-2/+2
Comments in the verifier refer to free_bpf_prog_info() which seems to have never existed in tree. Replace it with free_used_maps(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: replace map pointer loads before calling into offloadsJakub Kicinski1-5/+5
Offloads may find host map pointers more useful than map fds. Map pointers can be used to identify the map, while fds are only valid within the context of loading process. Jump to skip_full_check on error in case verifier log overflow has to be handled (replace_map_fd_with_map_ptr() prints to the log, driver prep may do that too in the future). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: add faked "ending" subprogJiong Wang1-20/+14
There are quite a few code snippet like the following in verifier: subprog_start = 0; if (env->subprog_cnt == cur_subprog + 1) subprog_end = insn_cnt; else subprog_end = env->subprog_info[cur_subprog + 1].start; The reason is there is no marker in subprog_info array to tell the end of it. We could resolve this issue by introducing a faked "ending" subprog. The special "ending" subprog is with "insn_cnt" as start offset, so it is serving as the end mark whenever we iterate over all subprogs. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: centre subprog information fieldsJiong Wang1-30/+32
It is better to centre all subprog information fields into one structure. This structure could later serve as function node in call graph. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: unify main prog and subprogJiong Wang1-26/+31
Currently, verifier treat main prog and subprog differently. All subprogs detected are kept in env->subprog_starts while main prog is not kept there. Instead, main prog is implicitly defined as the prog start at 0. There is actually no difference between main prog and subprog, it is better to unify them, and register all progs detected into env->subprog_starts. This could also help simplifying some code logic. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-03bpf: implement ld_abs/ld_ind in native bpfDaniel Borkmann1-0/+24
The main part of this work is to finally allow removal of LD_ABS and LD_IND from the BPF core by reimplementing them through native eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and keeping them around in native eBPF caused way more trouble than actually worth it. To just list some of the security issues in the past: * fdfaf64e7539 ("x86: bpf_jit: support negative offsets") * 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets") * e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler") * 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call") * 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context") * 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context") For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy these days due to their limitations and more efficient/flexible alternatives that have been developed over time such as direct packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a register, the load happens in host endianness and its exception handling can yield unexpected behavior. The latter is explained in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by div/mod by 0 exception") with similar cases of exceptions we had. In native eBPF more recent program types will disable LD_ABS/LD_IND altogether through may_access_skb() in verifier, and given the limitations in terms of exception handling, it's also disabled in programs that use BPF to BPF calls. In terms of cBPF, the LD_ABS/LD_IND is used in networking programs to access packet data. It is not used in seccomp-BPF but programs that use it for socket filtering or reuseport for demuxing with cBPF. This is mostly relevant for applications that have not yet migrated to native eBPF. The main complexity and source of bugs in LD_ABS/LD_IND is coming from their implementation in the various JITs. Most of them keep the model around from cBPF times by implementing a fastpath written in asm. They use typically two from the BPF program hidden CPU registers for caching the skb's headlen (skb->len - skb->data_len) and skb->data. Throughout the JIT phase this requires to keep track whether LD_ABS/LD_IND are used and if so, the two registers need to be recached each time a BPF helper would change the underlying packet data in native eBPF case. At least in eBPF case, available CPU registers are rare and the additional exit path out of the asm written JIT helper makes it also inflexible since not all parts of the JITer are in control from plain C. A LD_ABS/LD_IND implementation in eBPF therefore allows to significantly reduce the complexity in JITs with comparable performance results for them, e.g.: test_bpf tcpdump port 22 tcpdump complex x64 - before 15 21 10 14 19 18 - after 7 10 10 7 10 15 arm64 - before 40 91 92 40 91 151 - after 51 64 73 51 62 113 For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter() and cache the skb's headlen and data in the cBPF prologue. The BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just used as a local temporary variable. This allows to shrink the image on x86_64 also for seccomp programs slightly since mapping to %rsi is not an ereg. In callee-saved R8 and R9 we now track skb data and headlen, respectively. For normal prologue emission in the JITs this does not add any extra instructions since R8, R9 are pushed to stack in any case from eBPF side. cBPF uses the convert_bpf_ld_abs() emitter which probes the fast path inline already and falls back to bpf_skb_load_helper_{8,16,32}() helper relying on the cached skb data and headlen as well. R8 and R9 never need to be reloaded due to bpf_helper_changes_pkt_data() since all skb access in cBPF is read-only. Then, for the case of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally, does neither cache skb data and headlen nor has an inlined fast path. The reason for the latter is that native eBPF does not have any extra registers available anyway, but even if there were, it avoids any reload of skb data and headlen in the first place. Additionally, for the negative offsets, we provide an alternative bpf_skb_load_bytes_relative() helper in eBPF which operates similarly as bpf_skb_load_bytes() and allows for more flexibility. Tested myself on x64, arm64, s390x, from Sandipan on ppc64. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-03bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAPBjörn Töpel1-2/+6
The xskmap is yet another BPF map, very much inspired by dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application adds AF_XDP sockets into the map, and by using the bpf_redirect_map helper, an XDP program can redirect XDP frames to an AF_XDP socket. Note that a socket that is bound to certain ifindex/queue index will *only* accept XDP frames from that netdev/queue index. If an XDP program tries to redirect from a netdev/queue index other than what the socket is bound to, the frame will not be received on the socket. A socket can reside in multiple maps. v3: Fixed race and simplified code. v2: Removed one indirection in map lookup. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29bpf/verifier: improve register value range tracking with ARSHYonghong Song1-0/+23
When helpers like bpf_get_stack returns an int value and later on used for arithmetic computation, the LSH and ARSH operations are often required to get proper sign extension into 64-bit. For example, without this patch: 54: R0=inv(id=0,umax_value=800) 54: (bf) r8 = r0 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800) 55: (67) r8 <<= 32 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000)) 56: (c7) r8 s>>= 32 57: R8=inv(id=0) With this patch: 54: R0=inv(id=0,umax_value=800) 54: (bf) r8 = r0 55: R0=inv(id=0,umax_value=800) R8_w=inv(id=0,umax_value=800) 55: (67) r8 <<= 32 56: R8_w=inv(id=0,umax_value=3435973836800,var_off=(0x0; 0x3ff00000000)) 56: (c7) r8 s>>= 32 57: R8=inv(id=0, umax_value=800,var_off=(0x0; 0x3ff)) With better range of "R8", later on when "R8" is added to other register, e.g., a map pointer or scalar-value register, the better register range can be derived and verifier failure may be avoided. In our later example, ...... usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK); if (usize < 0) return 0; ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0); ...... Without improving ARSH value range tracking, the register representing "max_len - usize" will have smin_value equal to S64_MIN and will be rejected by verifier. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29bpf: remove never-hit branches in verifier adjust_scalar_min_max_valsYonghong Song1-9/+2
In verifier function adjust_scalar_min_max_vals, when src_known is false and the opcode is BPF_LSH/BPF_RSH, early return will happen in the function. So remove the branch in handling BPF_LSH/BPF_RSH when src_known is false. Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29bpf/verifier: refine retval R0 state for bpf_get_stack helperYonghong Song1-0/+27
The special property of return values for helpers bpf_get_stack and bpf_probe_read_str are captured in verifier. Both helpers return a negative error code or a length, which is equal to or smaller than the buffer size argument. This additional information in the verifier can avoid the condition such as "retval > bufsize" in the bpf program. For example, for the code blow, usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK); if (usize < 0 || usize > max_len) return 0; The verifier may have the following errors: 52: (85) call bpf_get_stack#65 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_w=inv800 R10=fp0,call_-1 53: (bf) r8 = r0 54: (bf) r1 = r8 55: (67) r1 <<= 32 56: (bf) r2 = r1 57: (77) r2 >>= 32 58: (25) if r2 > 0x31f goto pc+33 R0=inv(id=0) R1=inv(id=0,smax_value=9223372032559808512, umax_value=18446744069414584320, var_off=(0x0; 0xffffffff00000000)) R2=inv(id=0,umax_value=799,var_off=(0x0; 0x3ff)) R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8=inv(id=0) R9=inv800 R10=fp0,call_-1 59: (1f) r9 -= r8 60: (c7) r1 s>>= 32 61: (bf) r2 = r7 62: (0f) r2 += r1 math between map_value pointer and register with unbounded min value is not allowed The failure is due to llvm compiler optimization where register "r2", which is a copy of "r1", is tested for condition while later on "r1" is used for map_ptr operation. The verifier is not able to track such inst sequence effectively. Without the "usize > max_len" condition, there is no llvm optimization and the below generated code passed verifier: 52: (85) call bpf_get_stack#65 R0=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R1_w=ctx(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R3_w=inv800 R4_w=inv256 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R9_w=inv800 R10=fp0,call_-1 53: (b7) r1 = 0 54: (bf) r8 = r0 55: (67) r8 <<= 32 56: (c7) r8 s>>= 32 57: (6d) if r1 s> r8 goto pc+24 R0=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R1=inv0 R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0) R8=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff)) R9=inv800 R10=fp0,call_-1 58: (bf) r2 = r7 59: (0f) r2 += r8 60: (1f) r9 -= r8 61: (bf) r1 = r6 Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-29bpf: add bpf_get_stack helperYonghong Song1-0/+19
Currently, stackmap and bpf_get_stackid helper are provided for bpf program to get the stack trace. This approach has a limitation though. If two stack traces have the same hash, only one will get stored in the stackmap table, so some stack traces are missing from user perspective. This patch implements a new helper, bpf_get_stack, will send stack traces directly to bpf program. The bpf program is able to see all stack traces, and then can do in-kernel processing or send stack traces to user space through shared map or bpf_perf_event_output. Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-04-24bpf: allow map helpers access to map values directlyPaul Chaignon1-17/+7
Helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE can only access stack and packet memory. Allow these helpers to directly access map values by passing registers of type PTR_TO_MAP_VALUE. This change removes the need for an extra copy to the stack when using a map value to perform a second map lookup, as in the following: struct bpf_map_def SEC("maps") infobyreq = { .type = BPF_MAP_TYPE_HASHMAP, .key_size = sizeof(struct request *), .value_size = sizeof(struct info_t), .max_entries = 1024, }; struct bpf_map_def SEC("maps") counts = { .type = BPF_MAP_TYPE_HASHMAP, .key_size = sizeof(struct info_t), .value_size = sizeof(u64), .max_entries = 1024, }; SEC("kprobe/blk_account_io_start") int bpf_blk_account_io_start(struct pt_regs *ctx) { struct info_t *info = bpf_map_lookup_elem(&infobyreq, &ctx->di); u64 *count = bpf_map_lookup_elem(&counts, info); (*count)++; } Signed-off-by: Paul Chaignon <paul.chaignon@orange.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31bpf: Hooks for sys_bindAndrey Ignatov1-0/+1
== The problem == There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible. Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever TCP/UDP server calls `bind(2)`, the library replaces IP in sockaddr before passing arguments to syscall. When application calls `connect(2)` the library transparently binds the local end of connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty). Shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept. == The solution == The patch provides much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on desired IP. It does not depend on application environment and implementation details (whether glibc is used or not). It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`). The new program type is intended to be used with sockets (`struct sock`) in a cgroup and provided by user `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of newly added types. The new attach types provides hooks in `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override IP addresses and ports user program tries to bind to and apply such a program for whole cgroup. == Implementation notes == [1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for corresponding socket family. E.g. if user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields. [2] The write access to `struct bpf_sock_addr_kern` is implemented using special field as an additional "register". There are just two registers in `sock_addr_convert_ctx_access`: `src` with value to write and `dst` with pointer to context that can't be changed not to break later instructions. But the fields, allowed to write to, are not available directly and to access them address of corresponding pointer has to be loaded first. To get additional register the 1st not used by `src` and `dst` one is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load address of pointer field, and finally the register's content is restored from the temporary field after writing `src` value. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31bpf: Check attach type at prog load timeAndrey Ignatov1-3/+3
== The problem == There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers. E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack. Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type. == The solution == Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly. Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-26bpf: Add bpf_verifier_vlog() and bpf_verifier_log_needed()Martin KaFai Lau1-8/+11
The BTF (BPF Type Format) verifier needs to reuse the current BPF verifier log. Hence, it requires the following changes: (1) Expose log_write() in verifier.c for other users. Its name is renamed to bpf_verifier_vlog(). (2) The BTF verifier also needs to check 'log->level && log->ubuf && !bpf_verifier_log_full(log);' independently outside of the current log_write(). It is because the BTF verifier will do one-check before making multiple calls to btf_verifier_vlog to log the details of a type. Hence, this check is also re-factored to a new function bpf_verifier_log_needed(). Since it is re-factored, we can check it before va_start() in the current bpf_verifier_log_write() and verbose(). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-26bpf: Rename bpf_verifer_logMartin KaFai Lau1-2/+2
bpf_verifer_log => bpf_verifier_log Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-23bpf: Remove struct bpf_verifier_env argument from print_bpf_insnJiri Olsa1-17/+27
We use print_bpf_insn in user space (bpftool and soon perf), so it'd be nice to keep it generic and strip it off the kernel struct bpf_verifier_env argument. This argument can be safely removed, because its users can use the struct bpf_insn_cbs::private_data to pass it. By changing the argument type we can no longer have clean 'verbose' alias to 'bpf_verifier_log_write' in verifier.c. Instead we're adding the 'verbose' cb_print callback and removing the alias. This way we have new cb_print callback in place, and all the 'verbose(env, ...) calls in verifier.c will cleanly cast to 'verbose(void *, ...)' so no other change is needed. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX dataJohn Fastabend1-1/+4
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages when a sock is added to a sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT program type attached then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in sendmsg case and page/offset in sendpage case. BPF_PROG_TYPE_SK_MSG Semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and SK_DROP. Returning SK_DROP free's the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACESS to the user. Returning SK_PASS will allow the msg to be sent. In the sendmsg case data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element. In the sendpage case data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed. So copying all bytes may be wasteful. Because of this the initial start/end data pointers are (0,0). Meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call. The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently being processed message. The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similar to existing redirect use cases. This allows policy to redirect msgs. Pseudo code simple example: The basic logic to attach a program to a socket is as follows, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); Adding sockets to the map is done in the normal way, // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples. Implementation notes: It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method. Another item that will come after basic support is in place is supporting MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series shortly. Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-16/+26
All of the conflicts were cases of overlapping changes. In net/core/devlink.c, we have to make care that the resouce size_params have become a struct member rather than a pointer to such an object. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23bpf: allow xadd only on aligned memoryDaniel Borkmann1-16/+26
The requirements around atomic_add() / atomic64_add() resp. their JIT implementations differ across architectures. E.g. while x86_64 seems just fine with BPF's xadd on unaligned memory, on arm64 it triggers via interpreter but also JIT the following crash: [ 830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703 [...] [ 830.916161] Internal error: Oops: 96000021 [#1] SMP [ 830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8 [ 830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017 [ 830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO) [ 831.003793] pc : __ll_sc_atomic_add+0x4/0x18 [ 831.008055] lr : ___bpf_prog_run+0x1198/0x1588 [ 831.012485] sp : ffff00001ccabc20 [ 831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00 [ 831.021087] x27: 0000000000000001 x26: 0000000000000000 [ 831.026387] x25: 000000c168d9db98 x24: 0000000000000000 [ 831.031686] x23: ffff000008203878 x22: ffff000009488000 [ 831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0 [ 831.042286] x19: ffff0000097b5080 x18: 0000000000000a03 [ 831.047585] x17: 0000000000000000 x16: 0000000000000000 [ 831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000 [ 831.058184] x13: 0000000000000000 x12: 0000000000000000 [ 831.063484] x11: 0000000000000001 x10: 0000000000000000 [ 831.068783] x9 : 0000000000000000 x8 : 0000000000000000 [ 831.074083] x7 : 0000000000000000 x6 : 000580d428000000 [ 831.079383] x5 : 0000000000000018 x4 : 0000000000000000 [ 831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001 [ 831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001 [ 831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044) [ 831.102577] Call trace: [ 831.105012] __ll_sc_atomic_add+0x4/0x18 [ 831.108923] __bpf_prog_run32+0x4c/0x70 [ 831.112748] bpf_test_run+0x78/0xf8 [ 831.116224] bpf_prog_test_run_xdp+0xb4/0x120 [ 831.120567] SyS_bpf+0x77c/0x1110 [ 831.123873] el0_svc_naked+0x30/0x34 [ 831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31) Reason for this is because memory is required to be aligned. In case of BPF, we always enforce alignment in terms of stack access, but not when accessing map values or packet data when the underlying arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set. xadd on packet data that is local to us anyway is just wrong, so forbid this case entirely. The only place where xadd makes sense in fact are map values; xadd on stack is wrong as well, but it's been around for much longer. Specifically enforce strict alignment in case of xadd, so that we handle this case generically and avoid such crashes in the first place. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-02-15bpf: Remove unused callee_saved arrayJoe Stringer1-4/+0
This array appears to be completely unused, remove it. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-26bpf: fix subprog verifier bypass by div/mod by 0 exceptionDaniel Borkmann1-8/+30
One of the ugly leftovers from the early eBPF days is that div/mod operations based on registers have a hard-coded src_reg == 0 test in the interpreter as well as in JIT code generators that would return from the BPF program with exit code 0. This was basically adopted from cBPF interpreter for historical reasons. There are multiple reasons why this is very suboptimal and prone to bugs. To name one: the return code mapping for such abnormal program exit of 0 does not always match with a suitable program type's exit code mapping. For example, '0' in tc means action 'ok' where the packet gets passed further up the stack, which is just undesirable for such cases (e.g. when implementing policy) and also does not match with other program types. While trying to work out an exception handling scheme, I also noticed that programs crafted like the following will currently pass the verifier: 0: (bf) r6 = r1 1: (85) call pc+8 caller: R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 callee: frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1 10: (b4) (u32) r2 = (u32) 0 11: (b4) (u32) r3 = (u32) 1 12: (3c) (u32) r3 /= (u32) r2 13: (61) r0 = *(u32 *)(r1 +76) 14: (95) exit returning from callee: frame1: R0_w=pkt(id=0,off=0,r=0,imm=0) R1=ctx(id=0,off=0,imm=0) R2_w=inv0 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0,call_1 to caller at 2: R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 2: (bf) r1 = r6 3: (61) r1 = *(u32 *)(r1 +80) 4: (bf) r2 = r0 5: (07) r2 += 8 6: (2d) if r2 > r1 goto pc+1 R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 7: (71) r0 = *(u8 *)(r0 +0) 8: (b7) r0 = 1 9: (95) exit from 6 to 8: safe processed 16 insns (limit 131072), stack depth 0+0 Basically what happens is that in the subprog we make use of a div/mod by 0 exception and in the 'normal' subprog's exit path we just return skb->data back to the main prog. This has the implication that the verifier thinks we always get a pkt pointer in R0 while we still have the implicit 'return 0' from the div as an alternative unconditional return path earlier. Thus, R0 then contains 0, meaning back in the parent prog we get the address range of [0x0, skb->data_end] as read and writeable. Similar can be crafted with other pointer register types. Since i) BPF_ABS/IND is not allowed in programs that contain BPF to BPF calls (and generally it's also disadvised to use in native eBPF context), ii) unknown opcodes don't return zero anymore, iii) we don't return an exception code in dead branches, the only last missing case affected and to fix is the div/mod handling. What we would really need is some infrastructure to propagate exceptions all the way to the original prog unwinding the current stack and returning that code to the caller of the BPF program. In user space such exception handling for similar runtimes is typically implemented with setjmp(3) and longjmp(3) as one possibility which is not available in the kernel, though (kgdb used to implement it in kernel long time ago). I implemented a PoC exception handling mechanism into the BPF interpreter with porting setjmp()/longjmp() into x86_64 and adding a new internal BPF_ABRT opcode that can use a program specific exception code for all exception cases we have (e.g. div/mod by 0, unknown opcodes, etc). While this seems to work in the constrained BPF environment (meaning, here, we don't need to deal with state e.g. from memory allocations that we would need to undo before going into exception state), it still has various drawbacks: i) we would need to implement the setjmp()/longjmp() for every arch supported in the kernel and for x86_64, arm64, sparc64 JITs currently supporting calls, ii) it has unconditional additional cost on main program entry to store CPU register state in initial setjmp() call, and we would need some way to pass the jmp_buf down into ___bpf_prog_run() for main prog and all subprogs, but also storing on stack is not really nice (other option would be per-cpu storage for this, but it also has the drawback that we need to disable preemption for every BPF program types). All in all this approach would add a lot of complexity. Another poor-man's solution would be to have some sort of additional shared register or scratch buffer to hold state for exceptions, and test that after every call return to chain returns and pass R0 all the way down to BPF prog caller. This is also problematic in various ways: i) an additional register doesn't map well into JITs, and some other scratch space could only be on per-cpu storage, which, again has the side-effect that this only works when we disable preemption, or somewhere in the input context which is not available everywhere either, and ii) this adds significant runtime overhead by putting conditionals after each and every call, as well as implementation complexity. Yet another option is to teach verifier that div/mod can return an integer, which however is also complex to implement as verifier would need to walk such fake 'mov r0,<code>; exit;' sequeuence and there would still be no guarantee for having propagation of this further down to the BPF caller as proper exception code. For parent prog, it is also is not distinguishable from a normal return of a constant scalar value. The approach taken here is a completely different one with little complexity and no additional overhead involved in that we make use of the fact that a div/mod by 0 is undefined behavior. Instead of bailing out, we adapt the same behavior as on some major archs like ARMv8 [0] into eBPF as well: X div 0 results in 0, and X mod 0 results in X. aarch64 and aarch32 ISA do not generate any traps or otherwise aborts of program execution for unsigned divides. I verified this also with a test program compiled by gcc and clang, and the behavior matches with the spec. Going forward we adapt the eBPF verifier to emit such rewrites once div/mod by register was seen. cBPF is not touched and will keep existing 'return 0' semantics. Given the options, it seems the most suitable from all of them, also since major archs have similar schemes in place. Given this is all in the realm of undefined behavior, we still have the option to adapt if deemed necessary and this way we would also have the option of more flexibility from LLVM code generation side (which is then fully visible to verifier). Thus, this patch i) fixes the panic seen in above program and ii) doesn't bypass the verifier observations. [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf 1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV) "A division by zero results in a zero being written to the destination register, without any indication that the division by zero occurred." 2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV) "For the SDIV and UDIV instructions, division by zero always returns a zero result." Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-26bpf: make unknown opcode handling more robustDaniel Borkmann1-0/+7
Recent findings by syzcaller fixed in 7891a87efc71 ("bpf: arsh is not supported in 32 bit alu thus reject it") triggered a warning in the interpreter due to unknown opcode not being rejected by the verifier. The 'return 0' for an unknown opcode is really not optimal, since with BPF to BPF calls, this would go untracked by the verifier. Do two things here to improve the situation: i) perform basic insn sanity check early on in the verification phase and reject every non-uapi insn right there. The bpf_opcode_in_insntable() table reuses the same mapping as the jumptable in ___bpf_prog_run() sans the non-public mappings. And ii) in ___bpf_prog_run() we do need to BUG in the case where the verifier would ever create an unknown opcode due to some rewrites. Note that JITs do not have such issues since they would punt to interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON would also help to avoid such unknown opcodes in the first place. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-26bpf: improve dead code sanitizingDaniel Borkmann1-5/+12
Given we recently had c131187db2d3 ("bpf: fix branch pruning logic") and 95a762e2c8c9 ("bpf: fix incorrect sign extension in check_alu_op()") in particular where before verifier skipped verification of the wrongly assumed dead branch, we should not just replace the dead code parts with nops (mov r0,r0). If there is a bug such as fixed in 95a762e2c8c9 in future again, where runtime could execute those insns, then one of the potential issues with the current setting would be that given the nops would be at the end of the program, we could execute out of bounds at some point. The best in such case would be to just exit the BPF program altogether and return an exception code. However, given this would require two instructions, and such a dead code gap could just be a single insn long, we would need to place 'r0 = X; ret' snippet at the very end after the user program or at the start before the program (where we'd skip that region on prog entry), and then place unconditional ja's into the dead code gap. While more complex but possible, there's still another block in the road that currently prevents from this, namely BPF to BPF calls. The issue here is that such exception could be returned from a callee, but the caller would not know that it's an exception that needs to be propagated further down. Alternative that has little complexity is to just use a ja-1 code for now which will trap the execution here instead of silently doing bad things if we ever get there due to bugs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller1-25/+57
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-01-19 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) bpf array map HW offload, from Jakub. 2) support for bpf_get_next_key() for LPM map, from Yonghong. 3) test_verifier now runs loaded programs, from Alexei. 4) xdp cpumap monitoring, from Jesper. 5) variety of tests, cleanups and small x64 JIT optimization, from Daniel. 6) user space can now retrieve HW JITed program, from Jiong. Note there is a minor conflict between Russell's arm32 JIT fixes and removal of bpf_jit_enable variable by Daniel which should be resolved by keeping Russell's comment and removing that variable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-11/+53
The BPF verifier conflict was some minor contextual issue. The TUN conflict was less trivial. Cong Wang fixed a memory leak of tfile->tx_array in 'net'. This is an skb_array. But meanwhile in net-next tun changed tfile->tx_arry into tfile->tx_ring which is a ptr_ring. Signed-off-by: David S. Miller <davem@davemloft.net>