summaryrefslogtreecommitdiffstats
path: root/tools
AgeCommit message (Collapse)AuthorFilesLines
2017-01-24bpf: enable verifier to better track const alu opsDaniel Borkmann1-0/+82
William reported couple of issues in relation to direct packet access. Typical scheme is to check for data + [off] <= data_end, where [off] can be either immediate or coming from a tracked register that contains an immediate, depending on the branch, we can then access the data. However, in case of calculating [off] for either the mentioned test itself or for access after the test in a more "complex" way, then the verifier will stop tracking the CONST_IMM marked register and will mark it as UNKNOWN_VALUE one. Adding that UNKNOWN_VALUE typed register to a pkt() marked register, the verifier then bails out in check_packet_ptr_add() as it finds the registers imm value below 48. In the first below example, that is due to evaluate_reg_imm_alu() not handling right shifts and thus marking the register as UNKNOWN_VALUE via helper __mark_reg_unknown_value() that resets imm to 0. In the second case the same happens at the time when r4 is set to r4 &= r5, where it transitions to UNKNOWN_VALUE from evaluate_reg_imm_alu(). Later on r4 we shift right by 3 inside evaluate_reg_alu(), where the register's imm turns into 3. That is, for registers with type UNKNOWN_VALUE, imm of 0 means that we don't know what value the register has, and for imm > 0 it means that the value has [imm] upper zero bits. F.e. when shifting an UNKNOWN_VALUE register by 3 to the right, no matter what value it had, we know that the 3 upper most bits must be zero now. This is to make sure that ALU operations with unknown registers don't overflow. Meaning, once we know that we have more than 48 upper zero bits, or, in other words cannot go beyond 0xffff offset with ALU ops, such an addition will track the target register as a new pkt() register with a new id, but 0 offset and 0 range, so for that a new data/data_end test will be required. Is the source register a CONST_IMM one that is to be added to the pkt() register, or the source instruction is an add instruction with immediate value, then it will get added if it stays within max 0xffff bounds. >From there, pkt() type, can be accessed should reg->off + imm be within the access range of pkt(). [...] from 28 to 30: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=22) R2=pkt_end R3=imm144,min_value=144,max_value=144 R4=imm0,min_value=0,max_value=0 R5=inv48,min_value=2054,max_value=2054 R10=fp 30: (bf) r5 = r3 31: (07) r5 += 23 32: (77) r5 >>= 3 33: (bf) r6 = r1 34: (0f) r6 += r5 cannot add integer value with 0 upper zero bits to ptr_to_packet [...] from 52 to 80: R0=imm1,min_value=1,max_value=1 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=inv R4=imm272 R5=inv56,min_value=17,max_value=17 R6=pkt(id=0,off=26,r=34) R10=fp 80: (07) r4 += 71 81: (18) r5 = 0xfffffff8 83: (5f) r4 &= r5 84: (77) r4 >>= 3 85: (0f) r1 += r4 cannot add integer value with 3 upper zero bits to ptr_to_packet Thus to get above use-cases working, evaluate_reg_imm_alu() has been extended for further ALU ops. This is fine, because we only operate strictly within realm of CONST_IMM types, so here we don't care about overflows as they will happen in the simulated but also real execution and interaction with pkt() in check_packet_ptr_add() will check actual imm value once added to pkt(), but it's irrelevant before. With regards to 06c1c049721a ("bpf: allow helpers access to variable memory") that works on UNKNOWN_VALUE registers, the verifier becomes now a bit smarter as it can better resolve ALU ops, so we need to adapt two test cases there, as min/max bound tracking only becomes necessary when registers were spilled to stack. So while mask was set before to track upper bound for UNKNOWN_VALUE case, it's now resolved directly as CONST_IMM, and such contructs are only necessary when f.e. registers are spilled. For commit 6b17387307ba ("bpf: recognize 64bit immediate loads as consts") that initially enabled dw load tracking only for nfp jit/ analyzer, I did couple of tests on large, complex programs and we don't increase complexity badly (my tests were in ~3% range on avg). I've added a couple of tests similar to affected code above, and it works fine with verifier now. Reported-by: William Tu <u9012063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Gianluca Borello <g.borello@gmail.com> Cc: William Tu <u9012063@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: add prog tag test case to bpf selftestsDaniel Borkmann2-2/+204
Add the test case used to compare the results from fdinfo with af_alg's output on the tag. Tests are from min to max sized programs, with and without maps included. # ./test_tag test_tag: OK (40945 tests) Tested on x86_64 and s390x. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-23bpf: Add tests for the lpm trie mapDavid Herrmann3-2/+361
The first part of this program runs randomized tests against the lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm) based on simple, linear, single linked lists. The implementation should be pretty straightforward. Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies the trie-based bpf-map implementation behaves the same way as tlpm. The second part uses 'real world' IPv4 and IPv6 addresses and tests the trie with those. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-45/+107
2017-01-15Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds9-45/+107
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc race fixes uncovered by fuzzing efforts, a Sparse fix, two PMU driver fixes, plus miscellanous tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Reject non sampling events with precise_ip perf/x86/intel: Account interrupts for PEBS errors perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race perf/core: Fix sys_perf_event_open() vs. hotplug perf/x86/intel: Use ULL constant to prevent undefined shift behaviour perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code perf/x86: Set pmu->module in Intel PMU modules perf probe: Fix to probe on gcc generated symbols for offline kernel perf probe: Fix --funcs to show correct symbols for offline module perf symbols: Robustify reading of build-id from sysfs perf tools: Install tools/lib/traceevent plugins with install-bin tools lib traceevent: Fix prev/next_prio for deadline tasks perf record: Fix --switch-output documentation and comment perf record: Make __record_options static tools lib subcmd: Add OPT_STRING_OPTARG_SET option perf probe: Fix to get correct modname from elf header samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include samples/bpf sock_example: Avoid getting ethhdr from two includes perf sched timehist: Show total scheduling time
2017-01-12tools: psock_lib: harden socket filter used by psock testsSowmini Varadhan1-7/+32
The filter added by sock_setfilter is intended to only permit packets matching the pattern set up by create_payload(), but we only check the ip_len, and a single test-character in the IP packet to ensure this condition. Harden the filter by adding additional constraints so that we only permit UDP/IPv4 packets that meet the ip_len and test-character requirements. Include the bpf_asm src as a comment, in case this needs to be enhanced in the future Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-12bpf: allow b/h/w/dw access for bpf's cb in ctxDaniel Borkmann1-3/+439
When structs are used to store temporary state in cb[] buffer that is used with programs and among tail calls, then the generated code will not always access the buffer in bpf_w chunks. We can ease programming of it and let this act more natural by allowing for aligned b/h/w/dw sized access for cb[] ctx member. Various test cases are attached as well for the selftest suite. Potentially, this can also be reused for other program types to pass data around. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-5/+4
Two AF_* families adding entries to the lockdep tables at the same time. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+0
Merge fixes from Andrew Morton: "27 fixes. There are three patches that aren't actually fixes. They're simple function renamings which are nice-to-have in mainline as ongoing net development depends on them." * akpm: (27 commits) timerfd: export defines to userspace mm/hugetlb.c: fix reservation race when freeing surplus pages mm/slab.c: fix SLAB freelist randomization duplicate entries zram: support BDI_CAP_STABLE_WRITES zram: revalidate disk under init_lock mm: support anonymous stable page mm: add documentation for page fragment APIs mm: rename __page_frag functions to __page_frag_cache, drop order from drain mm: rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free mm, memcg: fix the active list aging for lowmem requests when memcg is enabled mm: don't dereference struct page fields of invalid pages mailmap: add codeaurora.org names for nameless email commits signal: protect SIGNAL_UNKILLABLE from unintentional clearing. mm: pmd dirty emulation in page fault handler ipc/sem.c: fix incorrect sem_lock pairing lib/Kconfig.debug: fix frv build failure mm: get rid of __GFP_OTHER_NODE mm: fix remote numa hits statistics mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done} ocfs2: fix crash caused by stale lvb with fsdlm plugin ...
2017-01-10mm: get rid of __GFP_OTHER_NODEMichal Hocko1-1/+0
The flag was introduced by commit 78afd5612deb ("mm: add __GFP_OTHER_NODE flag") to allow proper accounting of remote node allocations done by kernel daemons on behalf of a process - e.g. khugepaged. After "mm: fix remote numa hits statistics" we do not need and actually use the flag so we can safely remove it because all allocations which are satisfied from their "home" node are accounted properly. [mhocko@suse.com: fix build] Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-09bpf: allow helpers access to variable memoryGianluca Borello1-0/+410
Currently, helpers that read and write from/to the stack can do so using a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE. ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so that the verifier can safely check the memory access. However, requiring the argument to be a constant can be limiting in some circumstances. Since the current logic keeps track of the minimum and maximum value of a register throughout the simulated execution, ARG_CONST_STACK_SIZE can be changed to also accept an UNKNOWN_VALUE register in case its boundaries have been set and the range doesn't cause invalid memory accesses. One common situation when this is useful: int len; char buf[BUFSIZE]; /* BUFSIZE is 128 */ if (some_condition) len = 42; else len = 84; some_helper(..., buf, len & (BUFSIZE - 1)); The compiler can often decide to assign the constant values 42 or 48 into a variable on the stack, instead of keeping it in a register. When the variable is then read back from stack into the register in order to be passed to the helper, the verifier will not be able to recognize the register as constant (the verifier is not currently tracking all constant writes into memory), and the program won't be valid. However, by allowing the helper to accept an UNKNOWN_VALUE register, this program will work because the bitwise AND operation will set the range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE), so the verifier can guarantee the helper call will be safe (assuming the argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more check against 0 would be needed). Custom ranges can be set not only with ALU operations, but also by explicitly comparing the UNKNOWN_VALUE register with constants. Another very common example happens when intercepting system call arguments and accessing user-provided data of variable size using bpf_probe_read(). One can load at runtime the user-provided length in an UNKNOWN_VALUE register, and then read that exact amount of data up to a compile-time determined limit in order to fit into the proper local storage allocated on the stack, without having to guess a suboptimal access size at compile time. Also, in case the helpers accepting the UNKNOWN_VALUE register operate in raw mode, disable the raw mode so that the program is required to initialize all memory, since there is no guarantee the helper will fill it completely, leaving possibilities for data leak (just relevant when the memory used by the helper is the stack, not when using a pointer to map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will be treated as ARG_PTR_TO_STACK. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09bpf: allow adjusted map element values to spillGianluca Borello1-0/+46
commit 484611357c19 ("bpf: allow access into map value arrays") introduces the ability to do pointer math inside a map element value via the PTR_TO_MAP_VALUE_ADJ register type. The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ is spilled into the stack, limiting several use cases, especially when generating bpf code from a compiler. Handle this case by explicitly enabling the register type PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and max_value are reset just for BPF_LDX operations that don't result in a restore of a spilled register from stack. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09bpf: allow helpers access to map element valuesGianluca Borello1-0/+491
Enable helpers to directly access a map element value by passing a register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK. This enables several use cases. For example, a typical tracing program might want to capture pathnames passed to sys_open() with: struct trace_data { char pathname[PATHLEN]; }; SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs *ctx) { struct trace_data data; bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di); /* consume data.pathname, for example via * bpf_trace_printk() or bpf_perf_event_output() */ } Such a program could easily hit the stack limit in case PATHLEN needs to be large or more local variables need to exist, both of which are quite common scenarios. Allowing direct helper access to map element values, one could do: struct bpf_map_def SEC("maps") scratch_map = { .type = BPF_MAP_TYPE_PERCPU_ARRAY, .key_size = sizeof(u32), .value_size = sizeof(struct trace_data), .max_entries = 1, }; SEC("kprobe/sys_open") int bpf_sys_open(struct pt_regs *ctx) { int id = 0; struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id); if (!p) return; bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di); /* consume p->pathname, for example via * bpf_trace_printk() or bpf_perf_event_output() */ } And wouldn't risk exhausting the stack. Code changes are loosely modeled after commit 6841de8b0d03 ("bpf: allow helpers access the packet directly"). Unlike with PTR_TO_PACKET, these changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be trivial, but since there is not currently a use case for that, it's reasonable to limit the set of changes. Also, add new tests to make sure accesses to map element values from helpers never go out of boundary, even when adjusted. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-05selftests: x86/pkeys: fix spelling mistake: "itertation" -> "iteration"Colin King1-1/+1
Fix spelling mistake in print test pass message. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-01-05selftests: do not require bash to run netsocktests testcaseRolf Eike Beer1-1/+1
Nothing in this minimal script seems to require bash. We often run these tests on embedded devices where the only shell available is the busybox ash. Use sh instead. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Cc: stable@vger.kernel.org Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-01-05selftests: do not require bash to run bpf testsRolf Eike Beer1-1/+1
Nothing in this minimal script seems to require bash. We often run these tests on embedded devices where the only shell available is the busybox ash. Use sh instead. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Cc: stable@vger.kernel.org Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-01-05selftests: do not require bash for the generated testRolf Eike Beer1-1/+1
Nothing in this minimal script seems to require bash. We often run these tests on embedded devices where the only shell available is the busybox ash. Use sh instead. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Cc: stable@vger.kernel.org Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
2017-01-05tools: psock_tpacket: block Rx until socket filter has been added and socket ↵Sowmini Varadhan1-3/+3
has been bound to loopback. Packets from any/all interfaces may be queued up on the PF_PACKET socket before it is bound to the loopback interface by psock_tpacket, and when these are passed up by the kernel, they could interfere with the Rx tests. Avoid interference from spurious packet by blocking Rx until the socket filter has been set up, and the packet has been bound to the desired (lo) interface. The effective sequence is socket(PF_PACKET, SOCK_RAW, 0); set up ring Invoke SO_ATTACH_FILTER bind to sll_protocol set to ETH_P_ALL, sll_ifindex for lo After this sequence, the only packets that will be passed up are those received on loopback that pass the attached filter. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-05Merge tag 'perf-urgent-for-mingo-4.10-20170104' of ↵Ingo Molnar9-45/+107
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes and one improvement from Arnaldo Carvalho de Melo: Fixes: - Fix prev/next_prio formatting for deadline tasks in libtraceevent (Daniel Bristot de Oliveira) - Robustify reading of build-ids from /sys/kernel/note (Arnaldo Carvalho de Melo) - Fix building some sample/bpf in Alpine Linux 3.4 (Arnaldo Carvalho de Melo) - Fix 'make install-bin' to install libtraceevent plugins (Arnaldo Carvalho de Melo) - Fix 'perf record --switch-output' documentation and comment (Jiri Olsa) - Fix 'perf probe' for cross arch probing (Masami Hiramatsu) Improvement: - Show total scheduling time in 'perf sched timehist' (Namhyumg Kim) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-04perf probe: Fix to probe on gcc generated symbols for offline kernelMasami Hiramatsu1-1/+47
Fix perf-probe to show probe definition on gcc generated symbols for offline kernel (including cross-arch kernel image). gcc sometimes optimizes functions and generate new symbols with suffixes such as ".constprop.N" or ".isra.N" etc. Since those symbol names are not recorded in DWARF, we have to find correct generated symbols from offline ELF binary to probe on it (kallsyms doesn't correct it). For online kernel or uprobes we don't need it because those are rebased on _text, or a section relative address. E.g. Without this: $ perf probe -k build-arm/vmlinux -F __slab_alloc* __slab_alloc.constprop.9 $ perf probe -k build-arm/vmlinux -D __slab_alloc p:probe/__slab_alloc __slab_alloc+0 If you put above definition on target machine, it should fail because there is no __slab_alloc in kallsyms. With this fix, perf probe shows correct probe definition on __slab_alloc.constprop.9: $ perf probe -k build-arm/vmlinux -D __slab_alloc p:probe/__slab_alloc __slab_alloc.constprop.9+0 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/148350060434.19001.11864836288580083501.stgit@devbox Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-04perf probe: Fix --funcs to show correct symbols for offline moduleMasami Hiramatsu1-19/+6
Fix --funcs (-F) option to show correct symbols for offline module. Since previous perf-probe uses machine__findnew_module_map() for offline module, even if user passes a module file (with full path) which is for other architecture, perf-probe always tries to load symbol map for current kernel module. This fix uses dso__new_map() to load the map from given binary as same as a map for user applications. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/148350053478.19001.15435255244512631545.stgit@devbox Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03perf symbols: Robustify reading of build-id from sysfsArnaldo Carvalho de Melo1-0/+6
Markus reported that perf segfaults when reading /sys/kernel/notes from a kernel linked with GNU gold, due to what looks like a gold bug, so do some bounds checking to avoid crashing in that case. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Report-Link: http://lkml.kernel.org/r/20161219161821.GA294@x4 Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-ryhgs6a6jxvz207j2636w31c@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03perf tools: Install tools/lib/traceevent plugins with install-binArnaldo Carvalho de Melo1-2/+2
Those are binaries as well, so should be installed by: make -C tools/perf install-bin' too. Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-3841b37u05evxrs1igkyu6ks@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03tools lib traceevent: Fix prev/next_prio for deadline tasksDaniel Bristot de Oliveira1-2/+2
Currently, the sched:sched_switch tracepoint reports deadline tasks with priority -1. But when reading the trace via perf script I've got the following output: # ./d & # (d is a deadline task, see [1]) # perf record -e sched:sched_switch -a sleep 1 # perf script ... swapper 0 [000] 2146.962441: sched:sched_switch: swapper/0:0 [120] R ==> d:2593 [4294967295] d 2593 [000] 2146.972472: sched:sched_switch: d:2593 [4294967295] R ==> g:2590 [4294967295] The task d reports the wrong priority [4294967295]. This happens because the "int prio" is stored in an unsigned long long val. Although it is set as a %lld, as int is shorter than unsigned long long, trace_seq_printf prints it as a positive number. The fix is just to cast the val as an int, and print it as a %d, as in the sched:sched_switch tracepoint's "format". The output with the fix is: # ./d & # perf record -e sched:sched_switch -a sleep 1 # perf script ... swapper 0 [000] 4306.374037: sched:sched_switch: swapper/0:0 [120] R ==> d:10941 [-1] d 10941 [000] 4306.383823: sched:sched_switch: d:10941 [-1] R ==> swapper/0:0 [120] [1] d.c --- #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #include <linux/types.h> #include <linux/sched.h> struct sched_attr { __u32 size, sched_policy; __u64 sched_flags; __s32 sched_nice; __u32 sched_priority; __u64 sched_runtime, sched_deadline, sched_period; }; int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags) { return syscall(__NR_sched_setattr, pid, attr, flags); } int main(void) { struct sched_attr attr = { .size = sizeof(attr), .sched_policy = SCHED_DEADLINE, /* This creates a 10ms/30ms reservation */ .sched_runtime = 10 * 1000 * 1000, .sched_period = attr.sched_deadline = 30 * 1000 * 1000, }; if (sched_setattr(0, &attr, 0) < 0) { perror("sched_setattr"); return -1; } for(;;); } --- Committer notes: Got the program from the provided URL, http://bristot.me/lkml/d.c, trimmed it and included in the cset log above, so that we have everything needed to test it in one place. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/866ef75bcebf670ae91c6a96daa63597ba981f0d.1483443552.git.bristot@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03tools: test case for TPACKET_V3/TX_RING supportSowmini Varadhan1-17/+74
Add a test case and sample code for (TPACKET_V3, PACKET_TX_RING) Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03perf record: Fix --switch-output documentation and commentJiri Olsa2-1/+5
There's no --signal-trigger option, also adding the code comment into record man page. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-4-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03perf record: Make __record_options staticJiri Olsa1-1/+1
There's no need for this one to be global. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-3-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03tools lib subcmd: Add OPT_STRING_OPTARG_SET optionJiri Olsa2-0/+8
To allow string options with a default argument and variable set when the option is used. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-2-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-02perf probe: Fix to get correct modname from elf headerMasami Hiramatsu1-16/+16
Since 'perf probe' supports cross-arch probes, it is possible to analyze different arch kernel image which has different bits-per-long. In that case, it fails to get the module name because it uses the MOD_NAME_OFFSET macro based on the host machine bits-per-long, instead of the target arch bits-per-long. This fixes above issue by changing modname-offset based on the target archs bit width. This is ok because linux kernel uses LP64 model on 64bit arch. E.g. without this (on x86_64, and target module is arm32): $ perf probe -m build-arm/fs/configfs/configfs.ko -D configfs_lookup p:probe/configfs_lookup :configfs_lookup+0 ^-Here is an empty module name. With this fix, you can see correct module name: $ perf probe -m build-arm/fs/configfs/configfs.ko -D configfs_lookup p:probe/configfs_lookup configfs:configfs_lookup+0 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/148337043836.6752.383495516397005695.stgit@devbox Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-27perf sched timehist: Show total scheduling timeNamhyung Kim1-3/+14
Show length of analyzed sample time and rate of idle task running. This also takes care of time range given by --time option. $ perf sched timehist -sI | tail Samples do not have callchains. Idle stats: CPU 0 idle for 930.316 msec ( 92.93%) CPU 1 idle for 963.614 msec ( 96.25%) CPU 2 idle for 885.482 msec ( 88.45%) CPU 3 idle for 938.635 msec ( 93.76%) Total number of unique tasks: 118 Total number of context switches: 2337 Total run time (msec): 3718.048 Total scheduling time (msec): 1001.131 (x 4) Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-25Merge branch 'turbostat' of ↵Linus Torvalds3-371/+673
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull turbostat updates from Len Brown. * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: tools/power turbostat: remove obsolete -M, -m, -C, -c options tools/power turbostat: Make extensible via the --add parameter tools/power turbostat: Denverton uses a 25 MHz crystal, not 19.2 MHz tools/power turbostat: line up headers when -M is used tools/power turbostat: fix SKX PKG_CSTATE_LIMIT decoding tools/power turbostat: Support Knights Mill (KNM) tools/power turbostat: Display HWP OOB status tools/power turbostat: fix Denverton BCLK tools/power turbostat: use intel-family.h model strings tools/power/turbostat: Add Denverton RAPL support tools/power/turbostat: Add Denverton support tools/power/turbostat: split core MSR support into status + limit tools/power turbostat: fix error case overflow read of slm_freq_table[] tools/power turbostat: Allocate correct amount of fd and irq entries tools/power turbostat: switch to tab delimited output tools/power turbostat: Gracefully handle ACPI S3 tools/power turbostat: tidy up output on Joule counter overflow
2016-12-24tools/power turbostat: remove obsolete -M, -m, -C, -c optionsLen Brown2-110/+2
The new --add option has replaced the -M, -m, -C, -c options Eg. -M 0x10 is now --add msr0x10,raw -m 0x10 is now --add msr0x10,raw,u32 -C 0x10 is now --add msr0x10,delta -c 0x10 is now --add msr0x10,delta,u32 The --add option can be repeated to add any number of counters, while the previous options were limited to adding one of each type. In addition, the --add option can accept a column label, and can also display a counter as a percentage of elapsed cycles. Eg. --add msr0x3fe,core,percent,MY_CC3 Signed-off-by: Len Brown <len.brown@intel.com>
2016-12-24tools/power turbostat: Make extensible via the --add parameterLen Brown2-9/+409
Create the "--add" parameter. This can be used to teach an existing turbostat binary about any number of any type of counter. turbostat(8) details the syntax for --add. Signed-off-by: Len Brown <len.brown@intel.com>
2016-12-23Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds27-405/+865
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "On the kernel side there's two x86 PMU driver fixes and a uprobes fix, plus on the tooling side there's a number of fixes and some late updates" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) perf sched timehist: Fix invalid period calculation perf sched timehist: Remove hardcoded 'comm_width' check at print_summary perf sched timehist: Enlarge default 'comm_width' perf sched timehist: Honour 'comm_width' when aligning the headers perf/x86: Fix overlap counter scheduling bug perf/x86/pebs: Fix handling of PEBS buffer overflows samples/bpf: Move open_raw_sock to separate header samples/bpf: Remove perf_event_open() declaration samples/bpf: Be consistent with bpf_load_program bpf_insn parameter tools lib bpf: Add bpf_prog_{attach,detach} samples/bpf: Switch over to libbpf perf diff: Do not overwrite valid build id perf annotate: Don't throw error for zero length symbols perf bench futex: Fix lock-pi help string perf trace: Check if MAP_32BIT is defined (again) samples/bpf: Make perf_event_read() static uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation samples/bpf: Make samples more libbpf-centric tools lib bpf: Add flags to bpf_create_map() tools lib bpf: use __u32 from linux/types.h ...
2016-12-22perf sched timehist: Fix invalid period calculationNamhyung Kim1-1/+1
When --time option is given with a value outside recorded time, the last sample time (tprev) was set to that value and run time calculation might be incorrect. This is a problem of the first samples for each cpus since it would skip the runtime update when tprev is 0. But with --time option it had non-zero (which is invalid) value so the calculation is also incorrect. For example, let's see the followging: $ perf sched timehist time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ ------------------------------ --------- --------- --------- 3195.968367 [0003] <idle> 0.000 0.000 0.000 3195.968386 [0002] Timer[4306/4277] 0.000 0.000 0.018 3195.968397 [0002] Web Content[4277] 0.000 0.000 0.000 3195.968595 [0001] JS Helper[4302/4277] 0.000 0.000 0.000 3195.969217 [0000] <idle> 0.000 0.000 0.621 3195.969251 [0001] kworker/1:1H[291] 0.000 0.000 0.033 The sample starts at 3195.968367 but when I gave a time interval from 3194 to 3196 (in sec) it will calculate the whole 2 second as runtime. In below, 2 cpus accounted it as runtime, other 2 cpus accounted it as idle time. Before: $ perf sched timehist --time 3194,3196 -s | tail Idle stats: CPU 0 idle for 1995.991 msec CPU 1 idle for 20.793 msec CPU 2 idle for 30.191 msec CPU 3 idle for 1999.852 msec Total number of unique tasks: 23 Total number of context switches: 128 Total run time (msec): 3724.940 After: $ perf sched timehist --time 3194,3196 -s | tail Idle stats: CPU 0 idle for 10.811 msec CPU 1 idle for 20.793 msec CPU 2 idle for 30.191 msec CPU 3 idle for 18.337 msec Total number of unique tasks: 23 Total number of context switches: 128 Total run time (msec): 18.139 Committer notes: Further testing: Before: Idle stats: CPU 0 idle for 229.785 msec CPU 1 idle for 937.944 msec CPU 2 idle for 188.931 msec CPU 3 idle for 986.185 msec After: # perf sched timehist --time 40602,40603 -s | tail Idle stats: CPU 0 idle for 229.785 msec CPU 1 idle for 175.407 msec CPU 2 idle for 188.931 msec CPU 3 idle for 223.657 msec Total number of unique tasks: 68 Total number of context switches: 814 Total run time (msec): 97.688 # for cpu in `seq 0 3` ; do echo -n "CPU $cpu idle for " ; perf sched timehist --time 40602,40603 | grep "\[000${cpu}\].*\<idle\>" | tr -s ' ' | cut -d' ' -f7 | awk '{entries++ ; s+=$1} END {print s " msec (entries: " entries ")"}' ; done CPU 0 idle for 229.721 msec (entries: 123) CPU 1 idle for 175.381 msec (entries: 65) CPU 2 idle for 188.903 msec (entries: 56) CPU 3 idle for 223.61 msec (entries: 102) Difference due to the idle stats being accounted at nanoseconds precision while the <idle> entries in 'perf sched timehist' are trucated at msec.usec. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: 853b74071110 ("perf sched timehist: Add option to specify time window of interest") Link: http://lkml.kernel.org/r/20161222060350.17655-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-22perf sched timehist: Remove hardcoded 'comm_width' check at print_summaryNamhyung Kim1-3/+0
Now that the default 'comm_width' value is 30, no need to check that at print_summary, Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org [ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-22perf sched timehist: Enlarge default 'comm_width'Namhyung Kim1-1/+1
Current default value is 20 but it's easily changed to a bigger value as task has a long name and different tid and pid. And it makes the output not aligned. So change it to have a large value as summary shows. Committer notes: Before: # perf sched record ^C # perf sched timehist <SNIP> 40602.770537 [0001] rcuos/2[29] 7.970 0.002 0.020 40602.771512 [0003] <idle> 0.003 0.000 0.986 40602.771586 [0001] <idle> 0.020 0.000 1.049 40602.771606 [0001] qemu-system-x86[3593/3510] 0.000 0.002 0.020 40602.771629 [0003] qemu-system-x86[3510] 0.000 0.003 0.116 40602.771776 [0000] <idle> 0.001 0.000 1.892 <SNIP> After: # perf sched timehist <SNIP> 40602.770537 [0001] rcuos/2[29] 7.970 0.002 0.020 40602.771512 [0003] <idle> 0.003 0.000 0.986 40602.771586 [0001] <idle> 0.020 0.000 1.049 40602.771606 [0001] qemu-system-x86[3593/3510] 0.000 0.002 0.020 40602.771629 [0003] qemu-system-x86[3510] 0.000 0.003 0.116 <SNIP> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org [ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-22perf sched timehist: Honour 'comm_width' when aligning the headersNamhyung Kim1-3/+4
Current default value is 20, but that may change in the future, so make places where we have 20 hardcoded use 'comm_width'. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org [ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-20tools lib bpf: Add bpf_prog_{attach,detach}Joe Stringer2-0/+26
Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching eBPF programs to cgroups") added these functions to samples/libbpf, but during this merge all of the samples libbpf functionality is shifting to tools/lib/bpf. Shift these functions there. Committer notes: Use bzero + attr.FIELD = value instead of 'attr = { .FIELD = value, just like the other wrapper calls to sys_bpf with bpf_attr to make this build in older toolchais, such as the ones in CentOS 5 and 6. Signed-off-by: Joe Stringer <joe@ovn.org> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-au2zvtsh55vqeo3v3uw7jr4c@git.kernel.org Link: https://github.com/joestringer/linux/commit/353e6f298c3d0a92fa8bfa61ff898c5050261a12.patch Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-20perf diff: Do not overwrite valid build idKan Liang1-1/+2
Fixes a perf diff regression issue which was introduced by commit 5baecbcd9c9a ("perf symbols: we can now read separate debug-info files based on a build ID") The binary name could be same when perf diff different binaries. Build id is used to distinguish between them. However, the previous patch assumes the same binary name has same build id. So it overwrites the build id according to the binary name, regardless of whether the build id is set or not. Check the has_build_id in dso__load. If the build id is already set, use it. Before the fix: $ perf diff 1.perf.data 2.perf.data # Event 'cycles' # # Baseline Delta Shared Object Symbol # ........ ....... ................ ............................. # 99.83% -99.80% tchain_edit [.] f2 0.12% +99.81% tchain_edit [.] f3 0.02% -0.01% [ixgbe] [k] ixgbe_read_reg After the fix: $ perf diff 1.perf.data 2.perf.data # Event 'cycles' # # Baseline Delta Shared Object Symbol # ........ ....... ................ ............................. # 99.83% +0.10% tchain_edit [.] f3 0.12% -0.08% tchain_edit [.] f2 Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> CC: Dima Kogan <dima@secretsauce.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 5baecbcd9c9a ("perf symbols: we can now read separate debug-info files based on a build ID") Link: http://lkml.kernel.org/r/1481642984-13593-1-git-send-email-kan.liang@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-20perf annotate: Don't throw error for zero length symbolsRavi Bangoria1-1/+2
'perf report --tui' exits with error when it finds a sample of zero length symbol (i.e. addr == sym->start == sym->end). Actually these are valid samples. Don't exit TUI and show report with such symbols. Reported-and-Tested-by: Anton Blanchard <anton@samba.org> Link: https://lkml.org/lkml/2016/10/8/189 Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Riyder <chris.ryder@arm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org # v4.9+ Link: http://lkml.kernel.org/r/1479804050-5028-1-git-send-email-ravi.bangoria@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-20perf bench futex: Fix lock-pi help stringDavidlohr Bueso1-1/+1
Obvious copy/paste typo from the requeue program. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Davidlohr Bueso <dbueso@suse.de> Link: http://lkml.kernel.org/r/1481830584-30909-1-git-send-email-dave@stgolabs.net Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-20perf trace: Check if MAP_32BIT is defined (again)Jiri Olsa1-0/+2
There might be systems where MAP_32BIT is not defined, like some some RHEL7 powerpc versions. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Kyle McMartin <kyle@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: 256763b01741 ("perf trace beauty mmap: Add more conditional defines") Link: http://lkml.kernel.org/r/1481831814-23683-1-git-send-email-jolsa@kernel.org [ Changed the Fixme cset to the one removing the conditional switch case for MAP_32BIT ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-18Merge tag 'libnvdimm-for-4.10' of ↵Linus Torvalds1-7/+23
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The libnvdimm pull request is relatively small this time around due to some development topics being deferred to 4.11. As for this pull request the bulk of it has been in -next for several releases leading to one late fix being added (commit 868f036fee4b ("libnvdimm: fix mishandled nvdimm_clear_poison() return value")). It has received a build success notification from the 0day-kbuild robot and passes the latest libnvdimm unit tests. Summary: - Dynamic label support: To date namespace label support has been limited to disambiguating cases where PMEM (direct load/store) and BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added support for multiple namespaces per PMEM-region there is value to support namespace labels even in the non-aliasing case. The presence of a valid namespace index block force-enables label support when the kernel would otherwise rely on region boundaries, and permits the region to be sub-divided. - Handle media errors in namespace metadata: Complement the error handling for media errors in namespace data areas with support for clearing errors on writes, and downgrading potential machine-check exceptions to simple i/o errors on read. - Device-DAX region attributes: Add 'align', 'id', and 'size' as attributes for device-dax regions. In particular this enables userspace tooling to generically size memory mapping and i/o operations. Prevent userspace from growing assumptions / dependencies about the parent device topology for a dax region. A libnvdimm namespace may not always be the parent device of a dax region. - Various cleanups and small fixes" * tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: add region 'id', 'size', and 'align' attributes libnvdimm: fix mishandled nvdimm_clear_poison() return value libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held libnvdimm, pfn: fix align attribute libnvdimm, e820: use module_platform_driver libnvdimm, namespace: use octal for permissions libnvdimm, namespace: avoid multiple sector calculations libnvdimm: remove else after return in nsio_rw_bytes() libnvdimm, namespace: fix the type of name variable libnvdimm: use consistent naming for request_mem_region() nvdimm: use the right length of "pmem" libnvdimm: check and clear poison before writing to pmem tools/testing/nvdimm: dynamic label support libnvdimm: allow a platform to force enable label support libnvdimm: use generic iostat interfaces
2016-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+29
Pull networking fixes and cleanups from David Miller: 1) Revert bogus nla_ok() change, from Alexey Dobriyan. 2) Various bpf validator fixes from Daniel Borkmann. 3) Add some necessary SET_NETDEV_DEV() calls to hsis_femac and hip04 drivers, from Dongpo Li. 4) Several ethtool ksettings conversions from Philippe Reynes. 5) Fix bugs in inet port management wrt. soreuseport, from Tom Herbert. 6) XDP support for virtio_net, from John Fastabend. 7) Fix NAT handling within a vrf, from David Ahern. 8) Endianness fixes in dpaa_eth driver, from Claudiu Manoil * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (63 commits) net: mv643xx_eth: fix build failure isdn: Constify some function parameters mlxsw: spectrum: Mark split ports as such cgroup: Fix CGROUP_BPF config qed: fix old-style function definition net: ipv6: check route protocol when deleting routes r6040: move spinlock in r6040_close as SOFTIRQ-unsafe lock order detected irda: w83977af_ir: cleanup an indent issue net: sfc: use new api ethtool_{get|set}_link_ksettings net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings net: chelsio: cxgb3: use new api ethtool_{get|set}_link_ksettings net: chelsio: cxgb2: use new api ethtool_{get|set}_link_ksettings bpf: fix mark_reg_unknown_value for spilled regs on map value marking bpf: fix overflow in prog accounting bpf: dynamically allocate digest scratch buffer gtp: Fix initialization of Flags octet in GTPv1 header gtp: gtp_check_src_ms_ipv4() always return success net/x25: use designated initializers isdn: use designated initializers ...
2016-12-17Merge branch 'kbuild' of ↵Linus Torvalds12-16/+12
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: - prototypes for x86 asm-exported symbols (Adam Borowski) and a warning about missing CRCs (Nick Piggin) - asm-exports fix for LTO (Nicolas Pitre) - thin archives improvements (Nick Piggin) - linker script fix for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (Nick Piggin) - genksyms support for __builtin_va_list keyword - misc minor fixes * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: x86/kbuild: enable modversions for symbols exported from asm kbuild: fix scripts/adjust_autoksyms.sh* for the no modules case scripts/kallsyms: remove last remnants of --page-offset option make use of make variable CURDIR instead of calling pwd kbuild: cmd_export_list: tighten the sed script kbuild: minor improvement for thin archives build kbuild: modpost warn if export version crc is missing kbuild: keep data tables through dead code elimination kbuild: improve linker compatibility with lib-ksyms.o build genksyms: Regenerate parser kbuild/genksyms: handle va_list type kbuild: thin archives for multi-y targets kbuild: kallsyms allow 3-pass generation if symbols size has changed
2016-12-17Merge branch 'for-4.10/libnvdimm' into libnvdimm-for-nextDan Williams1-7/+23
2016-12-17bpf, test_verifier: fix a test case error result on unprivilegedDaniel Borkmann1-1/+1
Running ./test_verifier as unprivileged lets 1 out of 98 tests fail: [...] #71 unpriv: check that printk is disallowed FAIL Unexpected error message! 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r1 = r10 2: (07) r1 += -8 3: (b7) r2 = 8 4: (bf) r3 = r1 5: (85) call bpf_trace_printk#6 unknown func bpf_trace_printk#6 [...] The test case is correct, just that the error outcome changed with ebb676daa1a3 ("bpf: Print function name in addition to function id"). Same as with e00c7b216f34 ("bpf: fix multiple issues in selftest suite and samples") issue 2), so just fix up the function name. Fixes: ebb676daa1a3 ("bpf: Print function name in addition to function id") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-17bpf: fix regression on verifier pruning wrt map lookupsDaniel Borkmann1-0/+28
Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") introduced a regression where existing programs stopped loading due to reaching the verifier's maximum complexity limit, whereas prior to this commit they were loading just fine; the affected program has roughly 2k instructions. What was found is that state pruning couldn't be performed effectively anymore due to mismatches of the verifier's register state, in particular in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per se, but rather that verifier needs to perform a lot more work for the same program with regards to involved map lookups. Since commit 57a09bf0a416 is only about tracking registers with type PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers until they are promoted through pattern matching with a NULL check to either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the id becomes irrelevant for the transitioned types. For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(), but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even transferred further into other types that don't make use of it. Among others, one example is where UNKNOWN_VALUE is set on function call return with RET_INTEGER return type. states_equal() will then fall through the memcmp() on register state; note that the second memcmp() uses offsetofend(), so the id is part of that since d2a4dd37f6b4 ("bpf: fix state equivalence"). But the bisect pointed already to 57a09bf0a416, where we really reach beyond complexity limit. What I found was that states_equal() often failed in this case due to id mismatches in spilled regs with registers in type PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform a memcmp() on their reg state and don't have any other optimizations in place, therefore also id was relevant in this case for making a pruning decision. We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE. For the affected program, it resulted in a ~17 fold reduction of complexity and let the program load fine again. Selftest suite also runs fine. The only other place where env->id_gen is used currently is through direct packet access, but for these cases id is long living, thus a different scenario. Also, the current logic in mark_map_regs() is not fully correct when marking NULL branch with UNKNOWN_VALUE. We need to cache the destination reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE, it's id is reset and any subsequent registers that hold the original id and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE anymore, since mark_map_reg() reuses the uncached regs[regno].id that was just overridden. Note, we don't need to cache it outside of mark_map_regs(), since it's called once on this_branch and the other time on other_branch, which are both two independent verifier states. A test case for this is added here, too. Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-16Merge tag 'powerpc-4.10-1' of ↵Linus Torvalds53-337/+3307
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for the kexec_file_load() syscall, which is a prereq for secure and trusted boot. - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN). - Sort the exception tables at build time, to save time at boot, and store them as relative offsets to save space in the kernel image & memory. - Allow building the kernel with thin archives, which should allow us to build an allyesconfig once some other fixes land. - Build fixes to allow us to correctly rebuild when changing the kernel endian from big to little or vice versa. - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix. - Initial stack protector support (-fstack-protector). - Support for dumping the radix (aka. Linux) and hash page tables via debugfs. - Fix an oops in cxl coredump generation when cxl_get_fd() is used. - Freescale updates from Scott: "Highlights include 8xx hugepage support, qbman fixes/cleanup, device tree updates, and some misc cleanup." - Many and varied fixes and minor enhancements as always. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain" [ And thanks to Michael, who took time off from a new baby to get this pull request done. - Linus ] * tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits) powerpc/fsl/dts: add FMan node for t1042d4rdb powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb powerpc/fsl/dts: add QMan and BMan nodes on t1024 powerpc/fsl/dts: add QMan and BMan nodes on t1023 soc/fsl/qman: test: use DEFINE_SPINLOCK() powerpc/fsl-lbc: use DEFINE_SPINLOCK() powerpc/8xx: Implement support of hugepages powerpc: get hugetlbpage handling more generic powerpc: port 64 bits pgtable_cache to 32 bits powerpc/boot: Request no dynamic linker for boot wrapper soc/fsl/bman: Use resource_size instead of computation soc/fsl/qe: use builtin_platform_driver powerpc/fsl_pmc: use builtin_platform_driver powerpc/83xx/suspend: use builtin_platform_driver powerpc/ftrace: Fix the comments for ftrace_modify_code powerpc/perf: macros for power9 format encoding powerpc/perf: power9 raw event format encoding powerpc/perf: update attribute_group data structure powerpc/perf: factor out the event format field powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown ...