summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-01-14tracing: Add ustring operation to filtering string pointersSteven Rostedt1-24/+57
Since referencing user space pointers is special, if the user wants to filter on a field that is a pointer to user space, then they need to specify it. Add a ".ustring" attribute to the field name for filters to state that the field is pointing to user space such that the kernel can take the appropriate action to read that pointer. Link: https://lore.kernel.org/all/yt9d8rvmt2jq.fsf@linux.ibm.com/ Fixes: 77360f9bbc7e ("tracing: Add test for user space strings when filtering on string pointers") Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() failsNikita Yushchenko1-4/+16
If start_per_cpu_kthreads() called from osnoise_workload_start() returns error, event hooks are left in broken state: unhook_irq_events() called but unhook_thread_events() and unhook_softirq_events() not called, and trace_osnoise_callback_enabled flag not cleared. On the next tracer enable, hooks get not installed due to trace_osnoise_callback_enabled flag. And on the further tracer disable an attempt to remove non-installed hooks happened, hitting a WARN_ON_ONCE() in tracepoint_remove_func(). Fix the error path by adding the missing part of cleanup. While at this, introduce osnoise_unhook_events() to avoid code duplication between this error path and normal tracer disable. Link: https://lkml.kernel.org/r/20220109153459.3701773-1-nikita.yushchenko@virtuozzo.com Cc: stable@vger.kernel.org Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing: Remove duplicate warnings when calling trace_create_file()Yuntao Wang1-9/+3
Since the same warning message is already printed in the trace_create_file() function, there is no need to print it again. Link: https://lkml.kernel.org/r/20220109162232.361747-1-ytcoode@gmail.com Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing/kprobes: 'nmissed' not showed correctly for kretprobeXiangyang Zhang1-1/+4
The 'nmissed' column of the 'kprobe_profile' file for kretprobe is not showed correctly, kretprobe can be skipped by two reasons, shortage of kretprobe_instance which is counted by tk->rp.nmissed, and kprobe itself is missed by some reason, so to show the sum. Link: https://lkml.kernel.org/r/20220107150242.5019-1-xyz.sun.ok@gmail.com Cc: stable@vger.kernel.org Fixes: 4a846b443b4e ("tracing/kprobes: Cleanup kprobe tracer code") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Xiangyang Zhang <xyz.sun.ok@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing: Add test for user space strings when filtering on string pointersSteven Rostedt1-3/+63
Pingfan reported that the following causes a fault: echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter echo 1 > events/syscalls/sys_enter_at/enable The reason is that trace event filter treats the user space pointer defined by "filename" as a normal pointer to compare against the "cpu" string. The following bug happened: kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60 #PF: supervisor read access in kernel mode #PF: error_code(0x0001) - permissions violation PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867 Oops: 0001 [#1] PREEMPT SMP PTI CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:strlen+0x0/0x20 Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246 RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000 RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60 RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000 R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380 R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0 FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: filter_pred_pchar+0x18/0x40 filter_match_preds+0x31/0x70 ftrace_syscall_enter+0x27a/0x2c0 syscall_trace_enter.constprop.0+0x1aa/0x1d0 do_syscall_64+0x16/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa861d88664 The above happened because the kernel tried to access user space directly and triggered a "supervisor read access in kernel mode" fault. Worse yet, the memory could not even be loaded yet, and a SEGFAULT could happen as well. This could be true for kernel space accessing as well. To be even more robust, test both kernel and user space strings. If the string fails to read, then simply have the filter fail. Note, TASK_SIZE is used to determine if the pointer is user or kernel space and the appropriate strncpy_from_kernel/user_nofault() function is used to copy the memory. For some architectures, the compare to TASK_SIZE may always pick user space or kernel space. If it gets it wrong, the only thing is that the filter will fail to match. In the future, this needs to be fixed to have the event denote which should be used. But failing a filter is much better than panicing the machine, and that can be solved later. Link: https://lore.kernel.org/all/20220107044951.22080-1-kernelfans@gmail.com/ Link: https://lkml.kernel.org/r/20220110115532.536088fd@gandalf.local.home Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Reported-by: Pingfan Liu <kernelfans@gmail.com> Tested-by: Pingfan Liu <kernelfans@gmail.com> Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings") Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing: Have syscall trace events use trace_event_buffer_lock_reserve()Steven Rostedt1-4/+2
Currently, the syscall trace events call trace_buffer_lock_reserve() directly, which means that it misses out on some of the filtering optimizations provided by the helper function trace_event_buffer_lock_reserve(). Have the syscall trace events call that instead, as it was missed when adding the update to use the temp buffer when filtering. Link: https://lkml.kernel.org/r/20220107225839.823118570@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events") Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13ftrace: Add test to make sure compiled time sorts workSteven Rostedt (VMware)2-0/+37
Now that ftrace function pointers are sorted at compile time, add a test that makes sure they are sorted at run time. This test is only run if it is configured in. Link: https://lkml.kernel.org/r/20211206151858.4d21a24d@gandalf.local.home Cc: Yinan Liu <yinan@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2022-01-13scripts: ftrace - move the sort-processing in ftrace_initYinan Liu1-2/+9
When the kernel starts, the initialization of ftrace takes up a portion of the time (approximately 6~8ms) to sort mcount addresses. We can save this time by moving mcount-sorting to compile time. Link: https://lkml.kernel.org/r/20211212113358.34208-2-yinan@linux.alibaba.com Signed-off-by: Yinan Liu <yinan@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing/probes: check the return value of kstrndup() for pbufXiaoke Wang1-0/+2
kstrndup() is a memory allocation-related function, it returns NULL when some internal memory errors happen. It is better to check the return value of it so to catch the memory error in time. Link: https://lkml.kernel.org/r/tencent_4D6E270731456EB88712ED7F13883C334906@qq.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing/uprobes: Check the return value of kstrdup() for tu->filenameXiaoke Wang1-0/+5
kstrdup() returns NULL when some internal memory errors happen, it is better to check the return value of it so to catch the memory error in time. Link: https://lkml.kernel.org/r/tencent_3C2E330722056D7891D2C83F29C802734B06@qq.com Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 33ea4b24277b ("perf/core: Implement the 'perf_uprobe' PMU") Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-13tracing: Account bottom half disabled sections.Sebastian Andrzej Siewior2-2/+8
Disabling only bottom halves via local_bh_disable() disables also preemption but this remains invisible to tracing. On a CONFIG_PREEMPT kernel one might wonder why there is no scheduling happening despite the N flag in the trace. The reason might be the a rcu_read_lock_bh() section. Add a 'b' to the tracing output if in task context with disabled bottom halves. Link: https://lkml.kernel.org/r/YbcbtdtC/bjCKo57@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-10tracing: Add helper functions to simplify event_command.parse() callback ↵Tom Zanussi2-0/+366
handling The event_command.parse() callback is responsible for parsing and registering triggers. The existing command implementions for this callback duplicate a lot of the same code, so to clean up and consolidate those implementations, introduce a handful of helper functions for implementors to use. This also makes it easier for new commands to be implemented and allows them to focus more on the customizations they provide rather than obscuring and complicating it with boilerplate code. Link: https://lkml.kernel.org/r/c1ff71f594d45177706571132bd3119491097221.1641823001.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-10tracing: Remove ops param from event_command reg()/unreg() callbacksTom Zanussi4-28/+20
The event_trigger_ops for an event_command are already accessible via event_trigger_data.ops so remove the redundant ops from the callback. Link: https://lkml.kernel.org/r/4c6f2a41820452f9cacddc7634ad442928aa2aa6.1641823001.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-10tracing: Change event_trigger_ops func() to trigger()Tom Zanussi4-26/+37
The name of the func() callback on event_trigger_ops is too generic and is easily confused with other callbacks with that name, so change it to something that reflects its actual purpose. In this case, the main purpose of the callback is to implement an event trigger, so call it trigger() instead. Also add some more documentation to event_trigger_ops describing the callbacks a bit better. Link: https://lkml.kernel.org/r/36ab812e3ee74ee03ae0043fda41a858ee728c00.1641823001.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2022-01-10tracing: Change event_command func() to parse()Tom Zanussi4-39/+44
The name of the func() callback on event_command is too generic and is easily confused with other callbacks with that name, so change it to something that reflects its actual purpose. In this case, the main purpose of the callback is to parse an event command, so call it parse() instead. Link: https://lkml.kernel.org/r/7784e321840752ed88aac0b349c0c685fc9247b1.1641823001.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2021-12-11tracing: Use trace_iterator_reset() in tracing_read_pipe()Steven Rostedt (VMware)1-2/+1
Currently tracing_read_pipe() open codes trace_iterator_reset(). Just have it use trace_iterator_reset() instead. Link: https://lkml.kernel.org/r/20211210202616.64d432d2@gandalf.local.home Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-11tracing: Use memset_startat helper in trace_iterator_reset()Xiu Jianfeng1-8/+1
Make use of memset_startat helper to simplify the code, there should be no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20211210012245.207489-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-11tracing: Do not let synth_events block other dyn_event systems during createBeau Belgrave1-6/+7
synth_events is returning -EINVAL if the dyn_event create command does not contain ' \t'. This prevents other systems from getting called back. synth_events needs to return -ECANCELED in these cases when the command is not targeting the synth_event system. Link: https://lore.kernel.org/linux-trace-devel/20210930223821.11025-1-beaub@linux.microsoft.com Fixes: c9e759b1e8456 ("tracing: Rework synthetic event command parsing") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-11tracing: Iterate trace_[ku]probe objects directlyJiri Olsa2-24/+12
As suggested by Linus [1] using list_for_each_entry to iterate directly trace_[ku]probe objects so we can skip another call to container_of in these loops. [1] https://lore.kernel.org/r/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com Link: https://lkml.kernel.org/r/20211125202852.406405-1-jolsa@kernel.org Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-07tracing: Make trace_marker{,_raw} stream-likeJohn Keeping1-10/+8
The tracing marker files are write-only streams with no meaningful concept of file position. Using stream_open() to mark them as stream-link indicates this and has the added advantage that a single file descriptor can now be used from multiple threads without contention thanks to clearing FMODE_ATOMIC_POS. Note that this has the potential to break existing userspace by since both lseek(2) and pwrite(2) will now return ESPIPE when previously lseek would have updated the stored offset and pwrite would have appended to the trace. A survey of libtracefs and several other projects found to use trace_marker(_raw) [1][2][3] suggests that everyone limits themselves to calling write(2) and close(2) on these file descriptors so there is a good chance this will go unnoticed and the benefits of reduced overhead and lock contention seem worth the risk. [1] https://github.com/google/perfetto [2] https://github.com/intel/media-driver/ [3] https://w1.fi/cgit/hostap/ Link: https://lkml.kernel.org/r/20211207142558.347029-1-john@metanate.com Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Switch to kvfree_rcu() APIUladzislau Rezki (Sony)2-4/+2
Instead of invoking a synchronize_rcu() to free a pointer after a grace period we can directly make use of new API that does the same but in more efficient way. Link: https://lkml.kernel.org/r/20211124110308.2053-10-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Fix synth_event_add_val() kernel-doc commentQiujun Huang1-1/+1
It's named field here. Link: https://lkml.kernel.org/r/20210516022410.64271-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing/uprobes: Use trace_event_buffer_reserve() helperSteven Rostedt (VMware)1-7/+4
To be consistent with kprobes and eprobes, use trace_event_buffer_reserver() and trace_event_buffer_commit(). This will ensure that any updates to trace events will also be implemented on uprobe events. Link: https://lkml.kernel.org/r/20211206162440.69fbf96c@gandalf.local.home Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing/kprobes: Do not open code event reserve logicSteven Rostedt (VMware)1-18/+7
As kprobe events use trace_event_buffer_commit() to commit the event to the ftrace ring buffer, for consistency, it should use trace_event_buffer_reserve() to allocate it, as the two functions are related. Link: https://lkml.kernel.org/r/20211130024319.257430762@goodmis.org Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Have eprobes use filtering logic of trace eventsSteven Rostedt (VMware)1-11/+5
The eprobes open code the reserving of the event on the ring buffer for ftrace instead of using the ftrace event wrappers, which means that it doesn't get affected by the filters, breaking the filtering logic on user space. Link: https://lkml.kernel.org/r/20211130024319.068451680@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Disable preemption when using the filter bufferSteven Rostedt (VMware)2-27/+36
In case trace_event_buffer_lock_reserve() is called with preemption enabled, the algorithm that defines the usage of the per cpu filter buffer may fail if the task schedules to another CPU after determining which buffer it will use. Disable preemption when using the filter buffer. And because that same buffer must be used throughout the call, keep preemption disabled until the filter buffer is released. This will also keep the semantics between the use case of when the filter buffer is used, and when the ring buffer itself is used, as that case also disables preemption until the ring buffer is released. Link: https://lkml.kernel.org/r/20211130024318.880190623@goodmis.org [ Fixed warning of assignment in if statement Reported-by: kernel test robot <lkp@intel.com> ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Use __this_cpu_read() in trace_event_buffer_lock_reserver()Steven Rostedt (VMware)1-1/+1
The value read by this_cpu_read() is used later and its use is expected to stay on the same CPU as being read. But this_cpu_read() does not warn if it is called without preemption disabled, where as __this_cpu_read() will check if preemption is disabled on CONFIG_DEBUG_PREEMPT Currently all callers have preemption disabled, but there may be new callers in the future that may not. Link: https://lkml.kernel.org/r/20211130024318.698165354@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Add '__rel_loc' using trace event macrosMasami Hiramatsu1-0/+3
Add '__rel_loc' using trace event macros. These macros are usually not used in the kernel, except for testing purpose. This also add "rel_" variant of macros for dynamic_array string, and bitmask. Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2 Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Support __rel_loc relative dynamic data location attributeMasami Hiramatsu4-6/+59
Add '__rel_loc' new dynamic data location attribute which encodes the data location from the next to the field itself. The '__data_loc' is used for encoding the dynamic data location on the trace event record. But '__data_loc' is not useful if the writer doesn't know the event header (e.g. user event), because it records the dynamic data offset from the entry of the record, not the field itself. This new '__rel_loc' attribute encodes the data location relatively from the next of the field. For example, when there is a record like below (the number in the parentheses is the size of fields) |header(N)|common(M)|fields(K)|__data_loc(4)|fields(L)|data(G)| In this case, '__data_loc' field will be __data_loc = (G << 16) | (N+M+K+4+L) If '__rel_loc' is used, this will be |header(N)|common(M)|fields(K)|__rel_loc(4)|fields(L)|data(G)| where __rel_loc = (G << 16) | (L) This case shows L bytes after the '__rel_loc' attribute field, if there is no fields after the __rel_loc field, L must be 0. This is relatively easy (and no need to consider the kernel header change) when the event data fields are composed by user who doesn't know header and common fields. Link: https://lkml.kernel.org/r/163757341258.510314.4214431827833229956.stgit@devnote2 Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06tracing: Fix spelling mistake "aritmethic" -> "arithmetic"Colin Ian King1-1/+1
There is a spelling mistake in the tracing mini-HOWTO text. Fix it. Link: https://lkml.kernel.org/r/20211108201513.42876-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-05Merge tag 'timers_urgent_for_v5.16_rc4' of ↵Linus Torvalds2-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Prevent a tick storm when a dedicated timekeeper CPU in nohz_full mode runs for prolonged periods with interrupts disabled and ends up programming the next tick in the past, leading to that storm * tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/nohz: Last resort update jiffies on nohz_full IRQ entry
2021-12-05Merge tag 'sched_urgent_for_v5.16_rc4' of ↵Linus Torvalds2-6/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Properly init uclamp_flags of a runqueue, on first enqueuing - Fix preempt= callback return values - Correct utime/stime resource usage reporting on nohz_full to return the proper times instead of shorter ones * tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/uclamp: Fix rq->uclamp_max not set on first enqueue preempt/dynamic: Fix setup_preempt_mode() return value sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
2021-12-04sched/uclamp: Fix rq->uclamp_max not set on first enqueueQais Yousef1-1/+1
Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit d81ae8aac85c changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
2021-12-04preempt/dynamic: Fix setup_preempt_mode() return valueAndrew Halaney1-2/+2
__setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
2021-12-02sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_fullFrederic Weisbecker1-3/+9
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime than the actual time. task_cputime_adjusted() snapshots utime and stime and then adjust their sum to match the scheduler maintained cputime.sum_exec_runtime. Unfortunately in nohz_full, sum_exec_runtime is only updated once per second in the worst case, causing a discrepancy against utime and stime that can be updated anytime by the reader using vtime. To fix this situation, perform an update of cputime.sum_exec_runtime when the cputime snapshot reports the task as actually running while the tick is disabled. The related overhead is then contained within the relevant situations. Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
2021-12-02timers/nohz: Last resort update jiffies on nohz_full IRQ entryFrederic Weisbecker2-1/+9
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU is guaranteed to stay online and to never stop its tick. Meanwhile on some rare case, the dedicated timekeeper may be running with interrupts disabled for a while, such as in stop_machine. If jiffies stop being updated, a nohz_full CPU may end up endlessly programming the next tick in the past, taking the last jiffies update monotonic timestamp as a stale base, resulting in an tick storm. Here is a scenario where it matters: 0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU. 1) A stop machine callback is queued to execute somewhere. 2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1 can still take IRQs. 3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward. 4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking last_jiffies_update as a base. But last_jiffies_update hasn't been updated for 2 jiffies since the timekeeper has interrupts disabled. 5) clockevents_program_event(), which relies on ktime_get(), observes that the expiration is in the past and therefore programs the min delta event on the clock. 6) The tick fires immediately, goto 3) 7) Tick storm, the nohz_full CPU is drown and takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation. Solve this with unconditionally updating jiffies if the value is stale on nohz_full IRQ entry. IRQs and other disturbances are expected to be rare enough on nohz_full for the unconditional call to ktime_get() to actually matter. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
2021-12-01kprobes: Limit max data_size of the kretprobe instancesMasami Hiramatsu1-0/+3
The 'kprobe::data_size' is unsigned, thus it can not be negative. But if user sets it enough big number (e.g. (size_t)-8), the result of 'data_size + sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct kretprobe_instance) or zero. In result, the kretprobe_instance are allocated without enough memory, and kretprobe accesses outside of allocated memory. To avoid this issue, introduce a max limitation of the kretprobe::data_size. 4KB per instance should be OK. Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: f47cd9b553aa ("kprobes: kretprobe user entry-handler") Reported-by: zhangyue <zhangyue1@kylinos.cn> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-01tracing: Fix a kmemleak false positive in tracing_mapChen Jun1-0/+3
Doing the command: echo 'hist:key=common_pid.execname,common_timestamp' > /sys/kernel/debug/tracing/events/xxx/trigger Triggers many kmemleak reports: unreferenced object 0xffff0000c7ea4980 (size 128): comm "bash", pid 338, jiffies 4294912626 (age 9339.324s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0 [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178 [<00000000633bd154>] tracing_map_init+0x1f8/0x268 [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0 [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128 [<00000000f549355a>] event_trigger_write+0x7c/0x120 [<00000000b80f898d>] vfs_write+0xc4/0x380 [<00000000823e1055>] ksys_write+0x74/0xf8 [<000000008a9374aa>] __arm64_sys_write+0x24/0x30 [<0000000087124017>] do_el0_svc+0x88/0x1c0 [<00000000efd0dcd1>] el0_svc+0x1c/0x28 [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0 [<00000000e7399680>] el0_sync+0x148/0x180 unreferenced object 0xffff0000c7ea4980 (size 128): comm "bash", pid 338, jiffies 4294912626 (age 9339.324s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0 [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178 [<00000000633bd154>] tracing_map_init+0x1f8/0x268 [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0 [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128 [<00000000f549355a>] event_trigger_write+0x7c/0x120 [<00000000b80f898d>] vfs_write+0xc4/0x380 [<00000000823e1055>] ksys_write+0x74/0xf8 [<000000008a9374aa>] __arm64_sys_write+0x24/0x30 [<0000000087124017>] do_el0_svc+0x88/0x1c0 [<00000000efd0dcd1>] el0_svc+0x1c/0x28 [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0 [<00000000e7399680>] el0_sync+0x148/0x180 The reason is elts->pages[i] is alloced by get_zeroed_page. and kmemleak will not scan the area alloced by get_zeroed_page. The address stored in elts->pages will be regarded as leaked. That is, the elts->pages[i] will have pointers loaded onto it as well, and without telling kmemleak about it, those pointers will look like memory without a reference. To fix this, call kmemleak_alloc to tell kmemleak to scan elts->pages[i] Link: https://lkml.kernel.org/r/20211124140801.87121-1-chenjun102@huawei.com Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-01tracing/histograms: String compares should not care about signed valuesSteven Rostedt (VMware)1-1/+1
When comparing two strings for the "onmatch" histogram trigger, fields that are strings use string comparisons, which do not care about being signed or not. Do not fail to match two string fields if one is unsigned char array and the other is a signed char array. Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/ Cc: stable@vgerk.kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Yafang Shao <laoar.shao@gmail.com> Fixes: b05e89ae7cf3b ("tracing: Accept different type for synthetic event fields") Reviewed-by: Masami Hiramatsu <mhiramatsu@kernel.org> Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-28Merge tag 'sched-urgent-2021-11-28' of ↵Linus Torvalds2-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "A single scheduler fix to ensure that there is no stale KASAN shadow state left on the idle task's stack when a CPU is brought up after it was brought down before" * tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/scs: Reset task stack state in bringup_cpu()
2021-11-28Merge tag 'perf-urgent-2021-11-28' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Thomas Gleixner: "A single fix for perf to prevent it from sending SIGTRAP to another task from a trace point event as it's not possible to deliver a synchronous signal to a different task from there" * tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ignore sigtrap for tracepoints destined for other tasks
2021-11-28Merge tag 'locking-urgent-2021-11-28' of ↵Linus Torvalds1-93/+89
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "Two regression fixes for reader writer semaphores: - Plug a race in the lock handoff which is caused by inconsistency of the reader and writer path and can lead to corruption of the underlying counter. - down_read_trylock() is suboptimal when the lock is contended and multiple readers trylock concurrently. That's due to the initial value being read non-atomically which results in at least two compare exchange loops. Making the initial readout atomic reduces this significantly. Whith 40 readers by 11% in a benchmark which enforces contention on mmap_sem" * tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Optimize down_read_trylock() under highly contended case locking/rwsem: Make handoff bit handling more consistent
2021-11-28Merge tag 'trace-v5.16-rc2-3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull another tracing fix from Steven Rostedt: "Fix the fix of pid filtering The setting of the pid filtering flag tested the "trace only this pid" case twice, and ignored the "trace everything but this pid" case. The 5.15 kernel does things a little differently due to the new sparse pid mask introduced in 5.16, and as the bug was discovered running the 5.15 kernel, and the first fix was initially done for that kernel, that fix handled both cases (only pid and all but pid), but the forward port to 5.16 created this bug" * tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Test the 'Do not trace this pid' case in create event
2021-11-27tracing: Test the 'Do not trace this pid' case in create eventSteven Rostedt (VMware)1-1/+1
When creating a new event (via a module, kprobe, eprobe, etc), the descriptors that are created must add flags for pid filtering if an instance has pid filtering enabled, as the flags are used at the time the event is executed to know if pid filtering should be done or not. The "Only trace this pid" case was added, but a cut and paste error made that case checked twice, instead of checking the "Trace all but this pid" case. Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/ Fixes: 6cb206508b62 ("tracing: Check pid filtering when creating events") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-27Merge tag 'trace-v5.16-rc2-2' of ↵Linus Torvalds2-6/+30
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Two fixes to event pid filtering: - Make sure newly created events reflect the current state of pid filtering - Take pid filtering into account when recording trigger events. (Also clean up the if statement to be cleaner)" * tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix pid filtering when triggers are attached tracing: Check pid filtering when creating events
2021-11-26tracing: Fix pid filtering when triggers are attachedSteven Rostedt (VMware)1-6/+18
If a event is filtered by pid and a trigger that requires processing of the event to happen is a attached to the event, the discard portion does not take the pid filtering into account, and the event will then be recorded when it should not have been. Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-26Merge tag 'pm-5.16-rc3' of ↵Linus Torvalds2-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These address three issues in the intel_pstate driver and fix two problems related to hibernation. Specifics: - Make intel_pstate work correctly on Ice Lake server systems with out-of-band performance control enabled (Adamos Ttofari). - Fix EPP handling in intel_pstate during CPU offline and online in the active mode (Rafael Wysocki). - Make intel_pstate support ITMT on asymmetric systems with overclocking enabled (Srinivas Pandruvada). - Fix hibernation image saving when using the user space interface based on the snapshot special device file (Evan Green). - Make the hibernation code release the snapshot block device using the same mode that was used when acquiring it (Thomas Zeitlhofer)" * tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: hibernate: Fix snapshot partial write lengths PM: hibernate: use correct mode for swsusp_close() cpufreq: intel_pstate: ITMT support for overclocked system cpufreq: intel_pstate: Fix active mode offline/online EPP handling cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
2021-11-26tracing: Check pid filtering when creating eventsSteven Rostedt (VMware)1-0/+12
When pid filtering is activated in an instance, all of the events trace files for that instance has the PID_FILTER flag set. This determines whether or not pid filtering needs to be done on the event, otherwise the event is executed as normal. If pid filtering is enabled when an event is created (via a dynamic event or modules), its flag is not updated to reflect the current state, and the events are not filtered properly. Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-24PM: hibernate: Fix snapshot partial write lengthsEvan Green1-1/+1
snapshot_write() is inappropriately limiting the amount of data that can be written in cases where a partial page has already been written. For example, one would expect to be able to write 1 byte, then 4095 bytes to the snapshot device, and have both of those complete fully (since now we're aligned to a page again). But what ends up happening is we write 1 byte, then 4094/4095 bytes complete successfully. The reason is that simple_write_to_buffer()'s second argument is the total size of the buffer, not the size of the buffer minus the offset. Since simple_write_to_buffer() accounts for the offset in its implementation, snapshot_write() can just pass the full page size directly down. Signed-off-by: Evan Green <evgreen@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-11-24PM: hibernate: use correct mode for swsusp_close()Thomas Zeitlhofer1-3/+3
Commit 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") changed the opening mode of the block device to (FMODE_READ | FMODE_EXCL). In the corresponding calls to swsusp_close(), the mode is still just FMODE_READ which triggers the warning in blkdev_flush_mapping() on resume from hibernate. So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the device. Fixes: 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>