Age | Commit message (Collapse) | Author | Files | Lines |
|
The metrics "LLC Ld Miss" and "Load Dram" overlap with each other for
accouting items:
"LLC Ld Miss" = "lcl_dram" + "rmt_dram" + "rmt_hit" + "rmt_hitm"
"Load Dram" = "lcl_dram" + "rmt_dram"
Furthermore, the metrics "LLC Ld Miss" is not directive to show
statistics due to it contains summary value and cannot give out
breakdown details.
For this reason, add a new metrics "RMT Load Hit" which is used to
present the remote cache hit; it contains two items:
"RMT Load Hit" = remote hit ("rmt_hit") + remote hitm ("rmt_hitm")
As result, the metrics "LLC Ld Miss" is perfectly divided into two
metrics "RMT Load Hit" and "Load Dram". It's not necessary to keep
metrics "LLC Ld Miss", so remove it.
Before:
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- LLC --- Load Dram ----
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm Ld Miss Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ....... ........ ........
#
0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 548 2615 66 169 481 0 0 0
1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 187 361 27 11 78 0 0 0
2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 131 0 10 263 1 0 0 0
After:
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 548 2615 66 169 481 0 0 0 0
1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 187 361 27 11 78 0 0 0 0
2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 131 0 10 263 1 0 0 0 0
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-9-leo.yan@linaro.org
|
|
"rmt_hit" is accounted into two metrics: one is accounted into the
metrics "LLC Ld Miss" (see the function llc_miss() for calculation
"llcmiss"); and it's accounted into metrics "LLC Load Hit". Thus,
for the literal meaning, it is contradictory that "rmt_hit" is
accounted for both "LLC Ld Miss" (LLC miss) and "LLC Load Hit"
(LLC hit).
Thus this is easily to introduce confusion: "LLC Load Hit" gives
impression that all items belong to it are LLC hit; in fact "rmt_hit"
is LLC miss and remote cache hit.
To give out clear semantics for metric "LLC Load Hit", "rmt_hit" is
moved out from it and changes "LLC Load Hit" to contain two items:
LLC Load Hit = LLC's hit ("ld_llchit") + LLC's hitm ("lcl_hitm")
For output alignment, adjusts the header for "LLC Load Hit".
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-8-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Replace the header string "Lcl" with "LclHit", which is more explicit
to express the event type is LLC local hit.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-7-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Local and remote HITM use the headers 'Lcl' and 'Rmt' respectively,
suppose if we want to extend the tool to display these two dimensions
under any one metrics, users cannot understand the semantics if only
based on the header string 'Lcl' or 'Rmt'.
To explicit express the meaning for HITM items, this patch changes the
headers string as "LclHitm" and "RmtHitm", the strings are more readable
and this allows to extend metrics for using HITM items.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-6-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The metrics "LLC Load Hitm" contains two items: one is "local Hitm" and
another is "remote Hitm".
"local Hitm" means: L3 HIT and was serviced by another processor core
with a cross core snoop where modified copies were found; it's no doubt
that "local Hitm" belongs to LLC access.
But for "remote Hitm", based on the code in util/mem-events, it's the
event for remote cache HIT and was serviced by another processor core
with modified copies. Thus the remote Hitm is a remote cache's hit and
actually it's LLC load miss.
Now the display format gives users the impression that "local Hitm" and
"remote Hitm" both belong to the LLC load, but this is not the fact as
described.
This patch changes the header from "LLC Load Hitm" to "Load Hitm", this
can avoid the give the wrong impression that all Hitm belong to LLC.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-5-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The metrics are not organized based on memory hierarchy, e.g. the tool
doesn't organize the metrics order based on memory nodes from the close
node (e.g. L1/L2 cache) to far node (e.g. L3 cache and DRAM).
To output metrics with more friendly form, this patch refines the
metrics order based on memory hierarchy:
"Core Load Hit" => "LLC Load Hit" => "LLC Ld Miss" => "Load Dram"
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-4-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The total stores is displayed under the metrics "Store Reference", to
output the same format with total records and all loads, extract the
total stores number as a standalone metrics "Total Stores".
After this patch, the tool shows the summary numbers ("Total records",
"Total loads", "Total Stores") in the unified form.
Before:
# ----------- Cacheline ---------- Tot ----- LLC Load Hitm ----- Total Total ---- Store Reference ---- --- Load Dram ---- LLC ----- Core Load Hit ----- -- LLC Load Hit --
# Index Address Node PA cnt Hitm Total Lcl Rmt records Loads Total L1Hit L1Miss Lcl Rmt Ld Miss FB L1 L2 Llc Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ........ ........
#
0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 0 0 0 548 2615 66 169 0
1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 0 0 0 187 361 27 11 0
2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 0 0 0 131 0 10 263 0
After:
# ----------- Cacheline ---------- Tot ----- LLC Load Hitm ----- Total Total Total ---- Stores ---- --- Load Dram ---- LLC ----- Core Load Hit ----- -- LLC Load Hit --
# Index Address Node PA cnt Hitm Total Lcl Rmt records Loads Stores L1Hit L1Miss Lcl Rmt Ld Miss FB L1 L2 Llc Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ........ ........
#
0 0x55f07d580100 0 1499 85.89% 481 481 0 7243 3879 3364 2599 765 0 0 0 548 2615 66 169 0
1 0x55f07d580080 0 1 13.93% 78 78 0 664 664 0 0 0 0 0 0 187 361 27 11 0
2 0x55f07d5800c0 0 1 0.18% 1 1 0 405 405 0 0 0 0 0 0 131 0 10 263 0
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-3-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
To view the statistics with "breakdown" mode, it's good to show the
summary numbers for the total records, all stores and all loads, then
the sequential conlumns can be used to break into more detailed items.
To achieve this purpose, this patch displays the summary numbers for
records/stores/loads continuously and places them before breakdown
items, this can allow uses to easily read the summarized statistics.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201014050921.5591-2-leo.yan@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The existing approach to synchronization between threads in the numa
benchmark is unbalanced mutexes.
This synchronization causes thread sanitizer to warn of locks being
taken twice on a thread without an unlock, as well as unlocks with no
corresponding locks.
This change replaces the synchronization with more regular condition
variables.
While this fixes one class of thread sanitizer warnings, there still
remain warnings of data races due to threads reading and writing shared
memory without any atomics.
Committer testing:
Basic run on a non-NUMA machine.
# perf bench numa
# List of available benchmarks for collection 'numa':
mem: Benchmark for NUMA workloads
all: Run all NUMA benchmarks
# perf bench numa all
# Running numa/mem benchmark...
# Running main, "perf bench numa numa-mem"
#
# Running test on: Linux five 5.8.12-200.fc32.x86_64 #1 SMP Mon Sep 28 12:17:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
#
# Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
20.076 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.073 secs average thread-runtime
0.190 % difference between max/avg runtime
241.828 GB data processed, per thread
241.828 GB data processed, total
0.083 nsecs/byte/thread runtime
12.045 GB/sec/thread speed
12.045 GB/sec total speed
# Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
20.045 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.014 secs average thread-runtime
0.111 % difference between max/avg runtime
234.304 GB data processed, per thread
234.304 GB data processed, total
0.086 nsecs/byte/thread runtime
11.689 GB/sec/thread speed
11.689 GB/sec total speed
# Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
Test not applicable, system has only 1 nodes.
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
20.138 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.121 secs average thread-runtime
0.342 % difference between max/avg runtime
135.961 GB data processed, per thread
271.922 GB data processed, total
0.148 nsecs/byte/thread runtime
6.752 GB/sec/thread speed
13.503 GB/sec total speed
# Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk"
Test not applicable, system has only 1 nodes.
# Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"
Test not applicable, system has only 1 nodes.
# Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1"
0.747 secs latency to NUMA-converge
0.747 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.714 secs average thread-runtime
50.000 % difference between max/avg runtime
3.228 GB data processed, per thread
9.683 GB data processed, total
0.231 nsecs/byte/thread runtime
4.321 GB/sec/thread speed
12.964 GB/sec total speed
# Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1.127 secs latency to NUMA-converge
1.127 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.089 secs average thread-runtime
5.624 % difference between max/avg runtime
3.765 GB data processed, per thread
15.062 GB data processed, total
0.299 nsecs/byte/thread runtime
3.342 GB/sec/thread speed
13.368 GB/sec total speed
# Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
1.003 secs latency to NUMA-converge
1.003 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.889 secs average thread-runtime
50.000 % difference between max/avg runtime
2.141 GB data processed, per thread
12.847 GB data processed, total
0.469 nsecs/byte/thread runtime
2.134 GB/sec/thread speed
12.805 GB/sec total speed
# Running 2x3-convergence, "perf bench numa mem -p 2 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
1.814 secs latency to NUMA-converge
1.814 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.716 secs average thread-runtime
22.440 % difference between max/avg runtime
3.747 GB data processed, per thread
22.483 GB data processed, total
0.484 nsecs/byte/thread runtime
2.065 GB/sec/thread speed
12.393 GB/sec total speed
# Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1"
2.065 secs latency to NUMA-converge
2.065 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.947 secs average thread-runtime
25.788 % difference between max/avg runtime
2.855 GB data processed, per thread
25.694 GB data processed, total
0.723 nsecs/byte/thread runtime
1.382 GB/sec/thread speed
12.442 GB/sec total speed
# Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
1.912 secs latency to NUMA-converge
1.912 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.775 secs average thread-runtime
23.852 % difference between max/avg runtime
1.479 GB data processed, per thread
23.668 GB data processed, total
1.293 nsecs/byte/thread runtime
0.774 GB/sec/thread speed
12.378 GB/sec total speed
# Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
1.783 secs latency to NUMA-converge
1.783 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.633 secs average thread-runtime
21.960 % difference between max/avg runtime
1.345 GB data processed, per thread
21.517 GB data processed, total
1.326 nsecs/byte/thread runtime
0.754 GB/sec/thread speed
12.067 GB/sec total speed
# Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1"
5.396 secs latency to NUMA-converge
5.396 secs slowest (max) thread-runtime
4.000 secs fastest (min) thread-runtime
4.928 secs average thread-runtime
12.937 % difference between max/avg runtime
2.721 GB data processed, per thread
65.306 GB data processed, total
1.983 nsecs/byte/thread runtime
0.504 GB/sec/thread speed
12.102 GB/sec total speed
# Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1"
3.121 secs latency to NUMA-converge
3.121 secs slowest (max) thread-runtime
2.000 secs fastest (min) thread-runtime
2.836 secs average thread-runtime
17.962 % difference between max/avg runtime
1.194 GB data processed, per thread
38.192 GB data processed, total
2.615 nsecs/byte/thread runtime
0.382 GB/sec/thread speed
12.236 GB/sec total speed
# Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1"
4.302 secs latency to NUMA-converge
4.302 secs slowest (max) thread-runtime
3.000 secs fastest (min) thread-runtime
4.045 secs average thread-runtime
15.133 % difference between max/avg runtime
1.631 GB data processed, per thread
52.178 GB data processed, total
2.638 nsecs/byte/thread runtime
0.379 GB/sec/thread speed
12.128 GB/sec total speed
# Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1"
4.418 secs latency to NUMA-converge
4.418 secs slowest (max) thread-runtime
3.000 secs fastest (min) thread-runtime
4.104 secs average thread-runtime
16.045 % difference between max/avg runtime
1.664 GB data processed, per thread
53.254 GB data processed, total
2.655 nsecs/byte/thread runtime
0.377 GB/sec/thread speed
12.055 GB/sec total speed
# Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
0.973 secs latency to NUMA-converge
0.973 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.955 secs average thread-runtime
50.000 % difference between max/avg runtime
4.124 GB data processed, per thread
12.372 GB data processed, total
0.236 nsecs/byte/thread runtime
4.238 GB/sec/thread speed
12.715 GB/sec total speed
# Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
0.820 secs latency to NUMA-converge
0.820 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.808 secs average thread-runtime
50.000 % difference between max/avg runtime
2.555 GB data processed, per thread
10.220 GB data processed, total
0.321 nsecs/byte/thread runtime
3.117 GB/sec/thread speed
12.468 GB/sec total speed
# Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1"
0.667 secs latency to NUMA-converge
0.667 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.607 secs average thread-runtime
50.000 % difference between max/avg runtime
1.009 GB data processed, per thread
8.069 GB data processed, total
0.661 nsecs/byte/thread runtime
1.512 GB/sec/thread speed
12.095 GB/sec total speed
# Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1"
1.546 secs latency to NUMA-converge
1.546 secs slowest (max) thread-runtime
1.000 secs fastest (min) thread-runtime
1.485 secs average thread-runtime
17.664 % difference between max/avg runtime
1.162 GB data processed, per thread
18.594 GB data processed, total
1.331 nsecs/byte/thread runtime
0.752 GB/sec/thread speed
12.025 GB/sec total speed
# Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1"
0.812 secs latency to NUMA-converge
0.812 secs slowest (max) thread-runtime
0.000 secs fastest (min) thread-runtime
0.739 secs average thread-runtime
50.000 % difference between max/avg runtime
0.309 GB data processed, per thread
9.874 GB data processed, total
2.630 nsecs/byte/thread runtime
0.380 GB/sec/thread speed
12.166 GB/sec total speed
# Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
20.044 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.020 secs average thread-runtime
0.109 % difference between max/avg runtime
125.750 GB data processed, per thread
251.501 GB data processed, total
0.159 nsecs/byte/thread runtime
6.274 GB/sec/thread speed
12.548 GB/sec total speed
# Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
20.148 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.090 secs average thread-runtime
0.367 % difference between max/avg runtime
85.267 GB data processed, per thread
255.800 GB data processed, total
0.236 nsecs/byte/thread runtime
4.232 GB/sec/thread speed
12.696 GB/sec total speed
# Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1"
20.169 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.100 secs average thread-runtime
0.419 % difference between max/avg runtime
63.144 GB data processed, per thread
252.576 GB data processed, total
0.319 nsecs/byte/thread runtime
3.131 GB/sec/thread speed
12.523 GB/sec total speed
# Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1"
20.175 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.107 secs average thread-runtime
0.433 % difference between max/avg runtime
31.267 GB data processed, per thread
250.133 GB data processed, total
0.645 nsecs/byte/thread runtime
1.550 GB/sec/thread speed
12.398 GB/sec total speed
# Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
20.216 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.113 secs average thread-runtime
0.535 % difference between max/avg runtime
30.998 GB data processed, per thread
247.981 GB data processed, total
0.652 nsecs/byte/thread runtime
1.533 GB/sec/thread speed
12.266 GB/sec total speed
# Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1"
20.234 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.174 secs average thread-runtime
0.577 % difference between max/avg runtime
15.377 GB data processed, per thread
246.039 GB data processed, total
1.316 nsecs/byte/thread runtime
0.760 GB/sec/thread speed
12.160 GB/sec total speed
# Running 1x4-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1"
20.040 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.028 secs average thread-runtime
0.099 % difference between max/avg runtime
66.832 GB data processed, per thread
267.328 GB data processed, total
0.300 nsecs/byte/thread runtime
3.335 GB/sec/thread speed
13.340 GB/sec total speed
# Running 1x8-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1"
20.064 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.034 secs average thread-runtime
0.160 % difference between max/avg runtime
32.911 GB data processed, per thread
263.286 GB data processed, total
0.610 nsecs/byte/thread runtime
1.640 GB/sec/thread speed
13.122 GB/sec total speed
# Running 1x16-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1"
20.092 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.052 secs average thread-runtime
0.230 % difference between max/avg runtime
16.131 GB data processed, per thread
258.088 GB data processed, total
1.246 nsecs/byte/thread runtime
0.803 GB/sec/thread speed
12.845 GB/sec total speed
# Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1"
20.099 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.063 secs average thread-runtime
0.247 % difference between max/avg runtime
7.962 GB data processed, per thread
254.773 GB data processed, total
2.525 nsecs/byte/thread runtime
0.396 GB/sec/thread speed
12.676 GB/sec total speed
# Running 2x3-bw-process, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1"
20.150 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.120 secs average thread-runtime
0.372 % difference between max/avg runtime
44.827 GB data processed, per thread
268.960 GB data processed, total
0.450 nsecs/byte/thread runtime
2.225 GB/sec/thread speed
13.348 GB/sec total speed
# Running 4x4-bw-process, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1"
20.258 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.168 secs average thread-runtime
0.636 % difference between max/avg runtime
17.079 GB data processed, per thread
273.263 GB data processed, total
1.186 nsecs/byte/thread runtime
0.843 GB/sec/thread speed
13.489 GB/sec total speed
# Running 4x6-bw-process, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1"
20.559 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.382 secs average thread-runtime
1.359 % difference between max/avg runtime
10.758 GB data processed, per thread
258.201 GB data processed, total
1.911 nsecs/byte/thread runtime
0.523 GB/sec/thread speed
12.559 GB/sec total speed
# Running 4x8-bw-process, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1"
20.744 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.516 secs average thread-runtime
1.792 % difference between max/avg runtime
8.069 GB data processed, per thread
258.201 GB data processed, total
2.571 nsecs/byte/thread runtime
0.389 GB/sec/thread speed
12.447 GB/sec total speed
# Running 4x8-bw-process-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1"
20.855 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.561 secs average thread-runtime
2.050 % difference between max/avg runtime
8.069 GB data processed, per thread
258.201 GB data processed, total
2.585 nsecs/byte/thread runtime
0.387 GB/sec/thread speed
12.381 GB/sec total speed
# Running 3x3-bw-process, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1"
20.134 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.077 secs average thread-runtime
0.333 % difference between max/avg runtime
28.091 GB data processed, per thread
252.822 GB data processed, total
0.717 nsecs/byte/thread runtime
1.395 GB/sec/thread speed
12.557 GB/sec total speed
# Running 5x5-bw-process, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1"
20.588 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.375 secs average thread-runtime
1.427 % difference between max/avg runtime
10.177 GB data processed, per thread
254.436 GB data processed, total
2.023 nsecs/byte/thread runtime
0.494 GB/sec/thread speed
12.359 GB/sec total speed
# Running 2x16-bw-process, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1"
20.657 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.429 secs average thread-runtime
1.589 % difference between max/avg runtime
8.170 GB data processed, per thread
261.429 GB data processed, total
2.528 nsecs/byte/thread runtime
0.395 GB/sec/thread speed
12.656 GB/sec total speed
# Running 1x32-bw-process, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1"
22.981 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
21.996 secs average thread-runtime
6.486 % difference between max/avg runtime
8.863 GB data processed, per thread
283.606 GB data processed, total
2.593 nsecs/byte/thread runtime
0.386 GB/sec/thread speed
12.341 GB/sec total speed
# Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1"
20.047 secs slowest (max) thread-runtime
19.000 secs fastest (min) thread-runtime
20.026 secs average thread-runtime
2.611 % difference between max/avg runtime
8.441 GB data processed, per thread
270.111 GB data processed, total
2.375 nsecs/byte/thread runtime
0.421 GB/sec/thread speed
13.474 GB/sec total speed
# Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1"
20.088 secs slowest (max) thread-runtime
19.000 secs fastest (min) thread-runtime
20.025 secs average thread-runtime
2.709 % difference between max/avg runtime
8.411 GB data processed, per thread
269.142 GB data processed, total
2.388 nsecs/byte/thread runtime
0.419 GB/sec/thread speed
13.398 GB/sec total speed
# Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1"
20.293 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.175 secs average thread-runtime
0.721 % difference between max/avg runtime
7.918 GB data processed, per thread
253.374 GB data processed, total
2.563 nsecs/byte/thread runtime
0.390 GB/sec/thread speed
12.486 GB/sec total speed
# Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1"
20.411 secs slowest (max) thread-runtime
20.000 secs fastest (min) thread-runtime
20.226 secs average thread-runtime
1.006 % difference between max/avg runtime
7.931 GB data processed, per thread
253.778 GB data processed, total
2.574 nsecs/byte/thread runtime
0.389 GB/sec/thread speed
12.434 GB/sec total speed
#
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201012161611.366482-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The event code for events referencing std arch events is incorrectly
evaluated in json_events().
The issue is that je.event is evaluated properly from try_fixup(), but
later NULLified from the real_event() call, as "event" may be NULL.
Fix by setting "event" same je.event in try_fixup().
Also remove support for overwriting event code for events using std arch
events, as it is not used.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-By: Kajol Jain<kjain@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/1602170368-11892-1-git-send-email-john.garry@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This patch enables perf-diff with "--stream" option.
"--stream": Enable hot streams comparison
Now let's see example.
perf record -b ... Generate perf.data.old with branch data
perf record -b ... Generate perf.data with branch data
perf diff --stream
[ Matched hot streams ]
hot chain pair 1:
cycles: 1, hits: 27.77% cycles: 1, hits: 9.24%
--------------------------- --------------------------
main div.c:39 main div.c:39
main div.c:44 main div.c:44
hot chain pair 2:
cycles: 34, hits: 20.06% cycles: 27, hits: 16.98%
--------------------------- --------------------------
__random_r random_r.c:360 __random_r random_r.c:360
__random_r random_r.c:388 __random_r random_r.c:388
__random_r random_r.c:388 __random_r random_r.c:388
__random_r random_r.c:380 __random_r random_r.c:380
__random_r random_r.c:357 __random_r random_r.c:357
__random random.c:293 __random random.c:293
__random random.c:293 __random random.c:293
__random random.c:291 __random random.c:291
__random random.c:291 __random random.c:291
__random random.c:291 __random random.c:291
__random random.c:288 __random random.c:288
rand rand.c:27 rand rand.c:27
rand rand.c:26 rand rand.c:26
rand@plt rand@plt
rand@plt rand@plt
compute_flag div.c:25 compute_flag div.c:25
compute_flag div.c:22 compute_flag div.c:22
main div.c:40 main div.c:40
main div.c:40 main div.c:40
main div.c:39 main div.c:39
hot chain pair 3:
cycles: 9, hits: 4.48% cycles: 6, hits: 4.51%
--------------------------- --------------------------
__random_r random_r.c:360 __random_r random_r.c:360
__random_r random_r.c:388 __random_r random_r.c:388
__random_r random_r.c:388 __random_r random_r.c:388
__random_r random_r.c:380 __random_r random_r.c:380
[ Hot streams in old perf data only ]
hot chain 1:
cycles: 18, hits: 6.75%
--------------------------
__random_r random_r.c:360
__random_r random_r.c:388
__random_r random_r.c:388
__random_r random_r.c:380
__random_r random_r.c:357
__random random.c:293
__random random.c:293
__random random.c:291
__random random.c:291
__random random.c:291
__random random.c:288
rand rand.c:27
rand rand.c:26
rand@plt
rand@plt
compute_flag div.c:25
compute_flag div.c:22
main div.c:40
hot chain 2:
cycles: 29, hits: 2.78%
--------------------------
compute_flag div.c:22
main div.c:40
main div.c:40
main div.c:39
[ Hot streams in new perf data only ]
hot chain 1:
cycles: 4, hits: 4.54%
--------------------------
main div.c:42
compute_flag div.c:28
hot chain 2:
cycles: 5, hits: 3.51%
--------------------------
main div.c:39
main div.c:44
main div.c:42
compute_flag div.c:28
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-8-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
We show the streams separately. They are divided into different sections.
1. "Matched hot streams"
2. "Hot streams in old perf data only"
3. "Hot streams in new perf data only".
For each stream, we report the cycles and hot percent (hits%).
For example,
cycles: 2, hits: 4.08%
--------------------------
main div.c:42
compute_flag div.c:28
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-7-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
We have used callchain_node->hit to measure the hot level of one stream.
This patch calculates the sum of hits of total streams.
Thus in next patch, we can use following formula to report hot percent
for one stream.
hot percent = callchain_node->hit / sum of total hits
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-6-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
In previous patch, we have created an evsel_streams for one event, and
top N hottest streams will be saved in a stream array in evsel_streams.
This patch compares total streams among two evsel_streams.
Once two streams are fully matched, they will be linked as a pair. From
the pair, we can know which streams are matched.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-5-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Stream is the branch history which is aggregated by the branch records
from perf samples. Now we support the callchain as stream.
If the callchain entries of one stream are fully matched with the
callchain entries of another stream, we think two streams are matched.
For example,
cycles: 1, hits: 26.80% cycles: 1, hits: 27.30%
----------------------- -----------------------
main div.c:39 main div.c:39
main div.c:44 main div.c:44
Above two streams are matched (we don't consider the case that source
code is changed).
The matching logic is, compare the chain string first. If it's not
matched, fallback to dso address comparison.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-4-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
In previous patch, we have created evsel_streams array.
This patch returns the specified evsel_streams according to the
evsel_idx.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-3-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
We define a stream as the branch history which is aggregated by the
branch records from perf samples. For example, the callchains aggregated
from the branch records are considered as streams. By browsing the hot
stream, we can understand the hot code path.
Now we only support the callchain for stream. For measuring the hot
level for a stream, we use the callchain_node->hit, higher is hotter.
There may be many callchains sampled so we only focus on the top N
hottest callchains. N is a user defined parameter or predefined default
value (nr_streams_max).
This patch creates an evsel_streams array per event, and saves the top N
hottest streams in a stream array.
So now we can get the per-event top N hottest streams.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20201009022845.13141-2-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Document the higher level --insn-trace etc. perf script options.
Include the howto how to build xed into the manpage
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Link: http://lore.kernel.org/lkml/20201014035346.4772-1-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Peter suggested that using the exclusive mode in perf could avoid some
problems with bad scheduling of groups. Exclusive is implemented in the
kernel, but wasn't exposed by the perf tool, so hard to use without
custom low level API users.
Add support for marking groups or events with :e for exclusive in the
perf tool. The implementation is basically the same as the existing
pinned attribute.
Committer testing:
# perf test "parse event"
6: Parse event definition strings : Ok
# perf test -v "parse event" |& grep :u*e
running test 56 'instructions:uep'
running test 57 '{cycles,cache-misses,branch-misses}:e'
#
#
# grep "model name" -m1 /proc/cpuinfo
model name : AMD Ryzen 9 3900X 12-Core Processor
#
# perf stat -a -e '{cycles,cache-misses,branch-misses}:e' sleep 1
Performance counter stats for 'system wide':
<not counted> cycles (0.00%)
<not counted> cache-misses (0.00%)
<not counted> branch-misses (0.00%)
1.001269893 seconds time elapsed
Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog
# echo 0 > /proc/sys/kernel/nmi_watchdog
# perf stat -a -e '{cycles,cache-misses,branch-misses}:e' sleep 1
Performance counter stats for 'system wide':
1,298,663,141 cycles
30,962,215 cache-misses
5,325,150 branch-misses
1.001474934 seconds time elapsed
#
# The output for asking for precise events on AMD needs to improve, it
# supposedly works only for system wide or per CPU
#
# perf stat -a -e '{cycles,cache-misses,branch-misses}:uep' sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cycles).
/bin/dmesg | grep -i perf may provide additional information.
# perf stat -a -e '{cycles,cache-misses,branch-misses}:ue' sleep 1
Performance counter stats for 'system wide':
746,363,126 cycles
16,881,611 cache-misses
2,871,259 branch-misses
1.001636066 seconds time elapsed
#
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20201014144255.22699-1-andi@firstfloor.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Add a test for the build id cache that adds a binary with sha1 and md5
build ids and verifies it's added properly.
The test updates build id cache with 'perf record' and 'perf buildid-cache -a'.
Committer testing:
# perf test "build id"
82: build id cache operations : Ok
#
# perf test -v "build id"
82: build id cache operations :
--- start ---
test child forked, pid 447218
test binaries: /tmp/perf.ex.SHA1.B8I /tmp/perf.ex.MD5.7Nv
Adding d1abc1eb7568358cf23c959566f23462461834d1 /tmp/perf.ex.SHA1.B8I: Ok
build id: d1abc1eb7568358cf23c959566f23462461834d1
link: /tmp/perf.debug.sS2/.build-id/d1/abc1eb7568358cf23c959566f23462461834d1
file: /tmp/perf.debug.sS2/.build-id/d1/../../tmp/perf.ex.SHA1.B8I/d1abc1eb7568358cf23c959566f23462461834d1/elf
OK for /tmp/perf.ex.SHA1.B8I
Adding a50e350e97c43b4708d09bcd85ebfff7 /tmp/perf.ex.MD5.7Nv: Ok
build id: a50e350e97c43b4708d09bcd85ebfff7
link: /tmp/perf.debug.IuW/.build-id/a5/0e350e97c43b4708d09bcd85ebfff7
file: /tmp/perf.debug.IuW/.build-id/a5/../../tmp/perf.ex.MD5.7Nv/a50e350e97c43b4708d09bcd85ebfff7/elf
OK for /tmp/perf.ex.MD5.7Nv
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.034 MB /tmp/perf.data.xrH ]
build id: d1abc1eb7568358cf23c959566f23462461834d1
link: /tmp/perf.debug.eGR/.build-id/d1/abc1eb7568358cf23c959566f23462461834d1
file: /tmp/perf.debug.eGR/.build-id/d1/../../tmp/perf.ex.SHA1.B8I/d1abc1eb7568358cf23c959566f23462461834d1/elf
OK for /tmp/perf.ex.SHA1.B8I
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.034 MB /tmp/perf.data.cbE ]
build id: a50e350e97c43b4708d09bcd85ebfff7
link: /tmp/perf.debug.82t/.build-id/a5/0e350e97c43b4708d09bcd85ebfff7
file: /tmp/perf.debug.82t/.build-id/a5/../../tmp/perf.ex.MD5.7Nv/a50e350e97c43b4708d09bcd85ebfff7/elf
OK for /tmp/perf.ex.MD5.7Nv
test child finished with 0
---- end ----
build id cache operations: Ok
#
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-10-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
With shorter md5 build ids we need to align their paths properly with
other build ids:
$ perf buildid-list
17f4e448cc746582ea1881528deb549f7fdb3fd5 [kernel.kallsyms]
a50e350e97c43b4708d09bcd85ebfff7 .../tools/perf/buildid-ex-md5
1805c738c8f3ec0f47b7ea09080c28f34d18a82b /usr/lib64/ld-2.31.so
$
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-9-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
We do not store size with build ids in perf data, but there's enough
space to do it. Adding misc bit PERF_RECORD_MISC_BUILD_ID_SIZE to mark
build id event with size.
With this fix the dso with md5 build id will have correct build id data
and will be usable for debuginfod processing if needed (coming in
following patches).
Committer notes:
Use %zu with size_t to fix this error on 32-bit arches:
util/header.c: In function '__event_process_build_id':
util/header.c:2105:3: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'size_t' [-Werror=format=]
pr_debug("build id event received for %s: %s [%lu]\n",
^
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-8-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Passing build_id object to dso__build_id_equal(), so we can properly
check build id with different size than sha1.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-7-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Passing build_id object to dso__set_build_id(), so it's easier
to initialize dos's build id object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Passing build_id object to build_id__sprintf function, so it can operate
with the proper size of build id.
This will create proper md5 build id readable names,
like following:
a50e350e97c43b4708d09bcd85ebfff7
instead of:
a50e350e97c43b4708d09bcd85ebfff700000000
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-5-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Passing build id object to sysfs__read_build_id function, so it can
populate the size of the build_id object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Pass a build_id object to filename__read_build_id function, so it can
populate the size of the build_id object.
Changing filename__read_build_id() code for both ELF/non-ELF code.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Replace build_id byte array with struct build_id object and all the code
that references it.
The objective is to carry size together with build id array, so it's
better to keep both together.
This is preparatory change for following patches, and there's no
functional change.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
We'll use it to ask for extra config files to be loaded, profile like
stuff that will be used first to make 'perf trace' mimic 'strace' output
via a 'perf strace' command that just sets up 'perf trace' output.
At some point it'll be used for regression tests, where we'll run some
simple commands like:
perf strace ls > perf-strace.output
strace ls > strace.output
And then do some mutable syscall arg aware diff like tool to deal with
arguments for things like mmap, that change at each execution, to be
first ignored and then properly tracked when used accoss multiple
syscalls.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Some distros don't come with python2 and only have python3 available.
This causes the "'import perf' in python" self test to fail.
This change adds python3 to the list of possible python versions
that are autodetected but maintains the priorities for
'python2' and 'python' detection. Python3 has the lowest priority.
Committer notes:
On a fedora system without python2 packages the 'perf test python'
continues to work:
# python2
bash: python2: command not found...
Similar command is: 'python'
# rpm -qa | grep python2
#
That "Similar command" gives the clue:
# rpm -qf /usr/bin/python
python-unversioned-command-3.8.5-5.fc32.noarch
# rpm -ql python-unversioned-command
/usr/bin/python
/usr/share/man/man1/python.1.gz
#
With it in place the 'python' binary is found and perf builds the python
binding using python3:
# perf test -v python
19: 'import perf' in python :
--- start ---
test child forked, pid 379988
python usage test: "echo "import sys ; sys.path.append('/tmp/build/perf/python'); import perf" | '/usr/bin/python' "
test child finished with 0
---- end ----
'import perf' in python: Ok
#
Looking at that path:
# ls -la /tmp/build/perf/python
total 1864
drwxrwxr-x. 2 acme acme 60 Oct 13 16:20 .
drwxrwxr-x. 18 acme acme 4420 Oct 13 16:28 ..
-rwxrwxr-x. 1 acme acme 1907216 Oct 13 16:28 perf.cpython-38-x86_64-linux-gnu.so
#
And:
# ldd ~/bin/perf | grep python
libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f5471187000)
#
As soon as we remove it:
# rpm -e python-unversioned-command-3.8.5-5.fc32.noarch
# hash -r
# python
bash: python: command not found...
Install package 'python-unversioned-command' to provide command 'python'? [N/y] n
#
And rebuilding perf now doesn't find python in the system:
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j24' parallel build
<SNIP>
Makefile.config:786: No python interpreter was found: disables Python support - please install python-devel/python-dev
<SNIP>
After this patch:
$ rpm -qi python-unversioned-command
package python-unversioned-command is not installed
$
$ python
bash: python: command not found...
Install package 'python-unversioned-command' to provide command 'python'? [N/y] ^C
$
$ m
make: Entering directory '/home/acme/git/perf/tools/perf'
BUILD: Doing 'make -j24' parallel build
<SNIP>
CC /tmp/build/perf/tests/attr.o
CC /tmp/build/perf/tests/python-use.o
DESCEND plugins
GEN /tmp/build/perf/python/perf.so
INSTALL trace_plugins
LD /tmp/build/perf/tests/perf-in.o
LD /tmp/build/perf/perf-in.o
LINK /tmp/build/perf/perf
<SNIP>
make: Leaving directory '/home/acme/git/perf/tools/perf'
19: 'import perf' in python : Ok
$ ldd ~/bin/perf | grep python
libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007f2c8c708000)
$ ls -la /tmp/build/perf/python
total 1864
drwxrwxr-x. 2 acme acme 60 Oct 13 16:20 .
drwxrwxr-x. 18 acme acme 4420 Oct 13 16:31 ..
-rwxrwxr-x. 1 acme acme 1907216 Oct 13 16:31 perf.cpython-38-x86_64-linux-gnu.so
$
Signed-off-by: James Clark <james.clark@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LPU-Reference: 20201005080645.6588-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
To help figure out where it is getting the binding.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Currently BUILD_BUG() macro is expanded to smth like the following:
do {
extern void __compiletime_assert_0(void)
__attribute__((error("BUILD_BUG failed")));
if (!(!(1)))
__compiletime_assert_0();
} while (0);
If used in a function body this obviously would produce build errors
with -Wnested-externs and -Werror.
To enable BUILD_BUG() usage in tools/arch/x86/lib/insn.c which perf
includes in intel-pt-decoder, build perf without -Wnested-externs.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # build tested
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/lkml/patch-1.thread-251403.git-2514037e9477.your-ad-here.call-01602244460-ext-7088@work.hours
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
'perf trace ls' started crashing after commit d21cb73a9025 on
!HAVE_SYSCALL_TABLE_SUPPORT configs (armv7l here) like this:
0 strlen () at ../sysdeps/arm/armv6t2/strlen.S:126
1 0xb6800780 in __vfprintf_internal (s=0xbeff9908, s@entry=0xbeff9900, format=0xa27160 "]: %s()", ap=..., mode_flags=<optimized out>) at vfprintf-internal.c:1688
...
5 0x0056ecdc in fprintf (__fmt=0xa27160 "]: %s()", __stream=<optimized out>) at /usr/include/bits/stdio2.h:100
6 trace__sys_exit (trace=trace@entry=0xbeffc710, evsel=evsel@entry=0xd968d0, event=<optimized out>, sample=sample@entry=0xbeffc3e8) at builtin-trace.c:2475
7 0x00566d40 in trace__handle_event (sample=0xbeffc3e8, event=<optimized out>, trace=0xbeffc710) at builtin-trace.c:3122
...
15 main (argc=2, argv=0xbefff6e8) at perf.c:538
It is because memset in trace__read_syscall_info zeroes wrong memory:
1) when initializing for the first time, it does not reset the last id.
2) in other cases, it resets the last id of previous buffer.
ad 1) it causes the crash above as sc->name used in the fprintf above
contains garbage.
ad 2) it sets nonexistent from true back to false for id 11 here. Not
sure, what the consequences are.
So fix it by introducing a special case for the initial initialization
and do the right +1 in both cases.
Fixes: d21cb73a9025 ("perf trace: Grow the syscall table as needed when using libaudit")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20201001093419.15761-1-jslaby@suse.cz
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Since commit b027cc6fdf1b ("perf c2c: Fix 'perf c2c record -e list' to
show the default events used"), "perf c2c" tool can show the memory
events properly, it's no reason to still suggest user to use the
command "perf mem record -e list" for showing events.
This patch updates the usage for showing memory events with command
"perf c2c record -e list".
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20201011121022.22409-1-leo.yan@linaro.org
|
|
To pick fixes that missed v5.9.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
There are internal library functions, which are not declared as a static.
They are used inside the library from different files. Hide them from
the library users, as they are not part of the API.
These functions are made hidden and are renamed without the prefix "tep_":
tep_free_plugin_paths
tep_peek_char
tep_buffer_init
tep_get_input_buf_ptr
tep_get_input_buf
tep_read_token
tep_free_token
tep_free_event
tep_free_format_field
__tep_parse_format
Link: https://lore.kernel.org/linux-trace-devel/e4afdd82deb5e023d53231bb13e08dca78085fb0.camel@decadent.org.uk/
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: linux-trace-devel@vger.kernel.org
Link: http://lore.kernel.org/lkml/20200930110733.280534-1-tz.stoyanov@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The 'perf sched latency' tool is really useful at showing worst-case
latencies that task encountered since wakeup. However it shows only the
end of the latency. Often times the start of a latency is interesting as
it can show what else was going on at the time to cause the latency. I
certainly myself spending a lot of time backtracking to the start of the
latency in "perf sched script" which wastes a lot of time.
This patch therefore adds a new column "Max delay start". Considering
this, also rename "Maximum delay at" to "Max delay end" as its easier to
understand.
Example of the new output:
----------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
----------------------------------------------------------------------------------------------------------------------------------
MediaScannerSer:11936 | 651.296 ms | 67978 | avg: 0.113 ms | max: 77.250 ms | max start: 477.691360 s | max end: 477.768610 s
audio@2.0-servi:(3) | 0.000 ms | 3440 | avg: 0.034 ms | max: 72.267 ms | max start: 477.697051 s | max end: 477.769318 s
AudioOut_1D:8112 | 0.000 ms | 2588 | avg: 0.083 ms | max: 64.020 ms | max start: 477.710740 s | max end: 477.774760 s
Time-limited te:14973 | 7966.090 ms | 24807 | avg: 0.073 ms | max: 15.563 ms | max start: 477.162746 s | max end: 477.178309 s
surfaceflinger:8049 | 9.680 ms | 603 | avg: 0.063 ms | max: 13.275 ms | max start: 476.931791 s | max end: 476.945067 s
HeapTaskDaemon:(3) | 1588.830 ms | 7040 | avg: 0.065 ms | max: 6.880 ms | max start: 473.666043 s | max end: 473.672922 s
mount-passthrou:(3) | 1370.809 ms | 68904 | avg: 0.011 ms | max: 6.524 ms | max start: 478.090630 s | max end: 478.097154 s
ReferenceQueueD:(3) | 11.794 ms | 1725 | avg: 0.014 ms | max: 6.521 ms | max start: 476.119782 s | max end: 476.126303 s
writer:14077 | 18.410 ms | 1427 | avg: 0.036 ms | max: 6.131 ms | max start: 474.169675 s | max end: 474.175805 s
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20200925235634.4089867-1-joel@joelfernandes.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This replaces the incorrectly spelled word "localtion" with "location"
in some power8 PMU event descriptions.
Fixes: 2a81fa3bb5ed ("perf vendor events: Add power8 PMU events")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Link: http://lore.kernel.org/lkml/20201012050205.328523-1-sandipan@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
For comparison, it now runs the benchmark twice - one if regular -b and
another for --buildid-all.
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 21.002 msec (+- 0.172 msec)
Average time per event: 2.059 usec (+- 0.017 usec)
Average memory usage: 8169 KB (+- 0 KB)
Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
Average time per event: 1.916 usec (+- 0.012 usec)
Average memory usage: 7348 KB (+- 0 KB)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-7-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Like 'perf record', we can even more speedup build-id processing by just
using all DSOs. Then we don't need to look at all the sample events
anymore. The following patch will update 'perf bench' to show the result
of the --buildid-all option too.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Original-patch-by: Stephane Eranian <eranian@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
No need to load symbols in a DSO when injecting build-id. I guess the
reason was to check the DSO is a special file like anon files. Use some
helper functions in map.c to check them before reading build-id. Also
pass sample event's cpumode to a new build-id event.
It brought a speedup in the benchmark of 25 -> 21 msec on my laptop.
Also the memory usage (Max RSS) went down by ~200 KB.
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 21.389 msec (+- 0.138 msec)
Average time per event: 2.097 usec (+- 0.014 usec)
Average memory usage: 8225 KB (+- 0 KB)
Committer notes:
Before:
$ perf stat -r5 perf bench internals inject-build-id > /dev/null
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
4,020.56 msec task-clock:u # 1.271 CPUs utilized ( +- 0.74% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
123,354 page-faults:u # 0.031 M/sec ( +- 0.81% )
7,119,951,568 cycles:u # 1.771 GHz ( +- 1.74% ) (83.27%)
230,086,969 stalled-cycles-frontend:u # 3.23% frontend cycles idle ( +- 1.97% ) (83.41%)
1,168,298,765 stalled-cycles-backend:u # 16.41% backend cycles idle ( +- 1.13% ) (83.44%)
11,173,083,669 instructions:u # 1.57 insn per cycle
# 0.10 stalled cycles per insn ( +- 1.58% ) (83.31%)
2,413,908,936 branches:u # 600.392 M/sec ( +- 1.69% ) (83.26%)
46,576,289 branch-misses:u # 1.93% of all branches ( +- 2.20% ) (83.31%)
3.1638 +- 0.0309 seconds time elapsed ( +- 0.98% )
$
After:
$ perf stat -r5 perf bench internals inject-build-id > /dev/null
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
2,379.94 msec task-clock:u # 1.473 CPUs utilized ( +- 0.18% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
62,584 page-faults:u # 0.026 M/sec ( +- 0.07% )
2,372,389,668 cycles:u # 0.997 GHz ( +- 0.29% ) (83.14%)
106,937,862 stalled-cycles-frontend:u # 4.51% frontend cycles idle ( +- 4.89% ) (83.20%)
581,697,915 stalled-cycles-backend:u # 24.52% backend cycles idle ( +- 0.71% ) (83.47%)
3,659,692,199 instructions:u # 1.54 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.10% ) (83.63%)
791,372,961 branches:u # 332.518 M/sec ( +- 0.27% ) (83.39%)
10,648,083 branch-misses:u # 1.35% of all branches ( +- 0.22% ) (83.16%)
1.61570 +- 0.00172 seconds time elapsed ( +- 0.11% )
$
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Original-patch-by: Stephane Eranian <eranian@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
It should be in a proper mnt namespace when accessing the file.
I think this had no problem since the build-id was actually read from
map__load() -> dso__load() already. But I'd like to change it in the
following commit.
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
I found some events (like PERF_RECORD_CGROUP) are not copied by perf
inject due to the missing callbacks. Let's add them.
While at it, I've changed the order of the callbacks to match with
struct perf_tool so that we can compare them easily.
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Sometimes I can see that 'perf record' piped with 'perf inject' take a
long time processing build-ids.
So introduce a inject-build-id benchmark to the internals benchmark
suite to measure its overhead regularly.
It runs the 'perf inject' command internally and feeds the given number
of synthesized events (MMAP2 + SAMPLE basically).
Usage: perf bench internals inject-build-id <options>
-i, --iterations <n> Number of iterations used to compute average (default: 100)
-m, --nr-mmaps <n> Number of mmap events for each iteration (default: 100)
-n, --nr-samples <n> Number of sample events per mmap event (default: 100)
-v, --verbose be more verbose (show iteration count, DSO name, etc)
By default, it measures average processing time of 100 MMAP2 events
and 10000 SAMPLE events. Below is a result on my laptop.
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 25.789 msec (+- 0.202 msec)
Average time per event: 2.528 usec (+- 0.020 usec)
Average memory usage: 8411 KB (+- 7 KB)
Committer testing:
$ perf bench
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]
# List of all available benchmark collections:
sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
numa: NUMA scheduling and MM benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks
$ perf bench internals
# List of available benchmarks for collection 'internals':
synthesize: Benchmark perf event synthesis
kallsyms-parse: Benchmark kallsyms parsing
inject-build-id: Benchmark build-id injection
$ perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.202 msec (+- 0.059 msec)
Average time per event: 1.392 usec (+- 0.006 usec)
Average memory usage: 12650 KB (+- 10 KB)
Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
Average time per event: 1.258 usec (+- 0.007 usec)
Average memory usage: 11895 KB (+- 10 KB)
$
$ perf stat -r5 perf bench internals inject-build-id
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.380 msec (+- 0.056 msec)
Average time per event: 1.410 usec (+- 0.006 usec)
Average memory usage: 12608 KB (+- 11 KB)
Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
Average time per event: 1.166 usec (+- 0.006 usec)
Average memory usage: 11838 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.246 msec (+- 0.065 msec)
Average time per event: 1.397 usec (+- 0.006 usec)
Average memory usage: 12744 KB (+- 10 KB)
Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
Average time per event: 1.178 usec (+- 0.006 usec)
Average memory usage: 11963 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.321 msec (+- 0.067 msec)
Average time per event: 1.404 usec (+- 0.007 usec)
Average memory usage: 12690 KB (+- 10 KB)
Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
Average time per event: 1.168 usec (+- 0.004 usec)
Average memory usage: 11938 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.287 msec (+- 0.059 msec)
Average time per event: 1.401 usec (+- 0.006 usec)
Average memory usage: 12864 KB (+- 10 KB)
Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
Average time per event: 1.163 usec (+- 0.006 usec)
Average memory usage: 12103 KB (+- 10 KB)
# Running 'internals/inject-build-id' benchmark:
Average build-id injection took: 14.402 msec (+- 0.053 msec)
Average time per event: 1.412 usec (+- 0.005 usec)
Average memory usage: 12876 KB (+- 10 KB)
Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
Average time per event: 1.159 usec (+- 0.006 usec)
Average memory usage: 12111 KB (+- 10 KB)
Performance counter stats for 'perf bench internals inject-build-id' (5 runs):
4,267.48 msec task-clock:u # 1.502 CPUs utilized ( +- 0.14% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
102,092 page-faults:u # 0.024 M/sec ( +- 0.08% )
3,894,589,578 cycles:u # 0.913 GHz ( +- 0.19% ) (83.49%)
140,078,421 stalled-cycles-frontend:u # 3.60% frontend cycles idle ( +- 0.77% ) (83.34%)
948,581,189 stalled-cycles-backend:u # 24.36% backend cycles idle ( +- 0.46% ) (83.25%)
5,835,587,719 instructions:u # 1.50 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.21% ) (83.24%)
1,267,423,636 branches:u # 296.996 M/sec ( +- 0.22% ) (83.12%)
17,484,290 branch-misses:u # 1.38% of all branches ( +- 0.12% ) (83.55%)
2.84176 +- 0.00222 seconds time elapsed ( +- 0.08% )
$
Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
It was reported that 'perf stat' crashed when using with armv8_pmu (CPU)
events with the task mode. As 'perf stat' uses an empty cpu map for
task mode but armv8_pmu has its own cpu mask, it has confused which map
it should use when accessing file descriptors and this causes segfaults:
(gdb) bt
#0 0x0000000000603fc8 in perf_evsel__close_fd_cpu (evsel=<optimized out>,
cpu=<optimized out>) at evsel.c:122
#1 perf_evsel__close_cpu (evsel=evsel@entry=0x716e950, cpu=7) at evsel.c:156
#2 0x00000000004d4718 in evlist__close (evlist=0x70a7cb0) at util/evlist.c:1242
#3 0x0000000000453404 in __run_perf_stat (argc=3, argc@entry=1, argv=0x30,
argv@entry=0xfffffaea2f90, run_idx=119, run_idx@entry=1701998435)
at builtin-stat.c:929
#4 0x0000000000455058 in run_perf_stat (run_idx=1701998435, argv=0xfffffaea2f90,
argc=1) at builtin-stat.c:947
#5 cmd_stat (argc=1, argv=0xfffffaea2f90) at builtin-stat.c:2357
#6 0x00000000004bb888 in run_builtin (p=p@entry=0x9764b8 <commands+288>,
argc=argc@entry=4, argv=argv@entry=0xfffffaea2f90) at perf.c:312
#7 0x00000000004bbb54 in handle_internal_command (argc=argc@entry=4,
argv=argv@entry=0xfffffaea2f90) at perf.c:364
#8 0x0000000000435378 in run_argv (argcp=<synthetic pointer>,
argv=<synthetic pointer>) at perf.c:408
#9 main (argc=4, argv=0xfffffaea2f90) at perf.c:538
To fix this, I simply used the given cpu map unless the evsel actually
is not a system-wide event (like uncore events).
Fixes: 7736627b865d ("perf stat: Use affinity for closing file descriptors")
Reported-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201007081311.1831003-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Hagen reported broken strings in python3 tracepoint scripts:
make PYTHON=python3
perf record -e sched:sched_switch -a -- sleep 5
perf script --gen-script py
perf script -s ./perf-script.py
[..]
sched__sched_switch 7 563231.759525792 0 swapper prev_comm=bytearray(b'swapper/7\x00\x00\x00\x00\x00\x00\x00'), prev_pid=0, prev_prio=120, prev_state=, next_comm=bytearray(b'mutex-thread-co\x00'),
The problem is in the is_printable_array function that does not take the
zero byte into account and claim such string as not printable, so the
code will create byte array instead of string.
Committer testing:
After this fix:
sched__sched_switch 3 484522.497072626 1158680 kworker/3:0-eve prev_comm=kworker/3:0, prev_pid=1158680, prev_prio=120, prev_state=I, next_comm=swapper/3, next_pid=0, next_prio=120
Sample: {addr=0, cpu=3, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1158680, tid=1158680, time=484522497072626, transaction=0, values=[(0, 0)], weight=0}
sched__sched_switch 4 484522.497085610 1225814 perf prev_comm=perf, prev_pid=1225814, prev_prio=120, prev_state=, next_comm=migration/4, next_pid=30, next_prio=0
Sample: {addr=0, cpu=4, datasrc=84410401, datasrc_decode=N/A|SNP N/A|TLB N/A|LCK N/A, ip=18446744071841817196, period=1, phys_addr=0, pid=1225814, tid=1225814, time=484522497085610, transaction=0, values=[(0, 0)], weight=0}
Fixes: 249de6e07458 ("perf script python: Fix string vs byte array resolving")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20200928201135.3633850-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
No change in behaviour:
# perf trace -e mmap sleep 1
0.000 ( 0.009 ms): sleep/751870 mmap(len: 143317, prot: READ, flags: PRIVATE, fd: 3) = 0x7fa96d0f7000
0.028 ( 0.004 ms): sleep/751870 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7fa96d0f5000
0.037 ( 0.005 ms): sleep/751870 mmap(len: 1872744, prot: READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa96cf2b000
0.044 ( 0.011 ms): sleep/751870 mmap(addr: 0x7fa96cf50000, len: 1376256, prot: READ|EXEC, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x25000) = 0x7fa96cf50000
0.056 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0a0000, len: 307200, prot: READ, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x175000) = 0x7fa96d0a0000
0.064 ( 0.007 ms): sleep/751870 mmap(addr: 0x7fa96d0eb000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 0x1bf000) = 0x7fa96d0eb000
0.075 ( 0.005 ms): sleep/751870 mmap(addr: 0x7fa96d0f1000, len: 13160, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7fa96d0f1000
0.253 ( 0.005 ms): sleep/751870 mmap(len: 218049136, prot: READ, flags: PRIVATE, fd: 3) = 0x7fa95ff38000
#
#
# set -o vi
# strace -e mmap sleep 1
mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f333bd83000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f333bd81000
mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f333bbb7000
mmap(0x7f333bbdc000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f333bbdc000
mmap(0x7f333bd2c000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f333bd2c000
mmap(0x7f333bd77000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f333bd77000
mmap(0x7f333bd7d000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f333bd7d000
mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f332ebc4000
+++ exited with 0 +++
#
And you can as well tweak 'perf trace's output to more closely match
strace's:
# perf config trace.show_arg_names=no
# perf config trace.show_duration=no
# perf config trace.show_prefix=yes
# perf config trace.show_timestamp=no
# perf config trace.show_zeros=yes
# perf config trace.no_inherit=yes
# perf trace -e mmap sleep 1
mmap(NULL, 143317, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d287ca000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS) = 0x7f0d287c8000
mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0d285fe000
mmap(0x7f0d28623000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0d28623000
mmap(0x7f0d28773000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f0d28773000
mmap(0x7f0d287be000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f0d287be000
mmap(0x7f0d287c4000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS) = 0x7f0d287c4000
mmap(NULL, 218049136, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d1b60b000
#
# perf config | grep ^trace
trace.show_arg_names=no
trace.show_duration=no
trace.show_prefix=yes
trace.show_timestamp=no
trace.show_zeros=yes
trace.no_inherit=yes
#
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Will be wired up in the following csets:
$ tools/perf/trace/beauty/mmap_prot.sh
static const char *mmap_prot[] = {
[ilog2(0x1) + 1] = "READ",
#ifndef PROT_READ
#define PROT_READ 0x1
#endif
[ilog2(0x2) + 1] = "WRITE",
#ifndef PROT_WRITE
#define PROT_WRITE 0x2
#endif
[ilog2(0x4) + 1] = "EXEC",
#ifndef PROT_EXEC
#define PROT_EXEC 0x4
#endif
[ilog2(0x8) + 1] = "SEM",
#ifndef PROT_SEM
#define PROT_SEM 0x8
#endif
[ilog2(0x01000000) + 1] = "GROWSDOWN",
#ifndef PROT_GROWSDOWN
#define PROT_GROWSDOWN 0x01000000
#endif
[ilog2(0x02000000) + 1] = "GROWSUP",
#ifndef PROT_GROWSUP
#define PROT_GROWSUP 0x02000000
#endif
};
$
$
$
$ tools/perf/trace/beauty/mmap_prot.sh alpha
static const char *mmap_prot[] = {
[ilog2(0x4) + 1] = "EXEC",
#ifndef PROT_EXEC
#define PROT_EXEC 0x4
#endif
[ilog2(0x01000000) + 1] = "GROWSDOWN",
#ifndef PROT_GROWSDOWN
#define PROT_GROWSDOWN 0x01000000
#endif
[ilog2(0x02000000) + 1] = "GROWSUP",
#ifndef PROT_GROWSUP
#define PROT_GROWSUP 0x02000000
#endif
[ilog2(0x1) + 1] = "READ",
#ifndef PROT_READ
#define PROT_READ 0x1
#endif
[ilog2(0x8) + 1] = "SEM",
#ifndef PROT_SEM
#define PROT_SEM 0x8
#endif
[ilog2(0x2) + 1] = "WRITE",
#ifndef PROT_WRITE
#define PROT_WRITE 0x2
#endif
};
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
So that in older systems we get it in the mmap flags scnprintf routines:
$ tools/perf/trace/beauty/mmap_flags.sh | head -9 2> /dev/null
static const char *mmap_flags[] = {
[ilog2(0x40) + 1] = "32BIT",
#ifndef MAP_32BIT
#define MAP_32BIT 0x40
#endif
[ilog2(0x01) + 1] = "SHARED",
#ifndef MAP_SHARED
#define MAP_SHARED 0x01
#endif
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
table
It'll also conditionally generate the defines, so that if we don't have
those when building a new tool tarball in an older systems, we get
those, and we need them sometimes in the actual scnprintf routine, such
as when checking if a flags means we have an extra arg, like with
MREMAP_FIXED.
$ tools/perf/trace/beauty/mremap_flags.sh
static const char *mremap_flags[] = {
[ilog2(1) + 1] = "MAYMOVE",
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
#endif
[ilog2(2) + 1] = "FIXED",
#ifndef MREMAP_FIXED
#define MREMAP_FIXED 2
#endif
[ilog2(4) + 1] = "DONTUNMAP",
#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4
#endif
};
$
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|