Hello,
I got a report that the overhead of perf lock contention is too big in
some cases. It was running the task aggregation mode (-t) at the time
and there were lots of tasks contending with each other.
It turned out that the hash map update is the problem. The result is
saved in the lock_stat hash map, which is pre-allocated. The BPF program
never deletes data from the map; it only adds entries. But once the map
is full, trying to update it becomes a very heavy operation, since it
needs to check every CPU's freelist to get a new node to save the result.
But we already know it'd fail when the map is full, so there is no need
to attempt the update then.
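The fix in the last patch is basically to remember, inside the BPF
program, that the map is full and to skip further update attempts. A
minimal sketch of the idea (simplified: the record() helper and the
__u64 key/value are illustrative only; the real code in
lock_contention.bpf.c uses struct contention_key and struct
contention_data):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <asm-generic/errno-base.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(key_size, sizeof(__u64));
          __uint(value_size, sizeof(__u64));
          __uint(max_entries, 16384);
  } lock_stat SEC(".maps");

  int data_map_full;   /* set once an insert is rejected with -E2BIG */
  int data_fail;       /* number of lost updates, reported to userspace */

  static inline void record(__u64 *key, __u64 *val)
  {
          int err;

          if (data_map_full) {
                  /* map is known to be full: skip the expensive freelist walk */
                  __sync_fetch_and_add(&data_fail, 1);
                  return;
          }

          err = bpf_map_update_elem(&lock_stat, key, val, BPF_NOEXIST);
          if (err < 0) {
                  if (err == -E2BIG)
                          data_map_full = 1;
                  __sync_fetch_and_add(&data_fail, 1);
          }
  }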
I've checked it on my 64 CPU machine with this.
$ perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 2.825 [sec]
And I used the task mode so that the map is guaranteed to be full.
The default map size is 16K entries and this workload has 40K tasks.
Before:
$ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 11.299 [sec]
 contended    total wait      max wait      avg wait       pid   comm
     19284        3.51 s       3.70 ms     181.91 us   1305863   sched-messaging
       243      84.09 ms     466.67 us     346.04 us   1336608   sched-messaging
       177      66.35 ms      12.08 ms     374.88 us   1220416   node
After:
$ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 3.044 [sec]
 contended    total wait      max wait      avg wait       pid   comm
     18743     591.92 ms     442.96 us      31.58 us   1431454   sched-messaging
        51     210.64 ms     207.45 ms       4.13 ms   1468724   sched-messaging
        81      68.61 ms      65.79 ms     847.07 us   1463183   sched-messaging
=== output for debug ===
bad: 1164137, total: 2253341
bad rate: 51.66 %
histogram of failure reasons
task: 0
stack: 0
time: 0
data: 1164137
The first few patches are small cleanups and fixes. You can get the code
from 'perf/lock-map-v1' branch in
git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
Thanks,
Namhyung
Namhyung Kim (7):
perf lock contention: Simplify parse_lock_type()
perf lock contention: Use -M for --map-nr-entries
perf lock contention: Update default map size to 16384
perf lock contention: Add data failure stat
perf lock contention: Update total/bad stats for hidden entries
perf lock contention: Revise needs_callstack() condition
perf lock contention: Do not try to update if hash map is full
tools/perf/Documentation/perf-lock.txt | 4 +-
tools/perf/builtin-lock.c | 64 ++++++++-----------
tools/perf/util/bpf_lock_contention.c | 7 +-
.../perf/util/bpf_skel/lock_contention.bpf.c | 29 +++++++--
tools/perf/util/bpf_skel/lock_data.h | 3 +
tools/perf/util/lock-contention.h | 2 +
6 files changed, 60 insertions(+), 49 deletions(-)
base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
--
2.40.0.577.gac1e443424-goog
The get_type_flag() should check both the 'str' and 'name' fields in the
lock_type_table so that it can find the appropriate flags without the
caller retrying with the ':R' or ':W' suffix.
Also fix a typo in the rt-mutex entry.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-lock.c | 43 ++++++++-------------------------------
1 file changed, 8 insertions(+), 35 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 32ec58fb80e4..695ce6bd64f7 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1557,7 +1557,7 @@ static const struct {
{ LCB_F_SPIN | LCB_F_WRITE, "rwlock:W", "rwlock" },
{ LCB_F_READ, "rwsem:R", "rwsem" },
{ LCB_F_WRITE, "rwsem:W", "rwsem" },
- { LCB_F_RT, "rt=mutex", "rt-mutex" },
+ { LCB_F_RT, "rt-mutex", "rt-mutex" },
{ LCB_F_RT | LCB_F_READ, "rwlock-rt:R", "rwlock-rt" },
{ LCB_F_RT | LCB_F_WRITE, "rwlock-rt:W", "rwlock-rt" },
{ LCB_F_PERCPU | LCB_F_READ, "pcpu-sem:R", "percpu-rwsem" },
@@ -1596,6 +1596,10 @@ static unsigned int get_type_flag(const char *str)
if (!strcmp(lock_type_table[i].name, str))
return lock_type_table[i].flags;
}
+ for (unsigned int i = 0; i < ARRAY_SIZE(lock_type_table); i++) {
+ if (!strcmp(lock_type_table[i].str, str))
+ return lock_type_table[i].flags;
+ }
return UINT_MAX;
}
@@ -2121,46 +2125,15 @@ static int parse_lock_type(const struct option *opt __maybe_unused, const char *
unsigned int flags = get_type_flag(tok);
if (flags == -1U) {
- char buf[32];
-
- if (strchr(tok, ':'))
- continue;
-
- /* try :R and :W suffixes for rwlock, rwsem, ... */
- scnprintf(buf, sizeof(buf), "%s:R", tok);
- flags = get_type_flag(buf);
- if (flags != UINT_MAX) {
- if (!add_lock_type(flags)) {
- ret = -1;
- break;
- }
- }
-
- scnprintf(buf, sizeof(buf), "%s:W", tok);
- flags = get_type_flag(buf);
- if (flags != UINT_MAX) {
- if (!add_lock_type(flags)) {
- ret = -1;
- break;
- }
- }
- continue;
+ pr_err("Unknown lock flags: %s\n", tok);
+ ret = -1;
+ break;
}
if (!add_lock_type(flags)) {
ret = -1;
break;
}
-
- if (!strcmp(tok, "mutex")) {
- flags = get_type_flag("mutex-spin");
- if (flags != UINT_MAX) {
- if (!add_lock_type(flags)) {
- ret = -1;
- break;
- }
- }
- }
}
free(s);
--
2.40.0.577.gac1e443424-goog
The BPF hash map will align the map size to a power of 2, so 10k would
become 16k anyway. Let's use the actual size to avoid confusion.
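Just to illustrate the rounding (a hypothetical helper for illustration;
the kernel uses roundup_pow_of_two()):

  /* next power of 2 at or above n: next_pow2(10240) == 16384 */
  static unsigned long next_pow2(unsigned long n)
  {
          unsigned long p = 1;

          while (p < n)
                  p <<= 1;
          return p;
  }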
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-lock.txt | 3 ++-
tools/perf/builtin-lock.c | 2 +-
tools/perf/util/bpf_skel/lock_contention.bpf.c | 5 +----
tools/perf/util/bpf_skel/lock_data.h | 3 +++
4 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
index b5e5d088d51c..6e5ba3cd2b72 100644
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@@ -157,7 +157,8 @@ CONTENTION OPTIONS
-M::
--map-nr-entries=<value>::
- Maximum number of BPF map entries (default: 10240).
+ Maximum number of BPF map entries (default: 16384).
+ This will be aligned to a power of 2.
--max-stack=<value>::
Maximum stack depth when collecting lock contention (default: 8).
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index ef56bf90058d..9b92c7a5aefb 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -60,7 +60,7 @@ static bool show_thread_stats;
static bool show_lock_addrs;
static bool show_lock_owner;
static bool use_bpf;
-static unsigned long bpf_map_entries = 10240;
+static unsigned long bpf_map_entries = MAX_ENTRIES;
static int max_stack_depth = CONTENTION_STACK_DEPTH;
static int stack_skip = CONTENTION_STACK_SKIP;
static int print_nr_entries = INT_MAX / 2;
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 3f4ee3992e81..f9d2d792ccc8 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -7,9 +7,6 @@
#include "lock_data.h"
-/* default buffer size */
-#define MAX_ENTRIES 10240
-
/* for collect_lock_syms(). 4096 was rejected by the verifier */
#define MAX_CPUS 1024
@@ -63,7 +60,7 @@ struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(key_size, sizeof(__u64));
__uint(value_size, sizeof(__u32));
- __uint(max_entries, 16384);
+ __uint(max_entries, MAX_ENTRIES);
} lock_syms SEC(".maps");
struct {
diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
index 1ba61cb4d480..260062a9f2ab 100644
--- a/tools/perf/util/bpf_skel/lock_data.h
+++ b/tools/perf/util/bpf_skel/lock_data.h
@@ -15,6 +15,9 @@ struct contention_task_data {
char comm[TASK_COMM_LEN];
};
+/* default buffer size */
+#define MAX_ENTRIES 16384
+
/*
* Upper bits of the flags in the contention_data are used to identify
* some well-known locks which do not have symbols (non-global locks).
--
2.40.0.577.gac1e443424-goog
It needs callstacks for two reasons:
* for stack aggregation mode, the map key is the stack id and it can
also show the full stack traces when -v is used
* for other aggregation modes, the stack filter can be used to limit
lock contentions from known call paths
The -v option is meaningful (in terms of stack traces) only for the stack
aggregation mode, so it should not set save_callstack for other modes
like -t or -l.
I've noticed this with the following command line:
$ sudo ./perf lock con -ablv -E 3 -M 16 -- ./perf bench sched messaging
...
 contended    total wait      max wait      avg wait            address   symbol
        88       4.59 ms     108.07 us      52.13 us   ffff935757f46ec0   (spinlock)
        33     905.22 us      73.67 us      27.43 us   ffff935757f41700   (spinlock)
        28     703.69 us      79.28 us      25.13 us   ffff938a3d9b0c80   rq_lock (spinlock)
=== output for debug ===
bad: 12272, total: 12421
bad rate: 98.80 %
histogram of failure reasons
task: 8285
stack: 3987 <---------- here
time: 0
data: 0
It should not have any failures on stacks since it doesn't use them.
No functional change intended.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-lock.c | 2 +-
tools/perf/util/bpf_lock_contention.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 7742fa255c44..4e24351b18bd 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -77,7 +77,7 @@ static enum lock_aggr_mode aggr_mode = LOCK_AGGR_ADDR;
static bool needs_callstack(void)
{
- return verbose > 0 || !list_empty(&callstack_filters);
+ return !list_empty(&callstack_filters);
}
static struct thread_stat *thread_stat_find(u32 tid)
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index ea4f697d2a9f..9e20fa8ade09 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -346,7 +346,7 @@ int lock_contention_read(struct lock_contention *con)
if (data.count)
st->avg_wait_time = data.total_time / data.count;
- if (con->save_callstack) {
+ if (con->aggr_mode == LOCK_AGGR_CALLER && verbose > 0) {
st->callstack = memdup(stack_trace, stack_size);
if (st->callstack == NULL)
break;
--
2.40.0.577.gac1e443424-goog
Updating the data can fail when the lock_stat map is full. We should
check that case and show the number at the end.
$ sudo ./perf lock con -ablv -E3 -- ./perf bench sched messaging
...
 contended    total wait      max wait      avg wait            address   symbol
      6157     208.48 ms      69.29 us      33.86 us   ffff934c001c1f00   (spinlock)
      4030      72.04 ms      61.84 us      17.88 us   ffff934c000415c0   (spinlock)
      3201      50.30 ms      47.73 us      15.71 us   ffff934c2eead850   (spinlock)
=== output for debug ===
bad: 0, total: 13388
bad rate: 0.00 %
histogram of failure reasons
task: 0
stack: 0
time: 0
data: 0 <----- added
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-lock.c | 4 +++-
tools/perf/util/bpf_lock_contention.c | 1 +
tools/perf/util/bpf_skel/lock_contention.bpf.c | 4 +++-
tools/perf/util/lock-contention.h | 1 +
4 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 9b92c7a5aefb..01b318d6c80a 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1626,7 +1626,7 @@ static void sort_contention_result(void)
static void print_bpf_events(int total, struct lock_contention_fails *fails)
{
/* Output for debug, this have to be removed */
- int broken = fails->task + fails->stack + fails->time;
+ int broken = fails->task + fails->stack + fails->time + fails->data;
if (quiet || total == 0 || (broken == 0 && verbose <= 0))
return;
@@ -1640,7 +1640,9 @@ static void print_bpf_events(int total, struct lock_contention_fails *fails)
pr_info(" %10s: %d\n", "task", fails->task);
pr_info(" %10s: %d\n", "stack", fails->stack);
pr_info(" %10s: %d\n", "time", fails->time);
+ pr_info(" %10s: %d\n", "data", fails->data);
}
+
static void print_contention_result(struct lock_contention *con)
{
struct lock_stat *st;
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 8a5d0eb44189..0071058ac3d2 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -262,6 +262,7 @@ int lock_contention_read(struct lock_contention *con)
con->fails.task = skel->bss->task_fail;
con->fails.stack = skel->bss->stack_fail;
con->fails.time = skel->bss->time_fail;
+ con->fails.data = skel->bss->data_fail;
stack_trace = zalloc(stack_size);
if (stack_trace == NULL)
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index f9d2d792ccc8..cb87c98e5340 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -124,6 +124,7 @@ int aggr_mode;
int task_fail;
int stack_fail;
int time_fail;
+int data_fail;
static inline int can_record(u64 *ctx)
{
@@ -380,7 +381,8 @@ int contention_end(u64 *ctx)
if (aggr_mode == LOCK_AGGR_ADDR)
first.flags |= check_lock_type(pelem->lock, pelem->flags);
- bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
+ if (bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST) < 0)
+ __sync_fetch_and_add(&data_fail, 1);
bpf_map_delete_elem(&tstamp, &pid);
return 0;
}
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index 10c28302420c..3ed1cad370fc 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -126,6 +126,7 @@ struct lock_contention_fails {
int task;
int stack;
int time;
+ int data;
};
struct lock_contention {
--
2.40.0.577.gac1e443424-goog
It doesn't delete data from the task_data and lock_stat maps. The data
is kept there until it's consumed by userspace at the end. But it calls
bpf_map_update_elem() again and again, and the update will simply be
discarded if the map is full. This is not good.
Worse, bpf_map_update_elem() keeps trying to get a new node even when the
map is full. I guess that makes sense if it deletes some nodes, like in
the tstamp map (that's why I didn't make the change there).
In a pre-allocated hash map, that means it'd iterate over every CPU's
freelist looking for a free node. And that has a bad performance impact
on large machines.
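For reference, element allocation for a pre-allocated hash map goes
through the per-CPU freelists, roughly like the sketch below (simplified
from memory of kernel/bpf/percpu_freelist.c, not the actual code):

  /* simplified: pop a free node, scanning every CPU's list if needed */
  static struct pcpu_freelist_node *freelist_pop(struct pcpu_freelist *s)
  {
          struct pcpu_freelist_node *node = NULL;
          int cpu;

          /* starts at the current CPU but may visit all of them */
          for_each_cpu_wrap(cpu, cpu_possible_mask, raw_smp_processor_id()) {
                  struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);

                  raw_spin_lock(&head->lock);
                  node = head->first;
                  if (node)
                          head->first = node->next;
                  raw_spin_unlock(&head->lock);
                  if (node)
                          return node;
          }

          /* every freelist is empty: the map update fails with -E2BIG */
          return NULL;
  }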
I've checked it on my 64 CPU machine with this.
$ perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 2.825 [sec]
And I used the task mode so that the map is guaranteed to be full.
The default map size is 16K entries and this workload has 40K tasks.
Before:
$ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 11.299 [sec]
 contended    total wait      max wait      avg wait       pid   comm
     19284        3.51 s       3.70 ms     181.91 us   1305863   sched-messaging
       243      84.09 ms     466.67 us     346.04 us   1336608   sched-messaging
       177      66.35 ms      12.08 ms     374.88 us   1220416   node
For some reason, it didn't report the data failures. But you can see that
the total time of the workload increased a lot (2.8 -> 11.3 sec). If it
fails early when the map is full, it goes back to normal.
After:
$ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 1000 groups == 40000 processes run
Total time: 3.044 [sec]
 contended    total wait      max wait      avg wait       pid   comm
     18743     591.92 ms     442.96 us      31.58 us   1431454   sched-messaging
        51     210.64 ms     207.45 ms       4.13 ms   1468724   sched-messaging
        81      68.61 ms      65.79 ms     847.07 us   1463183   sched-messaging
=== output for debug ===
bad: 1164137, total: 2253341
bad rate: 51.66 %
histogram of failure reasons
task: 0
stack: 0
time: 0
data: 1164137
Signed-off-by: Namhyung Kim <[email protected]>
---
.../perf/util/bpf_skel/lock_contention.bpf.c | 22 ++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index cb87c98e5340..23f6e63544ed 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -4,6 +4,7 @@
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
+#include <asm-generic/errno-base.h>
#include "lock_data.h"
@@ -126,6 +127,9 @@ int stack_fail;
int time_fail;
int data_fail;
+int task_map_full;
+int data_map_full;
+
static inline int can_record(u64 *ctx)
{
if (has_cpu) {
@@ -177,11 +181,12 @@ static inline int update_task_data(struct task_struct *task)
return -1;
p = bpf_map_lookup_elem(&task_data, &pid);
- if (p == NULL) {
+ if (p == NULL && !task_map_full) {
struct contention_task_data data = {};
BPF_CORE_READ_STR_INTO(&data.comm, task, comm);
- bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST);
+ if (bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST) == -E2BIG)
+ task_map_full = 1;
}
return 0;
@@ -370,6 +375,12 @@ int contention_end(u64 *ctx)
data = bpf_map_lookup_elem(&lock_stat, &key);
if (!data) {
+ if (data_map_full) {
+ bpf_map_delete_elem(&tstamp, &pid);
+ __sync_fetch_and_add(&data_fail, 1);
+ return 0;
+ }
+
struct contention_data first = {
.total_time = duration,
.max_time = duration,
@@ -377,12 +388,17 @@ int contention_end(u64 *ctx)
.count = 1,
.flags = pelem->flags,
};
+ int err;
if (aggr_mode == LOCK_AGGR_ADDR)
first.flags |= check_lock_type(pelem->lock, pelem->flags);
- if (bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST) < 0)
+ err = bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
+ if (err < 0) {
+ if (err == -E2BIG)
+ data_map_full = 1;
__sync_fetch_and_add(&data_fail, 1);
+ }
bpf_map_delete_elem(&tstamp, &pid);
return 0;
}
--
2.40.0.577.gac1e443424-goog
Users often want to change the map size; let's add a short option (-M)
for that.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/Documentation/perf-lock.txt | 1 +
tools/perf/builtin-lock.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
index 37aae194a2a1..b5e5d088d51c 100644
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@@ -155,6 +155,7 @@ CONTENTION OPTIONS
--tid=<value>::
Record events on existing thread ID (comma separated list).
+-M::
--map-nr-entries=<value>::
Maximum number of BPF map entries (default: 10240).
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 695ce6bd64f7..ef56bf90058d 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -2294,7 +2294,7 @@ int cmd_lock(int argc, const char **argv)
"Trace on existing process id"),
OPT_STRING(0, "tid", &target.tid, "tid",
"Trace on existing thread id (exclusive to --pid)"),
- OPT_CALLBACK(0, "map-nr-entries", &bpf_map_entries, "num",
+ OPT_CALLBACK('M', "map-nr-entries", &bpf_map_entries, "num",
"Max number of BPF map entries", parse_map_entry),
OPT_CALLBACK(0, "max-stack", &max_stack_depth, "num",
"Set the maximum stack depth when collecting lopck contention, "
--
2.40.0.577.gac1e443424-goog
When the -E option is used, it only prints the given number of entries,
but the event stat at the end should have the numbers for all entries.
Likewise, the -S option will hide entries that don't have the named
function in the callstack. Also update the event stat for them.
Signed-off-by: Namhyung Kim <[email protected]>
---
tools/perf/builtin-lock.c | 11 +++++++++++
tools/perf/util/bpf_lock_contention.c | 4 +++-
tools/perf/util/lock-contention.h | 1 +
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 01b318d6c80a..7742fa255c44 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -1726,6 +1726,17 @@ static void print_contention_result(struct lock_contention *con)
break;
}
+ if (print_nr_entries) {
+ /* update the total/bad stats */
+ while ((st = pop_from_result())) {
+ total += use_bpf ? st->nr_contended : 1;
+ if (st->broken)
+ bad++;
+ }
+ }
+ /* some entries are collected but hidden by the callstack filter */
+ total += con->nr_filtered;
+
if (use_bpf)
print_bpf_events(total, &con->fails);
else
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 0071058ac3d2..ea4f697d2a9f 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -299,8 +299,10 @@ int lock_contention_read(struct lock_contention *con)
if (con->save_callstack) {
bpf_map_lookup_elem(stack, &key.stack_id, stack_trace);
- if (!match_callstack_filter(machine, stack_trace))
+ if (!match_callstack_filter(machine, stack_trace)) {
+ con->nr_filtered += data.count;
goto next;
+ }
}
switch (con->aggr_mode) {
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index 3ed1cad370fc..fa16532c971c 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -141,6 +141,7 @@ struct lock_contention {
int stack_skip;
int aggr_mode;
int owner;
+ int nr_filtered;
bool save_callstack;
};
--
2.40.0.577.gac1e443424-goog
On Thu, Apr 6, 2023 at 2:06 PM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases. It was running the task aggregation mode (-t) at the moment
> and there were lots of tasks contending each other.
>
> It turned out that the hash map update is a problem. The result is saved
> in the lock_stat hash map which is pre-allocated. The BPF program never
> deletes data in the map, but just adds. But if the map is full, (try to)
> update the map becomes a very heavy operation - since it needs to check
> every CPU's freelist to get a new node to save the result. But we know
> it'd fail when the map is full. No need to update then.
>
> I've checked it on my 64 CPU machine with this.
>
> $ perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 2.825 [sec]
>
> And I used the task mode, so that it can guarantee the map is full.
> The default map entry size is 16K and this workload has 40K tasks.
>
> Before:
> $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 11.299 [sec]
> contended total wait max wait avg wait pid comm
>
> 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging
> 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging
> 177 66.35 ms 12.08 ms 374.88 us 1220416 node
>
> After:
> $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 3.044 [sec]
> contended total wait max wait avg wait pid comm
>
> 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging
> 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging
> 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging
>
> === output for debug ===
>
> bad: 1164137, total: 2253341
> bad rate: 51.66 %
> histogram of failure reasons
> task: 0
> stack: 0
> time: 0
> data: 1164137
>
> The first few patches are small cleanups and fixes. You can get the code
> from 'perf/lock-map-v1' branch in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
> perf lock contention: Simplify parse_lock_type()
> perf lock contention: Use -M for --map-nr-entries
> perf lock contention: Update default map size to 16384
> perf lock contention: Add data failure stat
> perf lock contention: Update total/bad stats for hidden entries
> perf lock contention: Revise needs_callstack() condition
> perf lock contention: Do not try to update if hash map is full
Series:
Acked-by: Ian Rogers <[email protected]>
Thanks,
Ian
> tools/perf/Documentation/perf-lock.txt | 4 +-
> tools/perf/builtin-lock.c | 64 ++++++++-----------
> tools/perf/util/bpf_lock_contention.c | 7 +-
> .../perf/util/bpf_skel/lock_contention.bpf.c | 29 +++++++--
> tools/perf/util/bpf_skel/lock_data.h | 3 +
> tools/perf/util/lock-contention.h | 2 +
> 6 files changed, 60 insertions(+), 49 deletions(-)
>
>
> base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
> --
> 2.40.0.577.gac1e443424-goog
>
On Thu, Apr 06, 2023 at 02:06:04PM -0700, Namhyung Kim wrote:
> Hello,
>
> I got a report that the overhead of perf lock contention is too big in
> some cases. It was running the task aggregation mode (-t) at the moment
> and there were lots of tasks contending each other.
>
> It turned out that the hash map update is a problem. The result is saved
> in the lock_stat hash map which is pre-allocated. The BPF program never
> deletes data in the map, but just adds. But if the map is full, (try to)
> update the map becomes a very heavy operation - since it needs to check
> every CPU's freelist to get a new node to save the result. But we know
> it'd fail when the map is full. No need to update then.
Thanks, applied.
- Arnaldo
> I've checked it on my 64 CPU machine with this.
>
> $ perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 2.825 [sec]
>
> And I used the task mode, so that it can guarantee the map is full.
> The default map entry size is 16K and this workload has 40K tasks.
>
> Before:
> $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 11.299 [sec]
> contended total wait max wait avg wait pid comm
>
> 19284 3.51 s 3.70 ms 181.91 us 1305863 sched-messaging
> 243 84.09 ms 466.67 us 346.04 us 1336608 sched-messaging
> 177 66.35 ms 12.08 ms 374.88 us 1220416 node
>
> After:
> $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1000 groups == 40000 processes run
>
> Total time: 3.044 [sec]
> contended total wait max wait avg wait pid comm
>
> 18743 591.92 ms 442.96 us 31.58 us 1431454 sched-messaging
> 51 210.64 ms 207.45 ms 4.13 ms 1468724 sched-messaging
> 81 68.61 ms 65.79 ms 847.07 us 1463183 sched-messaging
>
> === output for debug ===
>
> bad: 1164137, total: 2253341
> bad rate: 51.66 %
> histogram of failure reasons
> task: 0
> stack: 0
> time: 0
> data: 1164137
>
> The first few patches are small cleanups and fixes. You can get the code
> from 'perf/lock-map-v1' branch in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Thanks,
> Namhyung
>
> Namhyung Kim (7):
> perf lock contention: Simplify parse_lock_type()
> perf lock contention: Use -M for --map-nr-entries
> perf lock contention: Update default map size to 16384
> perf lock contention: Add data failure stat
> perf lock contention: Update total/bad stats for hidden entries
> perf lock contention: Revise needs_callstack() condition
> perf lock contention: Do not try to update if hash map is full
>
> tools/perf/Documentation/perf-lock.txt | 4 +-
> tools/perf/builtin-lock.c | 64 ++++++++-----------
> tools/perf/util/bpf_lock_contention.c | 7 +-
> .../perf/util/bpf_skel/lock_contention.bpf.c | 29 +++++++--
> tools/perf/util/bpf_skel/lock_data.h | 3 +
> tools/perf/util/lock-contention.h | 2 +
> 6 files changed, 60 insertions(+), 49 deletions(-)
>
>
> base-commit: e5116f46d44b72ede59a6923829f68a8b8f84e76
> --
> 2.40.0.577.gac1e443424-goog
>
--
- Arnaldo