From: Feng Zhou <[email protected]>
This patch add is_empty in pcpu_freelist_head to check freelist
having free or not. If having, grab spin_lock, or check next cpu's
freelist.
Before patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 975345 events per sec
4:hash_map_perf pre-alloc 855367 events per sec
12:hash_map_perf pre-alloc 860862 events per sec
8:hash_map_perf pre-alloc 849561 events per sec
3:hash_map_perf pre-alloc 849074 events per sec
6:hash_map_perf pre-alloc 847120 events per sec
10:hash_map_perf pre-alloc 845047 events per sec
5:hash_map_perf pre-alloc 841266 events per sec
14:hash_map_perf pre-alloc 849740 events per sec
2:hash_map_perf pre-alloc 839598 events per sec
9:hash_map_perf pre-alloc 838695 events per sec
11:hash_map_perf pre-alloc 845390 events per sec
7:hash_map_perf pre-alloc 834865 events per sec
13:hash_map_perf pre-alloc 842619 events per sec
1:hash_map_perf pre-alloc 804231 events per sec
15:hash_map_perf pre-alloc 795314 events per sec
hash_map the worst: no free
./map_perf_test 2048
6:worse hash_map_perf pre-alloc 28628 events per sec
5:worse hash_map_perf pre-alloc 28553 events per sec
11:worse hash_map_perf pre-alloc 28543 events per sec
3:worse hash_map_perf pre-alloc 28444 events per sec
1:worse hash_map_perf pre-alloc 28418 events per sec
7:worse hash_map_perf pre-alloc 28427 events per sec
13:worse hash_map_perf pre-alloc 28330 events per sec
14:worse hash_map_perf pre-alloc 28263 events per sec
9:worse hash_map_perf pre-alloc 28211 events per sec
15:worse hash_map_perf pre-alloc 28193 events per sec
12:worse hash_map_perf pre-alloc 28190 events per sec
10:worse hash_map_perf pre-alloc 28129 events per sec
8:worse hash_map_perf pre-alloc 28116 events per sec
4:worse hash_map_perf pre-alloc 27906 events per sec
2:worse hash_map_perf pre-alloc 27801 events per sec
0:worse hash_map_perf pre-alloc 27416 events per sec
3:worse hash_map_perf pre-alloc 28188 events per sec
ftrace trace
0) | htab_map_update_elem() {
0) 0.198 us | migrate_disable();
0) | _raw_spin_lock_irqsave() {
0) 0.157 us | preempt_count_add();
0) 0.538 us | }
0) 0.260 us | lookup_elem_raw();
0) | alloc_htab_elem() {
0) | __pcpu_freelist_pop() {
0) | _raw_spin_lock() {
0) 0.152 us | preempt_count_add();
0) 0.352 us | native_queued_spin_lock_slowpath();
0) 1.065 us | }
| ...
0) | _raw_spin_unlock() {
0) 0.254 us | preempt_count_sub();
0) 0.555 us | }
0) + 25.188 us | }
0) + 25.486 us | }
0) | _raw_spin_unlock_irqrestore() {
0) 0.155 us | preempt_count_sub();
0) 0.454 us | }
0) 0.148 us | migrate_enable();
0) + 28.439 us | }
The test machine is 16C, trying to get spin_lock 17 times, in addition
to 16c, there is an extralist.
after patch: hash_map performance
./map_perf_test 1
0:hash_map_perf pre-alloc 969348 events per sec
10:hash_map_perf pre-alloc 906526 events per sec
11:hash_map_perf pre-alloc 904557 events per sec
9:hash_map_perf pre-alloc 902384 events per sec
15:hash_map_perf pre-alloc 912287 events per sec
14:hash_map_perf pre-alloc 905689 events per sec
12:hash_map_perf pre-alloc 903680 events per sec
13:hash_map_perf pre-alloc 902631 events per sec
8:hash_map_perf pre-alloc 875369 events per sec
4:hash_map_perf pre-alloc 862808 events per sec
1:hash_map_perf pre-alloc 857218 events per sec
2:hash_map_perf pre-alloc 852875 events per sec
5:hash_map_perf pre-alloc 846497 events per sec
6:hash_map_perf pre-alloc 828467 events per sec
3:hash_map_perf pre-alloc 812542 events per sec
7:hash_map_perf pre-alloc 805336 events per sec
hash_map worst: no free
./map_perf_test 2048
7:worse hash_map_perf pre-alloc 391104 events per sec
4:worse hash_map_perf pre-alloc 388073 events per sec
5:worse hash_map_perf pre-alloc 387038 events per sec
1:worse hash_map_perf pre-alloc 386546 events per sec
0:worse hash_map_perf pre-alloc 384590 events per sec
11:worse hash_map_perf pre-alloc 379378 events per sec
10:worse hash_map_perf pre-alloc 375480 events per sec
12:worse hash_map_perf pre-alloc 372394 events per sec
6:worse hash_map_perf pre-alloc 367692 events per sec
3:worse hash_map_perf pre-alloc 363970 events per sec
9:worse hash_map_perf pre-alloc 364008 events per sec
8:worse hash_map_perf pre-alloc 363759 events per sec
2:worse hash_map_perf pre-alloc 360743 events per sec
14:worse hash_map_perf pre-alloc 361195 events per sec
13:worse hash_map_perf pre-alloc 360276 events per sec
15:worse hash_map_perf pre-alloc 360057 events per sec
0:worse hash_map_perf pre-alloc 378177 events per sec
ftrace trace
0) | htab_map_update_elem() {
0) 0.317 us | migrate_disable();
0) | _raw_spin_lock_irqsave() {
0) 0.260 us | preempt_count_add();
0) 1.803 us | }
0) 0.276 us | lookup_elem_raw();
0) | alloc_htab_elem() {
0) 0.586 us | __pcpu_freelist_pop();
0) 0.945 us | }
0) | _raw_spin_unlock_irqrestore() {
0) 0.160 us | preempt_count_sub();
0) 0.972 us | }
0) 0.657 us | migrate_enable();
0) 8.669 us | }
It can be seen that after adding this patch, the map performance is
almost not degraded, and when free=0, first check is_empty instead of
directly acquiring spin_lock.
As for why to add is_empty instead of directly judging head->first, my
understanding is this, head->first is frequently modified during updating
map, which will lead to invalid other cpus's cache, and is_empty is after
freelist having no free elems will be changed, the performance will be better.
Co-developed-by: Chengming Zhou <[email protected]>
Signed-off-by: Chengming Zhou <[email protected]>
Signed-off-by: Feng Zhou <[email protected]>
---
kernel/bpf/percpu_freelist.c | 28 +++++++++++++++++++++++++---
kernel/bpf/percpu_freelist.h | 1 +
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 3d897de89061..f83eb63720d4 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -16,9 +16,11 @@ int pcpu_freelist_init(struct pcpu_freelist *s)
raw_spin_lock_init(&head->lock);
head->first = NULL;
+ head->is_empty = true;
}
raw_spin_lock_init(&s->extralist.lock);
s->extralist.first = NULL;
+ s->extralist.is_empty = true;
return 0;
}
@@ -32,6 +34,8 @@ static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
{
node->next = head->first;
head->first = node;
+ if (head->is_empty)
+ head->is_empty = false;
}
static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head,
@@ -130,14 +134,19 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
orig_cpu = cpu = raw_smp_processor_id();
while (1) {
head = per_cpu_ptr(s->freelist, cpu);
+ if (head->is_empty)
+ goto next_cpu;
raw_spin_lock(&head->lock);
node = head->first;
if (node) {
head->first = node->next;
+ if (!head->first)
+ head->is_empty = true;
raw_spin_unlock(&head->lock);
return node;
}
raw_spin_unlock(&head->lock);
+next_cpu:
cpu = cpumask_next(cpu, cpu_possible_mask);
if (cpu >= nr_cpu_ids)
cpu = 0;
@@ -146,10 +155,15 @@ static struct pcpu_freelist_node *___pcpu_freelist_pop(struct pcpu_freelist *s)
}
/* per cpu lists are all empty, try extralist */
+ if (s->extralist.is_empty)
+ return NULL;
raw_spin_lock(&s->extralist.lock);
node = s->extralist.first;
- if (node)
+ if (node) {
s->extralist.first = node->next;
+ if (!s->extralist.first)
+ s->extralist.is_empty = true;
+ }
raw_spin_unlock(&s->extralist.lock);
return node;
}
@@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
orig_cpu = cpu = raw_smp_processor_id();
while (1) {
head = per_cpu_ptr(s->freelist, cpu);
+ if (head->is_empty)
+ goto next_cpu;
if (raw_spin_trylock(&head->lock)) {
node = head->first;
if (node) {
head->first = node->next;
+ if (!head->first)
+ head->is_empty = true;
raw_spin_unlock(&head->lock);
return node;
}
raw_spin_unlock(&head->lock);
}
+next_cpu:
cpu = cpumask_next(cpu, cpu_possible_mask);
if (cpu >= nr_cpu_ids)
cpu = 0;
@@ -181,11 +200,14 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
}
/* cannot pop from per cpu lists, try extralist */
- if (!raw_spin_trylock(&s->extralist.lock))
+ if (s->extralist.is_empty || !raw_spin_trylock(&s->extralist.lock))
return NULL;
node = s->extralist.first;
- if (node)
+ if (node) {
s->extralist.first = node->next;
+ if (!s->extralist.first)
+ s->extralist.is_empty = true;
+ }
raw_spin_unlock(&s->extralist.lock);
return node;
}
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
index 3c76553cfe57..9e4545631ed5 100644
--- a/kernel/bpf/percpu_freelist.h
+++ b/kernel/bpf/percpu_freelist.h
@@ -9,6 +9,7 @@
struct pcpu_freelist_head {
struct pcpu_freelist_node *first;
raw_spinlock_t lock;
+ bool is_empty;
};
struct pcpu_freelist {
--
2.20.1
On 5/30/22 11:13 AM, Feng zhou wrote:
> From: Feng Zhou <[email protected]>
>
> This patch add is_empty in pcpu_freelist_head to check freelist
> having free or not. If having, grab spin_lock, or check next cpu's
> freelist.
>
> Before patch: hash_map performance
> ./map_perf_test 1
> 0:hash_map_perf pre-alloc 975345 events per sec
> 4:hash_map_perf pre-alloc 855367 events per sec
> 12:hash_map_perf pre-alloc 860862 events per sec
> 8:hash_map_perf pre-alloc 849561 events per sec
> 3:hash_map_perf pre-alloc 849074 events per sec
> 6:hash_map_perf pre-alloc 847120 events per sec
> 10:hash_map_perf pre-alloc 845047 events per sec
> 5:hash_map_perf pre-alloc 841266 events per sec
> 14:hash_map_perf pre-alloc 849740 events per sec
> 2:hash_map_perf pre-alloc 839598 events per sec
> 9:hash_map_perf pre-alloc 838695 events per sec
> 11:hash_map_perf pre-alloc 845390 events per sec
> 7:hash_map_perf pre-alloc 834865 events per sec
> 13:hash_map_perf pre-alloc 842619 events per sec
> 1:hash_map_perf pre-alloc 804231 events per sec
> 15:hash_map_perf pre-alloc 795314 events per sec
>
> hash_map the worst: no free
> ./map_perf_test 2048
> 6:worse hash_map_perf pre-alloc 28628 events per sec
> 5:worse hash_map_perf pre-alloc 28553 events per sec
> 11:worse hash_map_perf pre-alloc 28543 events per sec
> 3:worse hash_map_perf pre-alloc 28444 events per sec
> 1:worse hash_map_perf pre-alloc 28418 events per sec
> 7:worse hash_map_perf pre-alloc 28427 events per sec
> 13:worse hash_map_perf pre-alloc 28330 events per sec
> 14:worse hash_map_perf pre-alloc 28263 events per sec
> 9:worse hash_map_perf pre-alloc 28211 events per sec
> 15:worse hash_map_perf pre-alloc 28193 events per sec
> 12:worse hash_map_perf pre-alloc 28190 events per sec
> 10:worse hash_map_perf pre-alloc 28129 events per sec
> 8:worse hash_map_perf pre-alloc 28116 events per sec
> 4:worse hash_map_perf pre-alloc 27906 events per sec
> 2:worse hash_map_perf pre-alloc 27801 events per sec
> 0:worse hash_map_perf pre-alloc 27416 events per sec
> 3:worse hash_map_perf pre-alloc 28188 events per sec
>
> ftrace trace
>
> 0) | htab_map_update_elem() {
> 0) 0.198 us | migrate_disable();
> 0) | _raw_spin_lock_irqsave() {
> 0) 0.157 us | preempt_count_add();
> 0) 0.538 us | }
> 0) 0.260 us | lookup_elem_raw();
> 0) | alloc_htab_elem() {
> 0) | __pcpu_freelist_pop() {
> 0) | _raw_spin_lock() {
> 0) 0.152 us | preempt_count_add();
> 0) 0.352 us | native_queued_spin_lock_slowpath();
> 0) 1.065 us | }
> | ...
> 0) | _raw_spin_unlock() {
> 0) 0.254 us | preempt_count_sub();
> 0) 0.555 us | }
> 0) + 25.188 us | }
> 0) + 25.486 us | }
> 0) | _raw_spin_unlock_irqrestore() {
> 0) 0.155 us | preempt_count_sub();
> 0) 0.454 us | }
> 0) 0.148 us | migrate_enable();
> 0) + 28.439 us | }
>
> The test machine is 16C, trying to get spin_lock 17 times, in addition
> to 16c, there is an extralist.
>
> after patch: hash_map performance
> ./map_perf_test 1
> 0:hash_map_perf pre-alloc 969348 events per sec
> 10:hash_map_perf pre-alloc 906526 events per sec
> 11:hash_map_perf pre-alloc 904557 events per sec
> 9:hash_map_perf pre-alloc 902384 events per sec
> 15:hash_map_perf pre-alloc 912287 events per sec
> 14:hash_map_perf pre-alloc 905689 events per sec
> 12:hash_map_perf pre-alloc 903680 events per sec
> 13:hash_map_perf pre-alloc 902631 events per sec
> 8:hash_map_perf pre-alloc 875369 events per sec
> 4:hash_map_perf pre-alloc 862808 events per sec
> 1:hash_map_perf pre-alloc 857218 events per sec
> 2:hash_map_perf pre-alloc 852875 events per sec
> 5:hash_map_perf pre-alloc 846497 events per sec
> 6:hash_map_perf pre-alloc 828467 events per sec
> 3:hash_map_perf pre-alloc 812542 events per sec
> 7:hash_map_perf pre-alloc 805336 events per sec
>
> hash_map worst: no free
> ./map_perf_test 2048
> 7:worse hash_map_perf pre-alloc 391104 events per sec
> 4:worse hash_map_perf pre-alloc 388073 events per sec
> 5:worse hash_map_perf pre-alloc 387038 events per sec
> 1:worse hash_map_perf pre-alloc 386546 events per sec
> 0:worse hash_map_perf pre-alloc 384590 events per sec
> 11:worse hash_map_perf pre-alloc 379378 events per sec
> 10:worse hash_map_perf pre-alloc 375480 events per sec
> 12:worse hash_map_perf pre-alloc 372394 events per sec
> 6:worse hash_map_perf pre-alloc 367692 events per sec
> 3:worse hash_map_perf pre-alloc 363970 events per sec
> 9:worse hash_map_perf pre-alloc 364008 events per sec
> 8:worse hash_map_perf pre-alloc 363759 events per sec
> 2:worse hash_map_perf pre-alloc 360743 events per sec
> 14:worse hash_map_perf pre-alloc 361195 events per sec
> 13:worse hash_map_perf pre-alloc 360276 events per sec
> 15:worse hash_map_perf pre-alloc 360057 events per sec
> 0:worse hash_map_perf pre-alloc 378177 events per sec
>
> ftrace trace
> 0) | htab_map_update_elem() {
> 0) 0.317 us | migrate_disable();
> 0) | _raw_spin_lock_irqsave() {
> 0) 0.260 us | preempt_count_add();
> 0) 1.803 us | }
> 0) 0.276 us | lookup_elem_raw();
> 0) | alloc_htab_elem() {
> 0) 0.586 us | __pcpu_freelist_pop();
> 0) 0.945 us | }
> 0) | _raw_spin_unlock_irqrestore() {
> 0) 0.160 us | preempt_count_sub();
> 0) 0.972 us | }
> 0) 0.657 us | migrate_enable();
> 0) 8.669 us | }
>
> It can be seen that after adding this patch, the map performance is
> almost not degraded, and when free=0, first check is_empty instead of
> directly acquiring spin_lock.
>
> As for why to add is_empty instead of directly judging head->first, my
> understanding is this, head->first is frequently modified during updating
> map, which will lead to invalid other cpus's cache, and is_empty is after
> freelist having no free elems will be changed, the performance will be better.
>
> Co-developed-by: Chengming Zhou <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>
> Signed-off-by: Feng Zhou <[email protected]>
> ---
> kernel/bpf/percpu_freelist.c | 28 +++++++++++++++++++++++++---
> kernel/bpf/percpu_freelist.h | 1 +
> 2 files changed, 26 insertions(+), 3 deletions(-)
[...]
> /* per cpu lists are all empty, try extralist */
> + if (s->extralist.is_empty)
> + return NULL;
> raw_spin_lock(&s->extralist.lock);
> node = s->extralist.first;
> - if (node)
> + if (node) {
> s->extralist.first = node->next;
> + if (!s->extralist.first)
> + s->extralist.is_empty = true;
> + }
> raw_spin_unlock(&s->extralist.lock);
> return node;
> }
> @@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
> orig_cpu = cpu = raw_smp_processor_id();
> while (1) {
> head = per_cpu_ptr(s->freelist, cpu);
> + if (head->is_empty)
This should use READ_ONCE/WRITE_ONCE pair for head->is_empty.
> + goto next_cpu;
> if (raw_spin_trylock(&head->lock)) {
> node = head->first;
> if (node) {
> head->first = node->next;
> + if (!head->first)
> + head->is_empty = true;
> raw_spin_unlock(&head->lock);
> return node;
> }
> raw_spin_unlock(&head->lock);
> }
> +next_cpu:
> cpu = cpumask_next(cpu, cpu_possible_mask);
> if (cpu >= nr_cpu_ids)
> cpu = 0;
在 2022/5/31 上午5:20, Daniel Borkmann 写道:
> On 5/30/22 11:13 AM, Feng zhou wrote:
>> From: Feng Zhou <[email protected]>
>>
>> This patch add is_empty in pcpu_freelist_head to check freelist
>> having free or not. If having, grab spin_lock, or check next cpu's
>> freelist.
>>
>> Before patch: hash_map performance
>> ./map_perf_test 1
>> 0:hash_map_perf pre-alloc 975345 events per sec
>> 4:hash_map_perf pre-alloc 855367 events per sec
>> 12:hash_map_perf pre-alloc 860862 events per sec
>> 8:hash_map_perf pre-alloc 849561 events per sec
>> 3:hash_map_perf pre-alloc 849074 events per sec
>> 6:hash_map_perf pre-alloc 847120 events per sec
>> 10:hash_map_perf pre-alloc 845047 events per sec
>> 5:hash_map_perf pre-alloc 841266 events per sec
>> 14:hash_map_perf pre-alloc 849740 events per sec
>> 2:hash_map_perf pre-alloc 839598 events per sec
>> 9:hash_map_perf pre-alloc 838695 events per sec
>> 11:hash_map_perf pre-alloc 845390 events per sec
>> 7:hash_map_perf pre-alloc 834865 events per sec
>> 13:hash_map_perf pre-alloc 842619 events per sec
>> 1:hash_map_perf pre-alloc 804231 events per sec
>> 15:hash_map_perf pre-alloc 795314 events per sec
>>
>> hash_map the worst: no free
>> ./map_perf_test 2048
>> 6:worse hash_map_perf pre-alloc 28628 events per sec
>> 5:worse hash_map_perf pre-alloc 28553 events per sec
>> 11:worse hash_map_perf pre-alloc 28543 events per sec
>> 3:worse hash_map_perf pre-alloc 28444 events per sec
>> 1:worse hash_map_perf pre-alloc 28418 events per sec
>> 7:worse hash_map_perf pre-alloc 28427 events per sec
>> 13:worse hash_map_perf pre-alloc 28330 events per sec
>> 14:worse hash_map_perf pre-alloc 28263 events per sec
>> 9:worse hash_map_perf pre-alloc 28211 events per sec
>> 15:worse hash_map_perf pre-alloc 28193 events per sec
>> 12:worse hash_map_perf pre-alloc 28190 events per sec
>> 10:worse hash_map_perf pre-alloc 28129 events per sec
>> 8:worse hash_map_perf pre-alloc 28116 events per sec
>> 4:worse hash_map_perf pre-alloc 27906 events per sec
>> 2:worse hash_map_perf pre-alloc 27801 events per sec
>> 0:worse hash_map_perf pre-alloc 27416 events per sec
>> 3:worse hash_map_perf pre-alloc 28188 events per sec
>>
>> ftrace trace
>>
>> 0) | htab_map_update_elem() {
>> 0) 0.198 us | migrate_disable();
>> 0) | _raw_spin_lock_irqsave() {
>> 0) 0.157 us | preempt_count_add();
>> 0) 0.538 us | }
>> 0) 0.260 us | lookup_elem_raw();
>> 0) | alloc_htab_elem() {
>> 0) | __pcpu_freelist_pop() {
>> 0) | _raw_spin_lock() {
>> 0) 0.152 us | preempt_count_add();
>> 0) 0.352 us | native_queued_spin_lock_slowpath();
>> 0) 1.065 us | }
>> | ...
>> 0) | _raw_spin_unlock() {
>> 0) 0.254 us | preempt_count_sub();
>> 0) 0.555 us | }
>> 0) + 25.188 us | }
>> 0) + 25.486 us | }
>> 0) | _raw_spin_unlock_irqrestore() {
>> 0) 0.155 us | preempt_count_sub();
>> 0) 0.454 us | }
>> 0) 0.148 us | migrate_enable();
>> 0) + 28.439 us | }
>>
>> The test machine is 16C, trying to get spin_lock 17 times, in addition
>> to 16c, there is an extralist.
>>
>> after patch: hash_map performance
>> ./map_perf_test 1
>> 0:hash_map_perf pre-alloc 969348 events per sec
>> 10:hash_map_perf pre-alloc 906526 events per sec
>> 11:hash_map_perf pre-alloc 904557 events per sec
>> 9:hash_map_perf pre-alloc 902384 events per sec
>> 15:hash_map_perf pre-alloc 912287 events per sec
>> 14:hash_map_perf pre-alloc 905689 events per sec
>> 12:hash_map_perf pre-alloc 903680 events per sec
>> 13:hash_map_perf pre-alloc 902631 events per sec
>> 8:hash_map_perf pre-alloc 875369 events per sec
>> 4:hash_map_perf pre-alloc 862808 events per sec
>> 1:hash_map_perf pre-alloc 857218 events per sec
>> 2:hash_map_perf pre-alloc 852875 events per sec
>> 5:hash_map_perf pre-alloc 846497 events per sec
>> 6:hash_map_perf pre-alloc 828467 events per sec
>> 3:hash_map_perf pre-alloc 812542 events per sec
>> 7:hash_map_perf pre-alloc 805336 events per sec
>>
>> hash_map worst: no free
>> ./map_perf_test 2048
>> 7:worse hash_map_perf pre-alloc 391104 events per sec
>> 4:worse hash_map_perf pre-alloc 388073 events per sec
>> 5:worse hash_map_perf pre-alloc 387038 events per sec
>> 1:worse hash_map_perf pre-alloc 386546 events per sec
>> 0:worse hash_map_perf pre-alloc 384590 events per sec
>> 11:worse hash_map_perf pre-alloc 379378 events per sec
>> 10:worse hash_map_perf pre-alloc 375480 events per sec
>> 12:worse hash_map_perf pre-alloc 372394 events per sec
>> 6:worse hash_map_perf pre-alloc 367692 events per sec
>> 3:worse hash_map_perf pre-alloc 363970 events per sec
>> 9:worse hash_map_perf pre-alloc 364008 events per sec
>> 8:worse hash_map_perf pre-alloc 363759 events per sec
>> 2:worse hash_map_perf pre-alloc 360743 events per sec
>> 14:worse hash_map_perf pre-alloc 361195 events per sec
>> 13:worse hash_map_perf pre-alloc 360276 events per sec
>> 15:worse hash_map_perf pre-alloc 360057 events per sec
>> 0:worse hash_map_perf pre-alloc 378177 events per sec
>>
>> ftrace trace
>> 0) | htab_map_update_elem() {
>> 0) 0.317 us | migrate_disable();
>> 0) | _raw_spin_lock_irqsave() {
>> 0) 0.260 us | preempt_count_add();
>> 0) 1.803 us | }
>> 0) 0.276 us | lookup_elem_raw();
>> 0) | alloc_htab_elem() {
>> 0) 0.586 us | __pcpu_freelist_pop();
>> 0) 0.945 us | }
>> 0) | _raw_spin_unlock_irqrestore() {
>> 0) 0.160 us | preempt_count_sub();
>> 0) 0.972 us | }
>> 0) 0.657 us | migrate_enable();
>> 0) 8.669 us | }
>>
>> It can be seen that after adding this patch, the map performance is
>> almost not degraded, and when free=0, first check is_empty instead of
>> directly acquiring spin_lock.
>>
>> As for why to add is_empty instead of directly judging head->first, my
>> understanding is this, head->first is frequently modified during
>> updating
>> map, which will lead to invalid other cpus's cache, and is_empty is
>> after
>> freelist having no free elems will be changed, the performance will
>> be better.
>>
>> Co-developed-by: Chengming Zhou <[email protected]>
>> Signed-off-by: Chengming Zhou <[email protected]>
>> Signed-off-by: Feng Zhou <[email protected]>
>> ---
>> kernel/bpf/percpu_freelist.c | 28 +++++++++++++++++++++++++---
>> kernel/bpf/percpu_freelist.h | 1 +
>> 2 files changed, 26 insertions(+), 3 deletions(-)
> [...]
>> /* per cpu lists are all empty, try extralist */
>> + if (s->extralist.is_empty)
>> + return NULL;
>> raw_spin_lock(&s->extralist.lock);
>> node = s->extralist.first;
>> - if (node)
>> + if (node) {
>> s->extralist.first = node->next;
>> + if (!s->extralist.first)
>> + s->extralist.is_empty = true;
>> + }
>> raw_spin_unlock(&s->extralist.lock);
>> return node;
>> }
>> @@ -164,15 +178,20 @@ ___pcpu_freelist_pop_nmi(struct pcpu_freelist *s)
>> orig_cpu = cpu = raw_smp_processor_id();
>> while (1) {
>> head = per_cpu_ptr(s->freelist, cpu);
>> + if (head->is_empty)
>
> This should use READ_ONCE/WRITE_ONCE pair for head->is_empty.
Yes, will do. Thanks.
>
>> + goto next_cpu;
>> if (raw_spin_trylock(&head->lock)) {
>> node = head->first;
>> if (node) {
>> head->first = node->next;
>> + if (!head->first)
>> + head->is_empty = true;
>> raw_spin_unlock(&head->lock);
>> return node;
>> }
>> raw_spin_unlock(&head->lock);
>> }
>> +next_cpu:
>> cpu = cpumask_next(cpu, cpu_possible_mask);
>> if (cpu >= nr_cpu_ids)
>> cpu = 0;