2021-10-15 16:50:10

by Chengming Zhou

Subject: [PATCH] bpf: use count for prealloc hashtab too

We only use count for the kmalloc hashtab, not for the prealloc hashtab,
because __pcpu_freelist_pop() returns NULL when there are no more elements
in the pcpu freelist.

But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
take each CPU's spinlock, only to find at the end that no element is left.

We hit a bad case on a big system with 96 CPUs where alloc_htab_elem()
could take as long as 1ms. This patch uses count for the prealloc hashtab too,
which avoids traversing and spin_locking all CPUs in this case.

Signed-off-by: Chengming Zhou <[email protected]>
---
kernel/bpf/hashtab.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 32471ba02708..0c432a23aa00 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -855,12 +855,12 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
+	atomic_dec(&htab->count);
 
 	if (htab_is_prealloc(htab)) {
 		check_and_free_timer(htab, l);
 		__pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
-		atomic_dec(&htab->count);
 		l->htab = htab;
 		call_rcu(&l->rcu, htab_elem_free_rcu);
 	}
@@ -938,6 +938,11 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 		} else {
 			struct pcpu_freelist_node *l;
 
+			if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
+				l_new = ERR_PTR(-E2BIG);
+				goto dec_count;
+			}
+
 			l = __pcpu_freelist_pop(&htab->freelist);
 			if (!l)
 				return ERR_PTR(-E2BIG);
--
2.11.0


2021-10-18 01:02:16

by Alexei Starovoitov

Subject: Re: [PATCH] bpf: use count for prealloc hashtab too

On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
<[email protected]> wrote:
>
> We only use count for kmalloc hashtab not for prealloc hashtab, because
> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>
> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
> spin_lock for all CPUs to find there is no more elem at last.
>
> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
> would last for 1ms. This patch use count for prealloc hashtab too,
> avoid traverse and spin_lock for all CPUs in this case.
>
> Signed-off-by: Chengming Zhou <[email protected]>

It's not clear from the commit log what you're solving.
The atomic inc/dec in the critical path of prealloc maps hurts performance.
That's why it's not used.

2021-10-18 05:52:43

by Chengming Zhou

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
> <[email protected]> wrote:
>>
>> We only use count for kmalloc hashtab not for prealloc hashtab, because
>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>>
>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
>> spin_lock for all CPUs to find there is no more elem at last.
>>
>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
>> would last for 1ms. This patch use count for prealloc hashtab too,
>> avoid traverse and spin_lock for all CPUs in this case.
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> It's not clear from the commit log what you're solving.
> The atomic inc/dec in critical path of prealloc maps hurts performance.
> That's why it's not used.
>
Thanks for the explanation. What I'm solving is that when the hash table has no free
elements left, we shouldn't need to call __pcpu_freelist_pop() and have it traverse and
spin_lock all CPUs just to discover that. The ftrace output of this bad case is below:

50) | htab_map_update_elem() {
50) 0.329 us | _raw_spin_lock_irqsave();
50) 0.063 us | lookup_elem_raw();
50) | alloc_htab_elem() {
50) | pcpu_freelist_pop() {
50) 0.209 us | _raw_spin_lock();
50) 0.264 us | _raw_spin_lock();
50) 0.231 us | _raw_spin_lock();
50) 0.168 us | _raw_spin_lock();
50) 0.168 us | _raw_spin_lock();
50) 0.300 us | _raw_spin_lock();
50) 0.263 us | _raw_spin_lock();
50) 0.304 us | _raw_spin_lock();
50) 0.168 us | _raw_spin_lock();
50) 0.177 us | _raw_spin_lock();
50) 0.235 us | _raw_spin_lock();
50) 0.162 us | _raw_spin_lock();
50) 0.186 us | _raw_spin_lock();
50) 0.185 us | _raw_spin_lock();
50) 0.315 us | _raw_spin_lock();
50) 0.172 us | _raw_spin_lock();
50) 0.180 us | _raw_spin_lock();
50) 0.173 us | _raw_spin_lock();
50) 0.176 us | _raw_spin_lock();
50) 0.261 us | _raw_spin_lock();
50) 0.364 us | _raw_spin_lock();
50) 0.180 us | _raw_spin_lock();
50) 0.284 us | _raw_spin_lock();
50) 0.226 us | _raw_spin_lock();
50) 0.210 us | _raw_spin_lock();
50) 0.237 us | _raw_spin_lock();
50) 0.333 us | _raw_spin_lock();
50) 0.295 us | _raw_spin_lock();
50) 0.278 us | _raw_spin_lock();
50) 0.260 us | _raw_spin_lock();
50) 0.224 us | _raw_spin_lock();
50) 0.447 us | _raw_spin_lock();
50) 0.221 us | _raw_spin_lock();
50) 0.320 us | _raw_spin_lock();
50) 0.203 us | _raw_spin_lock();
50) 0.213 us | _raw_spin_lock();
50) 0.242 us | _raw_spin_lock();
50) 0.230 us | _raw_spin_lock();
50) 0.216 us | _raw_spin_lock();
50) 0.525 us | _raw_spin_lock();
50) 0.257 us | _raw_spin_lock();
50) 0.235 us | _raw_spin_lock();
50) 0.269 us | _raw_spin_lock();
50) 0.368 us | _raw_spin_lock();
50) 0.249 us | _raw_spin_lock();
50) 0.217 us | _raw_spin_lock();
50) 0.174 us | _raw_spin_lock();
50) 0.173 us | _raw_spin_lock();
50) 0.161 us | _raw_spin_lock();
50) 0.282 us | _raw_spin_lock();
50) 0.264 us | _raw_spin_lock();
50) 0.160 us | _raw_spin_lock();
50) 0.692 us | _raw_spin_lock();
50) 0.185 us | _raw_spin_lock();
50) 0.157 us | _raw_spin_lock();
50) 0.168 us | _raw_spin_lock();
50) 0.205 us | _raw_spin_lock();
50) 0.189 us | _raw_spin_lock();
50) 0.276 us | _raw_spin_lock();
50) 0.171 us | _raw_spin_lock();
50) 0.390 us | _raw_spin_lock();
50) 0.164 us | _raw_spin_lock();
50) 0.170 us | _raw_spin_lock();
50) 0.188 us | _raw_spin_lock();
50) 0.284 us | _raw_spin_lock();
50) 0.191 us | _raw_spin_lock();
50) 0.412 us | _raw_spin_lock();
50) 0.285 us | _raw_spin_lock();
50) 0.296 us | _raw_spin_lock();
50) 0.315 us | _raw_spin_lock();
50) 0.239 us | _raw_spin_lock();
50) 0.225 us | _raw_spin_lock();
50) 0.258 us | _raw_spin_lock();
50) 0.228 us | _raw_spin_lock();
50) 0.240 us | _raw_spin_lock();
50) 0.297 us | _raw_spin_lock();
50) 0.216 us | _raw_spin_lock();
50) 0.213 us | _raw_spin_lock();
50) 0.225 us | _raw_spin_lock();
50) 0.223 us | _raw_spin_lock();
50) 0.287 us | _raw_spin_lock();
50) 0.258 us | _raw_spin_lock();
50) 0.295 us | _raw_spin_lock();
50) 0.262 us | _raw_spin_lock();
50) 0.325 us | _raw_spin_lock();
50) 0.203 us | _raw_spin_lock();
50) 0.325 us | _raw_spin_lock();
50) 0.255 us | _raw_spin_lock();
50) 0.325 us | _raw_spin_lock();
50) 0.216 us | _raw_spin_lock();
50) 0.232 us | _raw_spin_lock();
50) 0.804 us | _raw_spin_lock();
50) 0.262 us | _raw_spin_lock();
50) 0.242 us | _raw_spin_lock();
50) 0.271 us | _raw_spin_lock();
50) 0.175 us | _raw_spin_lock();
50) + 61.026 us | }
50) + 61.575 us | }
50) 0.051 us | _raw_spin_unlock_irqrestore();
50) + 64.863 us | }
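
For reference, a simplified sketch of the traversal in question, paraphrased from
kernel/bpf/percpu_freelist.c (the real ___pcpu_freelist_pop() has extra details such as
a fallback "extralist", and names here are approximate). When every per-CPU list is
empty, the loop takes and releases one spinlock per possible CPU before it can return
NULL, which is what the trace above shows:

static struct pcpu_freelist_node *pop_sketch(struct pcpu_freelist *s)
{
	struct pcpu_freelist_head *head;
	struct pcpu_freelist_node *node;
	int orig_cpu, cpu;

	orig_cpu = cpu = raw_smp_processor_id();
	while (1) {
		head = per_cpu_ptr(s->freelist, cpu);
		raw_spin_lock(&head->lock);
		node = head->first;
		if (node) {
			/* Found a free element on this CPU's list. */
			head->first = node->next;
			raw_spin_unlock(&head->lock);
			return node;
		}
		raw_spin_unlock(&head->lock);

		/* Nothing here; move on to the next CPU's list. */
		cpu = cpumask_next(cpu, cpu_possible_mask);
		if (cpu >= nr_cpu_ids)
			cpu = 0;
		if (cpu == orig_cpu)
			return NULL;	/* visited every CPU: no free elements */
	}
}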

2021-10-19 02:02:00

by Alexei Starovoitov

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
> > On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> We only use count for kmalloc hashtab not for prealloc hashtab, because
> >> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
> >>
> >> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
> >> spin_lock for all CPUs to find there is no more elem at last.
> >>
> >> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
> >> would last for 1ms. This patch use count for prealloc hashtab too,
> >> avoid traverse and spin_lock for all CPUs in this case.
> >>
> >> Signed-off-by: Chengming Zhou <[email protected]>
> >
> > It's not clear from the commit log what you're solving.
> > The atomic inc/dec in critical path of prealloc maps hurts performance.
> > That's why it's not used.
> >
> Thanks for the explanation, what I'm solving is when hash table hasn't free
> elements, we don't need to call __pcpu_freelist_pop() to traverse and
> spin_lock all CPUs. The ftrace output of this bad case is below:
>
> 50) | htab_map_update_elem() {
> 50) 0.329 us | _raw_spin_lock_irqsave();
> 50) 0.063 us | lookup_elem_raw();
> 50) | alloc_htab_elem() {
> 50) | pcpu_freelist_pop() {
> 50) 0.209 us | _raw_spin_lock();
> 50) 0.264 us | _raw_spin_lock();

This is LRU map. Not hash map.
It will grab spin_locks of other cpus
only if all previous cpus don't have free elements.
Most likely your map is actually full and doesn't have any free elems.
Since it's an lru it will force free an elem eventually.

2021-10-19 03:19:13

by Chengming Zhou

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
>>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>>>>
>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
>>>> spin_lock for all CPUs to find there is no more elem at last.
>>>>
>>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
>>>> would last for 1ms. This patch use count for prealloc hashtab too,
>>>> avoid traverse and spin_lock for all CPUs in this case.
>>>>
>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>
>>> It's not clear from the commit log what you're solving.
>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
>>> That's why it's not used.
>>>
>> Thanks for the explanation, what I'm solving is when hash table hasn't free
>> elements, we don't need to call __pcpu_freelist_pop() to traverse and
>> spin_lock all CPUs. The ftrace output of this bad case is below:
>>
>> 50) | htab_map_update_elem() {
>> 50) 0.329 us | _raw_spin_lock_irqsave();
>> 50) 0.063 us | lookup_elem_raw();
>> 50) | alloc_htab_elem() {
>> 50) | pcpu_freelist_pop() {
>> 50) 0.209 us | _raw_spin_lock();
>> 50) 0.264 us | _raw_spin_lock();
>
> This is LRU map. Not hash map.
> It will grab spin_locks of other cpus
> only if all previous cpus don't have free elements.
> Most likely your map is actually full and doesn't have any free elems.
> Since it's an lru it will force free an elem eventually.
>

Maybe I missed something, but the map_update_elem callback of the LRU map is
htab_lru_map_update_elem(), and the htab_map_update_elem() in the trace above is the
map_update_elem callback of the plain hash map.
Because of the way the percpu freelist used by the hash map is implemented, it
will spin_lock all other CPUs when there are no free elements left.
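
To make the scenario concrete, here is a minimal user-space sketch (illustrative only,
not our production setup; it assumes libbpf's bpf_map_create()/bpf_map_update_elem(),
and the map name and sizes are made up) of driving a preallocated hash map into this
state: once the map is full, every update with a new key has to find out that no free
element exists.

#include <bpf/bpf.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <linux/types.h>

int main(void)
{
	__u32 key;
	__u64 val = 0;
	/* Hash maps are preallocated by default (no BPF_F_NO_PREALLOC). */
	int fd = bpf_map_create(BPF_MAP_TYPE_HASH, "prealloc_demo",
				sizeof(key), sizeof(val), 1024, NULL);

	if (fd < 0) {
		perror("bpf_map_create");
		return 1;
	}

	/* Use up all preallocated elements. */
	for (key = 0; key < 1024; key++)
		bpf_map_update_elem(fd, &key, &val, BPF_NOEXIST);

	/*
	 * Every further update with a fresh key fails with E2BIG, but without
	 * the count check it first walks all per-CPU freelists under their
	 * spinlocks before discovering that.
	 */
	for (key = 1024; key < 1034; key++)
		if (bpf_map_update_elem(fd, &key, &val, BPF_NOEXIST))
			fprintf(stderr, "update key %u: %s\n", key, strerror(errno));

	return 0;
}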

Thanks.

2021-10-19 03:48:46

by Alexei Starovoitov

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
> > On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
> >>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
> >>> <[email protected]> wrote:
> >>>>
> >>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
> >>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
> >>>>
> >>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
> >>>> spin_lock for all CPUs to find there is no more elem at last.
> >>>>
> >>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
> >>>> would last for 1ms. This patch use count for prealloc hashtab too,
> >>>> avoid traverse and spin_lock for all CPUs in this case.
> >>>>
> >>>> Signed-off-by: Chengming Zhou <[email protected]>
> >>>
> >>> It's not clear from the commit log what you're solving.
> >>> The atomic inc/dec in critical path of prealloc maps hurts performance.
> >>> That's why it's not used.
> >>>
> >> Thanks for the explanation, what I'm solving is when hash table hasn't free
> >> elements, we don't need to call __pcpu_freelist_pop() to traverse and
> >> spin_lock all CPUs. The ftrace output of this bad case is below:
> >>
> >> 50) | htab_map_update_elem() {
> >> 50) 0.329 us | _raw_spin_lock_irqsave();
> >> 50) 0.063 us | lookup_elem_raw();
> >> 50) | alloc_htab_elem() {
> >> 50) | pcpu_freelist_pop() {
> >> 50) 0.209 us | _raw_spin_lock();
> >> 50) 0.264 us | _raw_spin_lock();
> >
> > This is LRU map. Not hash map.
> > It will grab spin_locks of other cpus
> > only if all previous cpus don't have free elements.
> > Most likely your map is actually full and doesn't have any free elems.
> > Since it's an lru it will force free an elem eventually.
> >
>
> Maybe I missed something, the map_update_elem function of LRU map is
> htab_lru_map_update_elem() and the htab_map_update_elem() above is the
> map_update_elem function of hash map.
> Because of the implementation of percpu freelist used in hash map, it
> will spin_lock all other CPUs when there is no free elements.

Ahh. Right. Then what's the point of optimizing the error case
at the expense of the fast path?

2021-10-19 04:36:18

by Chengming Zhou

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On 2021/10/19 11:45 AM, Alexei Starovoitov wrote:
> On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
>>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
>>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
>>>>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>>>>>>
>>>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
>>>>>> spin_lock for all CPUs to find there is no more elem at last.
>>>>>>
>>>>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
>>>>>> would last for 1ms. This patch use count for prealloc hashtab too,
>>>>>> avoid traverse and spin_lock for all CPUs in this case.
>>>>>>
>>>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>>>
>>>>> It's not clear from the commit log what you're solving.
>>>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
>>>>> That's why it's not used.
>>>>>
>>>> Thanks for the explanation, what I'm solving is when hash table hasn't free
>>>> elements, we don't need to call __pcpu_freelist_pop() to traverse and
>>>> spin_lock all CPUs. The ftrace output of this bad case is below:
>>>>
>>>> 50) | htab_map_update_elem() {
>>>> 50) 0.329 us | _raw_spin_lock_irqsave();
>>>> 50) 0.063 us | lookup_elem_raw();
>>>> 50) | alloc_htab_elem() {
>>>> 50) | pcpu_freelist_pop() {
>>>> 50) 0.209 us | _raw_spin_lock();
>>>> 50) 0.264 us | _raw_spin_lock();
>>>
>>> This is LRU map. Not hash map.
>>> It will grab spin_locks of other cpus
>>> only if all previous cpus don't have free elements.
>>> Most likely your map is actually full and doesn't have any free elems.
>>> Since it's an lru it will force free an elem eventually.
>>>
>>
>> Maybe I missed something, the map_update_elem function of LRU map is
>> htab_lru_map_update_elem() and the htab_map_update_elem() above is the
>> map_update_elem function of hash map.
>> Because of the implementation of percpu freelist used in hash map, it
>> will spin_lock all other CPUs when there is no free elements.
>
> Ahh. Right. Then what's the point of optimizing the error case
> at the expense of the fast path?
>

Yes, this patch only optimizes the very bad case where no free elements are left,
and it adds an atomic operation to the fast path. Maybe the better workaround is to
avoid filling the hash map completely in our case, or to use an LRU map?
But the spinlock contention problem also exists even when the map is not full:
if some CPUs run out of their local freelists while other CPUs' lists are seldom used,
the busy CPUs have to grab those CPUs' spinlocks to get a free element.
Should we change the current percpu freelist implementation to a per-CPU lockless freelist?
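
For illustration only, a rough sketch of what a per-CPU freelist built on llist.h could
look like (hypothetical names, not existing kernel code). Note the catch: llist_del_first()
only tolerates a single consumer per list, so taking elements from another CPU's list
would still need a lock or llist_del_all() plus re-insertion.

#include <linux/llist.h>
#include <linux/percpu.h>

struct lockless_freelist {
	struct llist_head __percpu *lists;	/* one lock-free stack per CPU */
};

static void lockless_freelist_push(struct lockless_freelist *s,
				   struct llist_node *node)
{
	/* Lock-free push onto the local CPU's list. */
	llist_add(node, this_cpu_ptr(s->lists));
}

static struct llist_node *lockless_freelist_pop(struct lockless_freelist *s)
{
	/*
	 * Lock-free pop from the local CPU's list; assumes the caller runs
	 * with migration disabled and is the only consumer of that list.
	 * Returns NULL when the local list is empty -- the hard part,
	 * stealing from other CPUs, is deliberately left out here.
	 */
	return llist_del_first(this_cpu_ptr(s->lists));
}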

Thanks.

2021-10-19 04:46:44

by Alexei Starovoitov

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On Mon, Oct 18, 2021 at 9:31 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2021/10/19 11:45 AM, Alexei Starovoitov wrote:
> > On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
> >>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
> >>> <[email protected]> wrote:
> >>>>
> >>>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
> >>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
> >>>>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
> >>>>>>
> >>>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
> >>>>>> spin_lock for all CPUs to find there is no more elem at last.
> >>>>>>
> >>>>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
> >>>>>> would last for 1ms. This patch use count for prealloc hashtab too,
> >>>>>> avoid traverse and spin_lock for all CPUs in this case.
> >>>>>>
> >>>>>> Signed-off-by: Chengming Zhou <[email protected]>
> >>>>>
> >>>>> It's not clear from the commit log what you're solving.
> >>>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
> >>>>> That's why it's not used.
> >>>>>
> >>>> Thanks for the explanation, what I'm solving is when hash table hasn't free
> >>>> elements, we don't need to call __pcpu_freelist_pop() to traverse and
> >>>> spin_lock all CPUs. The ftrace output of this bad case is below:
> >>>>
> >>>> 50) | htab_map_update_elem() {
> >>>> 50) 0.329 us | _raw_spin_lock_irqsave();
> >>>> 50) 0.063 us | lookup_elem_raw();
> >>>> 50) | alloc_htab_elem() {
> >>>> 50) | pcpu_freelist_pop() {
> >>>> 50) 0.209 us | _raw_spin_lock();
> >>>> 50) 0.264 us | _raw_spin_lock();
> >>>
> >>> This is LRU map. Not hash map.
> >>> It will grab spin_locks of other cpus
> >>> only if all previous cpus don't have free elements.
> >>> Most likely your map is actually full and doesn't have any free elems.
> >>> Since it's an lru it will force free an elem eventually.
> >>>
> >>
> >> Maybe I missed something, the map_update_elem function of LRU map is
> >> htab_lru_map_update_elem() and the htab_map_update_elem() above is the
> >> map_update_elem function of hash map.
> >> Because of the implementation of percpu freelist used in hash map, it
> >> will spin_lock all other CPUs when there is no free elements.
> >
> > Ahh. Right. Then what's the point of optimizing the error case
> > at the expense of the fast path?
> >
>
> Yes, this patch only optimized the very bad case that no free elements left,
> and add atomic operation in the fast path. Maybe the better workaround is not
> allowing full hash map in our case or using LRU map?

No idea, since you haven't explained your use case.

> But the problem of spinlock contention also exist even when the map is not full,
> like some CPUs run out of its freelist but other CPUs seldom used, then have to
> grab those CPUs' spinlock to get free element.

In theory that would be correct. Do you see it in practice?
Please describe the use case.

> Should we change the current implementation of percpu freelist to percpu lockless freelist?

Like llist.h? That was tried already, and for typical hash map usage
it's slower than the percpu freelist.
Many progs still do a lot of hash map update/delete on all cpus at once.
That is the use case the hashmap is optimized for.
Please see commit 6c9059817432 ("bpf: pre-allocate hash map elements"),
which also lists the different alternative algorithms that were benchmarked.
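
As a hypothetical illustration of that use case (the program logic, section name, and
map sizing below are made up), think of a tracing prog like this running on every CPU
and hitting the map at high rate:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);	/* preallocated by default */
	__uint(max_entries, 10240);
	__type(key, __u32);
	__type(value, __u64);
} counts SEC(".maps");

SEC("tp/sched/sched_switch")
int on_switch(void *ctx)
{
	__u32 key = bpf_get_smp_processor_id();
	__u64 one = 1, *val;

	/* Lookup/update from every CPU concurrently. */
	val = bpf_map_lookup_elem(&counts, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";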

2021-10-19 05:12:37

by Chengming Zhou

Subject: Re: [External] Re: [PATCH] bpf: use count for prealloc hashtab too

On 2021/10/19 12:43 PM, Alexei Starovoitov wrote:
> On Mon, Oct 18, 2021 at 9:31 PM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2021/10/19 11:45 AM, Alexei Starovoitov wrote:
>>> On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
>>> <[email protected]> wrote:
>>>>
>>>> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
>>>>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
>>>>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> We only use count for kmalloc hashtab not for prealloc hashtab, because
>>>>>>>> __pcpu_freelist_pop() return NULL when no more elem in pcpu freelist.
>>>>>>>>
>>>>>>>> But the problem is that __pcpu_freelist_pop() will traverse all CPUs and
>>>>>>>> spin_lock for all CPUs to find there is no more elem at last.
>>>>>>>>
>>>>>>>> We encountered bad case on big system with 96 CPUs that alloc_htab_elem()
>>>>>>>> would last for 1ms. This patch use count for prealloc hashtab too,
>>>>>>>> avoid traverse and spin_lock for all CPUs in this case.
>>>>>>>>
>>>>>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>>>>>
>>>>>>> It's not clear from the commit log what you're solving.
>>>>>>> The atomic inc/dec in critical path of prealloc maps hurts performance.
>>>>>>> That's why it's not used.
>>>>>>>
>>>>>> Thanks for the explanation, what I'm solving is when hash table hasn't free
>>>>>> elements, we don't need to call __pcpu_freelist_pop() to traverse and
>>>>>> spin_lock all CPUs. The ftrace output of this bad case is below:
>>>>>>
>>>>>> 50) | htab_map_update_elem() {
>>>>>> 50) 0.329 us | _raw_spin_lock_irqsave();
>>>>>> 50) 0.063 us | lookup_elem_raw();
>>>>>> 50) | alloc_htab_elem() {
>>>>>> 50) | pcpu_freelist_pop() {
>>>>>> 50) 0.209 us | _raw_spin_lock();
>>>>>> 50) 0.264 us | _raw_spin_lock();
>>>>>
>>>>> This is LRU map. Not hash map.
>>>>> It will grab spin_locks of other cpus
>>>>> only if all previous cpus don't have free elements.
>>>>> Most likely your map is actually full and doesn't have any free elems.
>>>>> Since it's an lru it will force free an elem eventually.
>>>>>
>>>>
>>>> Maybe I missed something, the map_update_elem function of LRU map is
>>>> htab_lru_map_update_elem() and the htab_map_update_elem() above is the
>>>> map_update_elem function of hash map.
>>>> Because of the implementation of percpu freelist used in hash map, it
>>>> will spin_lock all other CPUs when there is no free elements.
>>>
>>> Ahh. Right. Then what's the point of optimizing the error case
>>> at the expense of the fast path?
>>>
>>
>> Yes, this patch only optimized the very bad case that no free elements left,
>> and add atomic operation in the fast path. Maybe the better workaround is not
>> allowing full hash map in our case or using LRU map?
>
> No idea, since you haven't explained your use case.
>
>> But the problem of spinlock contention also exist even when the map is not full,
>> like some CPUs run out of its freelist but other CPUs seldom used, then have to
>> grab those CPUs' spinlock to get free element.
>
> In theory that would be correct. Do you see it in practice?
> Please describe the use case.
>
>> Should we change the current implementation of percpu freelist to percpu lockless freelist?
>
> Like llist.h ? That was tried already and for typical hash map usage
> it's slower than percpu free list.
> Many progs still do a lot of hash map update/delete on all cpus at once.
> That is the use case hashmap optimized for.
> Please see commit 6c9059817432 ("bpf: pre-allocate hash map elements")
> that also lists different alternative algorithms that were benchmarked.
>

OK, I will describe our use case in more detail, try these alternatives, and collect some data first.
Thanks for your explanation.