2019-12-20 09:36:29

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness

On Fri, 20 Dec 2019 09:25:43 +0100
Björn Töpel <[email protected]> wrote:

> I've been doing some benchmarking with AF_XDP, and more specific the
> bpf_xdp_redirect_map() helper and xdp_do_redirect(). One thing that
> puzzles me is that the percpu-variable accesses stands out.
>
> I did a horrible hack that just accesses a regular global variable,
> instead of the percpu struct bpf_redirect_info, and got a performance
> boost from 22.7 Mpps to 23.8 Mpps with the rxdrop scenario from
> xdpsock.

Yes, that is a 2 ns overhead, which is annoying in an XDP context.
(1/22.7 - 1/23.8) * 1000 ≈ 2 ns

> Have anyone else seen this?

Yes, I see it all the time...

> So, my question to the uarch/percpu folks out there: Why are percpu
> accesses (%gs segment register) more expensive than regular global
> variables in this scenario.

I'm also VERY interested in knowing the answer to the above question!
(Adding LKML to reach more people)


> One way around that is changing BPF_PROG_RUN, and BPF_CALL_x to pass a
> context (struct bpf_redirect_info) explicitly, and access that instead
> of doing percpu access. That would be a pretty churny patch, and
> before doing that it would be nice to understand why percpu stands out
> performance-wise.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer


2019-12-20 15:13:45

by Tejun Heo

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness

On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > So, my question to the uarch/percpu folks out there: Why are percpu
> > accesses (%gs segment register) more expensive than regular global
> > variables in this scenario.
>
> I'm also VERY interested in knowing the answer to above question!?
> (Adding LKML to reach more people)

No idea. One difference is that percpu accesses go through the vmap
area, which is mapped using 4k pages, while a global variable would be
accessed through the default linear mapping. Maybe you're getting hit
by TLB pressure?

Thanks.

--
tejun

2019-12-20 15:36:51

by Christopher Lameter

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness

On Fri, 20 Dec 2019, Tejun Heo wrote:

> On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > > So, my question to the uarch/percpu folks out there: Why are percpu
> > > accesses (%gs segment register) more expensive than regular global
> > > variables in this scenario.
> >
> > I'm also VERY interested in knowing the answer to above question!?
> > (Adding LKML to reach more people)
>
> No idea. One difference is that percpu accesses are through vmap area
> which is mapped using 4k pages while global variable would be accessed
> through the fault linear mapping. Maybe you're getting hit by tlb
> pressure?

And there are some accesses from remote processors to the per-CPU areas
of other CPUs. If those land in the same cacheline, they will cause
additional latencies.

2019-12-20 16:24:19

by Eric Dumazet

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness



On 12/20/19 7:12 AM, Tejun Heo wrote:
> On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
>>> So, my question to the uarch/percpu folks out there: Why are percpu
>>> accesses (%gs segment register) more expensive than regular global
>>> variables in this scenario.
>>
>> I'm also VERY interested in knowing the answer to above question!?
>> (Adding LKML to reach more people)
>
> No idea. One difference is that percpu accesses are through vmap area
> which is mapped using 4k pages while global variable would be accessed
> through the fault linear mapping. Maybe you're getting hit by tlb
> pressure?

I have definitely seen expensive per-cpu updates in the stack.
(SNMP counters, or per-cpu stats for packet/byte counters)

It might be nice to have an option to use 2M pages.

(I recall sending some patches in the past about using high-order pages for vmalloc,
but this went nowhere)

2019-12-20 16:35:39

by Tejun Heo

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness

On Fri, Dec 20, 2019 at 08:22:02AM -0800, Eric Dumazet wrote:
> I definitely seen expensive per-cpu updates in the stack.
> (SNMP counters, or per-cpu stats for packets/bytes counters)
>
> It might be nice to have an option to use 2M pages.
>
> (I recall sending some patches in the past about using high-order pages for vmalloc,
> but this went nowhere)

Yeah, the percpu allocator implementation is half-way prepared for
that. There just hasn't been a real need for it yet. If this
actually is a difference coming from TLB pressure, this might be it, I
guess?

Thanks.

--
tejun

2019-12-20 17:11:46

by Dennis Zhou

[permalink] [raw]
Subject: Re: Percpu variables, benchmarking, and performance weirdness

On Fri, Dec 20, 2019 at 03:36:51PM +0000, Christopher Lameter wrote:
> On Fri, 20 Dec 2019, Tejun Heo wrote:
>
> > On Fri, Dec 20, 2019 at 10:34:20AM +0100, Jesper Dangaard Brouer wrote:
> > > > So, my question to the uarch/percpu folks out there: Why are percpu
> > > > accesses (%gs segment register) more expensive than regular global
> > > > variables in this scenario.
> > >
> > > I'm also VERY interested in knowing the answer to above question!?
> > > (Adding LKML to reach more people)
> >
> > No idea. One difference is that percpu accesses are through vmap area
> > which is mapped using 4k pages while global variable would be accessed
> > through the fault linear mapping. Maybe you're getting hit by tlb
> > pressure?

bpf_redirect_info is static, so it should be accessed via the linear
mapping as well if we're embedding the first chunk.

>
> And there are some accesses from remote processors to per cpu ares of
> other cpus. If those are in the same cacheline then those will cause
> additional latencies.
>

I guess we could pad out certain structs like bpf_redirect_info, but
that isn't really ideal.