On 5/7/2022 3:11 PM, [email protected] wrote:
> On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
>> On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
>>> On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
>>>> On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
>>>>> On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
>>>>>> Hi Mel,
>>>>>>
>>>>>> On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
>>>>>>>
>>>>>>> (please note that we reported
>>>>>>> "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
>>>>>>> on
>>>>>>> https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
>>>>>>> while the commit was on a branch.
>>>>>>> Now we still observe a similar regression when it is on mainline, and we also
>>>>>>> observe a 13.2% improvement on another netperf subtest,
>>>>>>> so we report again for information.)
>>>>>>>
>>>>>>> Greeting,
>>>>>>>
>>>>>>> FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
>>>>>>>
>>>>>>>
>>>>>>> commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>
>>>>>>
>>>>>> So what this commit did is: if a CPU is always doing free (pcp->free_factor > 0)
>>>>>
>>>>> IMHO, this means the consumer and producer are running on different
>>>>> CPUs.
>>>>>
>>>>
>>>> Right.
>>>>
>>>>>> and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
>>>>>> then do not use the PCP but free the page directly to buddy.
>>>>>>
>>>>>> The rationale as explained in the commit's changelog is:
>>>>>> "
>>>>>> Netperf running on localhost exhibits this pattern and while it does not
>>>>>> matter for some machines, it does matter for others with smaller caches
>>>>>> where cache misses cause problems due to reduced page reuse. Pages
>>>>>> freed directly to the buddy list may be reused quickly while still cache
>>>>>> hot where as storing on the PCP lists may be cold by the time
>>>>>> free_pcppages_bulk() is called.
>>>>>> "
>>>>>>
>>>>>> This regression occurred on a machine that has large caches, so this
>>>>>> optimization brings it no value, only overhead (the skipped PCP); I
>>>>>> guess this is the reason why there is a regression.
>>>>>
>>>>> Per my understanding, not only is the cache size larger, but the L2
>>>>> cache (1MB) is also per-core on this machine. So if the consumer and
>>>>> producer are running on different cores, the cache-hot page may cause
>>>>> more core-to-core cache transfer. This may hurt performance too.
>>>>>
>>>>
>>>> Client side allocates skb(page) and server side recvfrom() it.
>>>> recvfrom() copies the page data to server's own buffer and then releases
>>>> the page associated with the skb. Client does all the allocation and
>>>> server does all the free, page reuse happens at client side.
>>>> So I think core-2-core cache transfer due to page reuse can occur when
>>>> client task migrates.
>>>
>>> The core-to-core cache transferring can be cross-socket or cross-L2 within
>>> one socket. I mean the latter.
>>>
>>>> I have modified the job to have the client and server bound to a
>>>> specific CPU of different cores on the same node, and testing it on the
>>>> same Icelake 2 sockets server, the result is
>>>>
>>>> kernel throughput
>>>> 8b10b465d0e1 125168
>>>> f26b3fa04611 102039 -18%
>>>>
>>>> It's also an 18% drop. I think this means c2c is not a factor?
>>>
>>> Can you test with client and server bound to 2 hardware threads
>>> (hyperthread) of one core? The two hardware threads of one core will
>>> share the L2 cache.
>>>
>>
>> 8b10b465d0e1: 89702
>> f26b3fa04611: 95823 +6.8%
>>
>> When binding client and server on the 2 threads of the same core, the
>> bisected commit is an improvement now on this 2 sockets Icelake server.
>
> Good. I guess cache-hot works now.
>
Yes, it can't be more hot now :-)
>>>>>> I have also tested this case on a small machine: a skylake desktop and
>>>>>> this commit shows improvement:
>>>>>> 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
>>>>>> f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
>>>>>>
>>>>>> So this means those directly freed pages get reused by allocator side
>>>>>> and that brings performance improvement for machines with smaller cache.
>>>>>
>>>>> Per my understanding, the L2 cache on this desktop machine is shared
>>>>> among cores.
>>>>>
>>>>
>>>> The said CPU is i7-6700 and according to this wikipedia page,
>>>> L2 is per core:
>>>> https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
>>>
>>> Sorry, my memory was wrong. The Skylake and later servers have a much
>>> larger private L2 cache (1MB vs 256KB on the client); this may increase the
>>> possibility of core-2-core transferring.
>>>
>>
>> I'm trying to understand where the core-2-core cache transfer happens.
>>
>> When the server needs to do the copy in recvfrom(), there is core-2-core
>> cache transfer from the client cpu to the server cpu. But this is the same no
>> matter whether the page gets reused or not, i.e. the bisected commit and its
>> parent don't differ in this step.
>
> Yes.
>
>> Then when the page gets reused on
>> the client side, there is no core-2-core cache transfer as the server
>> side didn't write to the page's data.
>
> The "reused" pages were read by the server side, so their cache lines
> are in "shared" state, some inter-core traffic is needed to shoot down
> these cache lines before the client side writes them. This will incur
> some overhead.
>
I thought the overhead of changing the cache line from "shared" to
"own"/"modify" is pretty cheap.
Also, this is the same case as the Skylake desktop machine, why it is a
gain there but a loss here? Is it that this "overhead" is much greater
in server machine to the extent that it is even better to use a totally
cold page than a hot one? If so, it seems to suggest we should avoid
cache reuse in server machine unless the two CPUs happens to be two
hyperthreads of the same core.
Thanks,
Aaron
>> So page reuse or not, it
>> shouldn't cause any difference regarding core-2-core cache transfer.
>> Is this correct?
>>
>>>>>> I wonder if we should still use the PCP a little bit under the condition
>>>>>> described above, for the purpose of:
>>>>>> 1) reducing overhead in the free path for machines with large caches;
>>>>>> 2) still keeping the benefit of reused pages for machines with smaller caches.
>>>>>>
>>>>>> For this reason, I tested increasing nr_pcp_high() from returning 0 to
>>>>>> either returning pcp->batch or (pcp->batch << 2):
>>>>>> machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
>>>>>> skylake desktop: 72288 90784 92219 91528
>>>>>> icelake 2sockets: 120956 99177 98251 116108
>>>>>>
>>>>>> note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
>>>>>> parent; returning 0 is the behaviour of this commit.
>>>>>>
>>>>>> The result shows that if we effectively use a PCP high of (pcp->batch << 2)
>>>>>> for the described condition, then this workload's performance on the
>>>>>> small machine can be kept while the regression on large machines can be
>>>>>> greatly reduced (from -18% to -4%).
>>>>>>
>>>>>
>>>>> Can we use cache size and topology information directly?
>>>>
>>>> It can be complicated by the fact that the system can have multiple
>>>> producers (cpus that are doing free) running at the same time, and getting
>>>> the perfect number can be a difficult job.
>>>
>>> We can discuss this after verifying whether it's core-2-core transferring
>>> related.
>>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>
>
>
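
For reference, below is a minimal sketch of the nr_pcp_high() tweak described
in the quoted discussion above. It is an illustrative reconstruction, not the
exact patch that was benchmarked: the free_high parameter and the surrounding
structure are assumed to follow the bisected commit f26b3fa04611, and the
(pcp->batch << 2) return value is the variant reported above as reducing the
regression from -18% to -4% while keeping the gain on the small machine.

/*
 * Illustrative sketch only: how many pages may stay on the per-cpu
 * list before the remainder is bulk-freed to buddy.  free_high is the
 * "this CPU is mostly freeing high-order pages" condition added by the
 * bisected commit, which made this function return 0 (bypass the PCP
 * completely) in that case.
 */
static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
		       bool free_high)
{
	int high = READ_ONCE(pcp->high);

	if (unlikely(!high))
		return 0;

	/*
	 * Variant measured above: instead of returning 0, keep a few
	 * batches on the PCP so small-cache machines still reuse
	 * cache-hot pages while large-cache machines avoid most of the
	 * overhead of freeing every page straight to buddy.
	 */
	if (free_high)
		return READ_ONCE(pcp->batch) << 2;

	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
		return high;

	/* If reclaim is active, limit the pages kept on the PCP lists. */
	return min(READ_ONCE(pcp->batch) << 2, high);
}
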
On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> On 5/7/2022 3:11 PM, [email protected] wrote:
> > On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
> > > On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
> > > > On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> > > > > On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> > > > > > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > > > > > Hi Mel,
> > > > > > >
> > > > > > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > > > > >
> > > > > > > > (please note that we reported
> > > > > > > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > > > > > > on
> > > > > > > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > > > > > > while the commit was on a branch.
> > > > > > > > Now we still observe a similar regression when it is on mainline, and we also
> > > > > > > > observe a 13.2% improvement on another netperf subtest,
> > > > > > > > so we report again for information.)
> > > > > > > >
> > > > > > > > Greeting,
> > > > > > > >
> > > > > > > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > > > > > > >
> > > > > > > >
> > > > > > > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > > > >
> > > > > > >
> > > > > > > So what this commit did is: if a CPU is always doing free (pcp->free_factor > 0)
> > > > > >
> > > > > > IMHO, this means the consumer and producer are running on different
> > > > > > CPUs.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > > > > > then do not use the PCP but free the page directly to buddy.
> > > > > > >
> > > > > > > The rationale as explained in the commit's changelog is:
> > > > > > > "
> > > > > > > Netperf running on localhost exhibits this pattern and while it does not
> > > > > > > matter for some machines, it does matter for others with smaller caches
> > > > > > > where cache misses cause problems due to reduced page reuse. Pages
> > > > > > > freed directly to the buddy list may be reused quickly while still cache
> > > > > > > hot where as storing on the PCP lists may be cold by the time
> > > > > > > free_pcppages_bulk() is called.
> > > > > > > "
> > > > > > >
> > > > > > > This regression occurred on a machine that has large caches, so this
> > > > > > > optimization brings it no value, only overhead (the skipped PCP); I
> > > > > > > guess this is the reason why there is a regression.
> > > > > >
> > > > > > Per my understanding, not only is the cache size larger, but the L2
> > > > > > cache (1MB) is also per-core on this machine. So if the consumer and
> > > > > > producer are running on different cores, the cache-hot page may cause
> > > > > > more core-to-core cache transfer. This may hurt performance too.
> > > > > >
> > > > >
> > > > > Client side allocates skb(page) and server side recvfrom() it.
> > > > > recvfrom() copies the page data to server's own buffer and then releases
> > > > > the page associated with the skb. Client does all the allocation and
> > > > > server does all the free, page reuse happens at client side.
> > > > > So I think core-2-core cache transfer due to page reuse can occur when
> > > > > client task migrates.
> > > >
> > > > The core-to-core cache transferring can be cross-socket or cross-L2 within
> > > > one socket. I mean the latter.
> > > >
> > > > > I have modified the job to have the client and server bound to a
> > > > > specific CPU of different cores on the same node, and testing it on the
> > > > > same Icelake 2 sockets server, the result is
> > > > >
> > > > > kernel throughput
> > > > > 8b10b465d0e1 125168
> > > > > f26b3fa04611 102039 -18%
> > > > >
> > > > > It's also an 18% drop. I think this means c2c is not a factor?
> > > >
> > > > Can you test with client and server bound to 2 hardware threads
> > > > (hyperthread) of one core? The two hardware threads of one core will
> > > > share the L2 cache.
> > > >
> > >
> > > 8b10b465d0e1: 89702
> > > f26b3fa04611: 95823 +6.8%
> > >
> > > When binding client and server on the 2 threads of the same core, the
> > > bisected commit is an improvement now on this 2 sockets Icelake server.
> >
> > Good. I guess cache-hot works now.
> >
>
> Yes, it can't be more hot now :-)
>
> > > > > > > I have also tested this case on a small machine: a skylake desktop and
> > > > > > > this commit shows improvement:
> > > > > > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > > > > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > > > > >
> > > > > > > So this means those directly freed pages get reused by allocator side
> > > > > > > and that brings performance improvement for machines with smaller cache.
> > > > > >
> > > > > > Per my understanding, the L2 cache on this desktop machine is shared
> > > > > > among cores.
> > > > > >
> > > > >
> > > > > The said CPU is i7-6700 and according to this wikipedia page,
> > > > > L2 is per core:
> > > > > https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
> > > >
> > > > Sorry, my memory was wrong. The Skylake and later servers have a much
> > > > larger private L2 cache (1MB vs 256KB on the client); this may increase the
> > > > possibility of core-2-core transferring.
> > > >
> > >
> > > I'm trying to understand where the core-2-core cache transfer happens.
> > >
> > > When the server needs to do the copy in recvfrom(), there is core-2-core
> > > cache transfer from the client cpu to the server cpu. But this is the same no
> > > matter whether the page gets reused or not, i.e. the bisected commit and its
> > > parent don't differ in this step.
> >
> > Yes.
> >
> > > Then when the page gets reused on
> > > the client side, there is no core-2-core cache transfer as the server
> > > side didn't write to the page's data.
> >
> > The "reused" pages were read by the server side, so their cache lines
> > are in "shared" state, some inter-core traffic is needed to shoot down
> > these cache lines before the client side writes them. This will incur
> > some overhead.
> >
>
> I thought the overhead of changing the cache line from "shared" to
> "own"/"modify" is pretty cheap.
This is the read/write pattern of cache ping-pong. Although it should
be cheaper than the write/write pattern of cache ping-pong in theory, we
have seen serious regressions from that before.
> Also, this is the same case as the Skylake desktop machine, why it is a
> gain there but a loss here?
I guess the reason is the private cache size. The size of the private
L2 cache of SKL server is much larger than that of SKL client (1MB vs.
256KB). So there's much more core-2-core traffic on SKL server.
> Is it that this "overhead" is much greater
> in server machine to the extent that it is even better to use a totally
> cold page than a hot one?
Yes. And I think the private cache size matters here. And after being
evicted from the private cache (L1/L2), the cache lines of the reused
pages will go to the shared cache (L3), which will help performance.
> If so, it seems to suggest we should avoid
> cache reuse in server machine unless the two CPUs happens to be two
> hyperthreads of the same core.
Yes. I think so.
Best Regards,
Huang, Ying
> > > So page reuse or not, it
> > > shouldn't cause any difference regarding core-2-core cache transfer.
> > > Is this correct?
> > >
> > > > > > > I wonder if we should still use the PCP a little bit under the condition
> > > > > > > described above, for the purpose of:
> > > > > > > 1) reducing overhead in the free path for machines with large caches;
> > > > > > > 2) still keeping the benefit of reused pages for machines with smaller caches.
> > > > > > >
> > > > > > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > > > > > either returning pcp->batch or (pcp->batch << 2):
> > > > > > > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > > > > > > skylake desktop: 72288 90784 92219 91528
> > > > > > > icelake 2sockets: 120956 99177 98251 116108
> > > > > > >
> > > > > > > note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
> > > > > > > parent; returning 0 is the behaviour of this commit.
> > > > > > >
> > > > > > > The result shows that if we effectively use a PCP high of (pcp->batch << 2)
> > > > > > > for the described condition, then this workload's performance on the
> > > > > > > small machine can be kept while the regression on large machines can be
> > > > > > > greatly reduced (from -18% to -4%).
> > > > > > >
> > > > > >
> > > > > > Can we use cache size and topology information directly?
> > > > >
> > > > > It can be complicated by the fact that the system can have multiple
> > > > > producers (cpus that are doing free) running at the same time, and getting
> > > > > the perfect number can be a difficult job.
> > > >
> > > > We can discuss this after verifying whether it's core-2-core transferring
> > > > related.
> > > >
> > > > Best Regards,
> > > > Huang, Ying
> > > >
> > > >
> >
> >
On 5/7/2022 3:44 PM, [email protected] wrote:
> On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
... ...
>>
>> I thought the overhead of changing the cache line from "shared" to
>> "own"/"modify" is pretty cheap.
>
> This is the read/write pattern of cache ping-pong. Although it should
> be cheaper than the write/write pattern of cache ping-pong in theory, we
> have seen serious regressions from that before.
>
Can you point me to the regression report? I would like to take a look,
thanks.
>> Also, this is the same case as the Skylake desktop machine, why it is a
>> gain there but a loss here?
>
> I guess the reason is the private cache size. The size of the private
> L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> 256KB). So there's much more core-2-core traffic on SKL server.
>
It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
pages and that means the allocator side may have a higher chance of
reusing a page that is evicted from the free cpu's L2 cache than the
server machine, whose L2 can store 40 order-3 pages.
I can do more tests using different high for the two machines:
1) high=0, this is the case when page reuse is the extreme. core-2-core
transfer should be the most. This is the behavior of this bisected commit.
2) high=L2_size, this is the case when page reuse is fewer compared to
the above case, core-2-core should still be the majority.
3) high=2 times of L2_size and smaller than llc size, this is the case
when cache reuse is further reduced, and when the page is indeed reused,
it shouldn't cause core-2-core transfer but can benefit from llc.
4) high>llc_size, this is the case when page reuse is the least and when
page is indeed reused, it is likely not in the entire cache hierarchy.
This is the behavior of this bisected commit's parent commit for the
Skylake desktop machine.
I expect case 3) should give us the best performance and 1) or 4) is the
worst for this testcase.
case 4) is difficult to test on the server machine due to the cap on
pcp->high, which is affected by the low watermark of the zone. The server
machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
while the llc size is 48MiB, which translates to 12288 pages.
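
For reference, the page counts used in this discussion follow directly from
the cache sizes (a worked sketch, assuming 4KiB base pages so that an order-3
page is 8 * 4KiB = 32KiB; the 1.25MiB L2 and 48MiB LLC figures for the
Icelake server are the ones used later in this thread):

  order-3 page:        8 * 4KiB        = 32KiB
  Skylake desktop L2:  256KiB / 32KiB  = 8 order-3 pages
  Icelake server L2:   1.25MiB / 32KiB = 40 order-3 pages
  Icelake server LLC:  48MiB / 4KiB    = 12288 base pages
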
>> Is it that this "overhead" is much greater
>> in server machine to the extent that it is even better to use a totally
>> cold page than a hot one?
>
> Yes. And I think the private cache size matters here. And after being
> evicted from the private cache (L1/L2), the cache lines of the reused
> pages will go to the shared cache (L3), which will help performance.
>
Sounds reasonable.
>> If so, it seems to suggest we should avoid
>> cache reuse in server machine unless the two CPUs happens to be two
>> hyperthreads of the same core.
>
> Yes. I think so.
On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> On 5/7/2022 3:44 PM, [email protected] wrote:
> > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
>
> ... ...
>
> > >
> > > I thought the overhead of changing the cache line from "shared" to
> > > "own"/"modify" is pretty cheap.
> >
> > This is the read/write pattern of cache ping-pong. Although it should
> > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > have seen serious regressions from that before.
> >
>
> Can you point me to the regression report? I would like to take a look,
> thanks.
Sure.
https://lore.kernel.org/all/[email protected]/
> > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > gain there but a loss here?
> >
> > I guess the reason is the private cache size. The size of the private
> > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > 256KB). So there's much more core-2-core traffic on SKL server.
> >
>
> It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> pages and that means the allocator side may have a higher chance of
> reusing a page that is evicted from the free cpu's L2 cache than the
> server machine, whose L2 can store 40 order-3 pages.
>
> I can do more tests using different high for the two machines:
> 1) high=0, this is the case when page reuse is the extreme. core-2-core
> transfer should be the most. This is the behavior of this bisected commit.
> 2) high=L2_size, this is the case when page reuse is fewer compared to
> the above case, core-2-core should still be the majority.
> 3) high=2 times of L2_size and smaller than llc size, this is the case
> when cache reuse is further reduced, and when the page is indeed reused,
> it shouldn't cause core-2-core transfer but can benefit from llc.
> 4) high>llc_size, this is the case when page reuse is the least and when
> page is indeed reused, it is likely not in the entire cache hierarchy.
> This is the behavior of this bisected commit's parent commit for the
> Skylake desktop machine.
>
> I expect case 3) should give us the best performance and 1) or 4) is the
> worst for this testcase.
>
> case 4) is difficult to test on the server machine due to the cap on
> pcp->high, which is affected by the low watermark of the zone. The server
> machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> while the llc size is 48MiB, which translates to 12288 pages.
> >
Sounds good to me.
Best Regards,
Huang, Ying
> > > Is it that this "overhead" is much greater
> > > in server machine to the extent that it is even better to use a totally
> > > cold page than a hot one?
> >
> > Yes. And I think the private cache size matters here. And after being
> > evicted from the private cache (L1/L2), the cache lines of the reused
> > pages will go to the shared cache (L3), which will help performance.
> >
>
> Sounds reasonable.
>
> > > If so, it seems to suggest we should avoid
> > > cache reuse in server machine unless the two CPUs happens to be two
> > > hyperthreads of the same core.
> >
> > Yes. I think so.
[ Adding locking people in case they have any input ]
On Mon, May 9, 2022 at 11:23 PM [email protected]
<[email protected]> wrote:
>
> >
> > Can you point me to the regression report? I would like to take a look,
> > thanks.
>
> https://lore.kernel.org/all/[email protected]/
Hmm.
That explanation looks believable, except that our qspinlocks
shouldn't be spinning on the lock itself, but spinning on the mcs node
it inserts into the lock.
Or so I believed before I looked closer at the code again (it's been years).
It turns out we spin on the lock itself if we're the "head waiter". So
somebody is always spinning.
That's a bit unfortunate for this workload, I guess.
I think from a pure lock standpoint, it's the right thing to do (no
unnecessary bouncing, with the lock releaser doing just one write, and
the head waiter spinning on it is doing the right thing).
But I think this is an example of where you end up having that
spinning on the lock possibly then being a disturbance on the other
fields around the lock.
I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
that "spin on the lock itself" is just fundamentally the only correct
thing, but since my initial reaction was "no, we're spinning on the
mcs node", maybe that would be _possible_?
We do have a lot of those spinlocks embedded in other data structures
cases. And if "somebody else is waiting for the lock" contends badly
with "the lock holder is doing a lot of writes close to the lock",
then that's not great.
Linus
On 5/10/22 14:05, Linus Torvalds wrote:
> [ Adding locking people in case they have any input ]
>
> On Mon, May 9, 2022 at 11:23 PM [email protected]
> <[email protected]> wrote:
>>> Can you point me to the regression report? I would like to take a look,
>>> thanks.
>> https://lore.kernel.org/all/[email protected]/
> Hmm.
>
> That explanation looks believable, except that our qspinlocks
> shouldn't be spinning on the lock itself, but spinning on the mcs node
> it inserts into the lock.
>
> Or so I believed before I looked closer at the code again (it's been years).
>
> It turns out we spin on the lock itself if we're the "head waiter". So
> somebody is always spinning.
>
> That's a bit unfortunate for this workload, I guess.
>
> I think from a pure lock standpoint, it's the right thing to do (no
> unnecessary bouncing, with the lock releaser doing just one write, and
> the head waiter spinning on it is doing the right thing).
>
> But I think this is an example of where you end up having that
> spinning on the lock possibly then being a disturbance on the other
> fields around the lock.
>
> I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
> that "spin on the lock itself" is just fundamentally the only correct
> thing, but since my initial reaction was "no, we're spinning on the
> mcs node", maybe that would be _possible_?
>
> We do have a lot of those spinlocks embedded in other data structures
> cases. And if "somebody else is waiting for the lock" contends badly
> with "the lock holder is doing a lot of writes close to the lock",
> then that's not great.
Qspinlock still has one head waiter spinning on the lock. This is much
better than the original ticket spinlock where there will be n waiters
spinning on the lock. That is the cost of a cheap unlock. There is no
way to eliminate all lock spinning unless we use MCS lock directly which
will require a change in locking API as well as more expensive unlock.
Cheers,
Longman
On Tue, May 10, 2022 at 11:05:01AM -0700, Linus Torvalds wrote:
> I think from a pure lock standpoint, it's the right thing to do (no
> unnecessary bouncing, with the lock releaser doing just one write, and
> the head waiter spinning on it is doing the right thing).
>
> But I think this is an example of where you end up having that
> spinning on the lock possibly then being a disturbance on the other
> fields around the lock.
>
> I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
> that "spin on the lock itself" is just fundamentally the only correct
> thing, but since my initial reaction was "no, we're spinning on the
> mcs node", maybe that would be _possible_?
>
> We do have a lot of those spinlocks embedded in other data structures
> cases. And if "somebody else is waiting for the lock" contends badly
> with "the lock holder is doing a lot of writes close to the lock",
> then that's not great.
The immediate problem is that we don't always have a node. Notably we
only do the whole MCS queueing thing when there's more than 1 contender.
Always doing the MCS thing had a hefty performance penalty vs the
simpler spinlock implementations for the uncontended and lightly contended
lock cases (by far the most common scenario) due to the extra cache-miss
of getting an MCS node.
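
For context, the uncontended fast path Peter refers to is roughly the
following (a simplified sketch along the lines of include/asm-generic/qspinlock.h;
exact details vary by kernel version). The per-cpu MCS node is only touched
once this cmpxchg fails and the slowpath is entered, which is why
unconditionally queueing would add a cache miss to the common uncontended case.

/*
 * Sketch of the qspinlock fast path: a single cmpxchg on the lock
 * word.  Only when the lock is already held or contended does the
 * slowpath run, and only the slowpath uses a per-cpu MCS node.
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
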
On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
> [ Adding locking people in case they have any input ]
>
> On Mon, May 9, 2022 at 11:23 PM [email protected]
> <[email protected]> wrote:
> >
> > >
> > > Can you point me to the regression report? I would like to take a look,
> > > thanks.
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Hmm.
>
> That explanation looks believable, except that our qspinlocks
> shouldn't be spinning on the lock itself, but spinning on the mcs node
> it inserts into the lock.
The referenced regression report is very old (in Feb 2015 for 3.16-
3.17). The ticket spinlock was still used at that time. I believe that
things became much better after we switched to qspinlock. We can test that.
Best Regards,
Huang, Ying
On Tue, May 10, 2022 at 02:23:28PM +0800, [email protected] wrote:
> On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> > On 5/7/2022 3:44 PM, [email protected] wrote:
> > > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> >
> > ... ...
> >
> > > >
> > > > I thought the overhead of changing the cache line from "shared" to
> > > > "own"/"modify" is pretty cheap.
> > >
> > > This is the read/write pattern of cache ping-pong. Although it should
> > > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > > have seen serious regressions from that before.
> > >
> >
> > Can you point me to the regression report? I would like to take a look,
> > thanks.
>
> Sure.
>
> https://lore.kernel.org/all/[email protected]/
>
> > > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > > gain there but a loss here?
> > >
> > > I guess the reason is the private cache size. The size of the private
> > > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > > 256KB). So there's much more core-2-core traffic on SKL server.
> > >
> >
> > It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> > pages and that means the allocator side may have a higher chance of
> > reusing a page that is evicted from the free cpu's L2 cache than the
> > server machine, whose L2 can store 40 order-3 pages.
> >
> > I can do more tests using different high for the two machines:
> > 1) high=0, this is the case when page reuse is the extreme. core-2-core
> > transfer should be the most. This is the behavior of this bisected commit.
> > 2) high=L2_size, this is the case when page reuse is fewer compared to
> > the above case, core-2-core should still be the majority.
> > 3) high=2 times of L2_size and smaller than llc size, this is the case
> > when cache reuse is further reduced, and when the page is indeed reused,
> > it shouldn't cause core-2-core transfer but can benefit from llc.
> > 4) high>llc_size, this is the case when page reuse is the least and when
> > page is indeed reused, it is likely not in the entire cache hierarchy.
> > This is the behavior of this bisected commit's parent commit for the
> > Skylake desktop machine.
> >
> > I expect case 3) should give us the best performance and 1) or 4) is the
> > worst for this testcase.
> >
> > case 4) is difficult to test on the server machine due to the cap on
> > pcp->high, which is affected by the low watermark of the zone. The server
> > machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> > while the llc size is 48MiB, which translates to 12288 pages.
> > >
>
> Sounds good to me.
I've run the tests on a 2-socket Icelake server and a Skylake desktop.
On this 2-socket Icelake server (1.25MiB L2 = 320 pages, 48MiB LLC =
12288 pages):
pcp->high score
0 100662 (bypass PCP, most page reuse, most core-2-core transfer)
320 (L2) 117252
640 133149
6144 (1/2 llc) 134674
12416 (> llc) 103193 (least page reuse)
Setting pcp->high to 640 (2 times the L2 size) gives a very good result, only
slightly lower than 6144 (1/2 the llc size). Bypassing the PCP to get the most
cache reuse didn't deliver good performance, so I think Ying is right:
core-2-core transfer really hurts.
On this 4core/8cpu Skylake desktop(256KiB L2 = 64 pages, 8MiB LLC = 2048
pages):
0 86780 (bypass PCP, most page reuse, most core-2-core transfer)
64(L2) 85813
128 85521
1024(1/2 llc) 85557
2176(> llc) 74458 (least page reuse)
Things are different on this small machine. Bypassing the PCP gives the best
performance. I find it hard to explain this. Maybe the 256KiB L2 is so
small that even when bypassing the PCP, the page still ends up being evicted
from L2 by the time the allocator side reuses it? Or maybe core-2-core
transfer is fast on this small machine?
P.S. I've blindly set pcp->high to the above values, ignoring the zone's
low watermark cap for testing purposes.
On Tue, May 10, 2022 at 11:47 AM Waiman Long <[email protected]> wrote:
> Qspinlock still has one head waiter spinning on the lock. This is much
> better than the original ticket spinlock where there will be n waiters
> spinning on the lock.
Oh, absolutely. I'm not saying we should look at going back. I'm more
asking whether maybe we could go even further..
> That is the cost of a cheap unlock. There is no way to eliminate all
> lock spinning unless we use MCS lock directly which will require a
> change in locking API as well as more expensive unlock.
So there's no question that unlock would be more expensive for the
contention case, since it would always have to not only clear the lock
itself but also update the node it points to.
But does it actually require a change in the locking API?
The qspinlock slowpath already always allocates that mcs node (for
some definition of "always" - I am obviously ignoring all the trylock
cases both before and in the slowpath)
But yes, clearly the simple store-release of the current
queued_spin_unlock() wouldn't work as-is, and maybe the cost of
replacing it with something else is much more expensive than any
possible win.
I think the PV case already basically does that - replacing the
"store release" with a much more complex sequence. No?
Linus
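
For reference, the "simple store-release" unlock being discussed is
essentially the following (again a sketch along the lines of
include/asm-generic/qspinlock.h, native non-paravirt case; the paravirt
variant replaces this store with a call that may also kick a waiting vCPU,
which is the "much more complex sequence" mentioned above).

/*
 * Native qspinlock unlock: a single release store to the locked byte.
 * This is the cheap unlock that would have to grow extra work (also
 * updating the node the lock points to) if waiters only ever spun on
 * their own MCS nodes.
 */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
}
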
On 5/10/22 21:58, [email protected] wrote:
> On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
>> [ Adding locking people in case they have any input ]
>>
>> On Mon, May 9, 2022 at 11:23 PM [email protected]
>> <[email protected]> wrote:
>>>> Can you point me to the regression report? I would like to take a look,
>>>> thanks.
>>> https://lore.kernel.org/all/[email protected]/
>> Hmm.
>>
>> That explanation looks believable, except that our qspinlocks
>> shouldn't be spinning on the lock itself, but spinning on the mcs node
>> it inserts into the lock.
> The referenced regression report is very old (in Feb 2015 for 3.16-
> 3.17). The ticket spinlock was still used at that time. I believe that
> things became much better after we switched to qspinlock. We can test that.
Thanks for the info. Qspinlock was merged into mainline in 4.2, so the
ticket spinlock was used on all v3.* kernels. I was wondering why
qspinlock would have produced such a large performance regression with
just one head waiter spinning on the lock. So this is not such a big issue
after all.
Cheers,
Longman
On Wed, 2022-05-11 at 11:40 +0800, Aaron Lu wrote:
> On Tue, May 10, 2022 at 02:23:28PM +0800, [email protected] wrote:
> > On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> > > On 5/7/2022 3:44 PM, [email protected] wrote:
> > > > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> > >
> > > ... ...
> > >
> > > > >
> > > > > I thought the overhead of changing the cache line from "shared" to
> > > > > "own"/"modify" is pretty cheap.
> > > >
> > > > This is the read/write pattern of cache ping-pong. Although it should
> > > > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > > > have seen serious regressions from that before.
> > > >
> > >
> > > Can you point me to the regression report? I would like to take a look,
> > > thanks.
> >
> > Sure.
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > > > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > > > gain there but a loss here?
> > > >
> > > > I guess the reason is the private cache size. The size of the private
> > > > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > > > 256KB). So there's much more core-2-core traffic on SKL server.
> > > >
> > >
> > > It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> > > pages and that means the allocator side may have a higher chance of
> > > reusing a page that is evicted from the free cpu's L2 cache than the
> > > server machine, whose L2 can store 40 order-3 pages.
> > >
> > > I can do more tests using different high for the two machines:
> > > 1) high=0, this is the case when page reuse is the extreme. core-2-core
> > > transfer should be the most. This is the behavior of this bisected commit.
> > > 2) high=L2_size, this is the case when page reuse is fewer compared to
> > > the above case, core-2-core should still be the majority.
> > > 3) high=2 times of L2_size and smaller than llc size, this is the case
> > > when cache reuse is further reduced, and when the page is indeed reused,
> > > it shouldn't cause core-2-core transfer but can benefit from llc.
> > > 4) high>llc_size, this is the case when page reuse is the least and when
> > > page is indeed reused, it is likely not in the entire cache hierarchy.
> > > This is the behavior of this bisected commit's parent commit for the
> > > Skylake desktop machine.
> > >
> > > I expect case 3) should give us the best performance and 1) or 4) is the
> > > worst for this testcase.
> > >
> > > case 4) is difficult to test on the server machine due to the cap on
> > > pcp->high, which is affected by the low watermark of the zone. The server
> > > machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> > > while the llc size is 48MiB, which translates to 12288 pages.
> > > >
> >
> > Sounds good to me.
>
> I've run the tests on a 2-socket Icelake server and a Skylake desktop.
>
> On this 2-socket Icelake server (1.25MiB L2 = 320 pages, 48MiB LLC =
> 12288 pages):
>
> pcp->high score
> 0 100662 (bypass PCP, most page reuse, most core-2-core transfer)
> 320 (L2) 117252
> 640 133149
> 6144 (1/2 llc) 134674
> 12416 (> llc) 103193 (least page reuse)
>
> Setting pcp->high to 640 (2 times the L2 size) gives a very good result, only
> slightly lower than 6144 (1/2 the llc size). Bypassing the PCP to get the most
> cache reuse didn't deliver good performance, so I think Ying is right:
> core-2-core transfer really hurts.
>
> On this 4core/8cpu Skylake desktop(256KiB L2 = 64 pages, 8MiB LLC = 2048
> pages):
>
> 0 86780 (bypass PCP, most page reuse, most core-2-core transfer)
> 64(L2) 85813
> 128 85521
> 1024(1/2 llc) 85557
> 2176(> llc) 74458 (least page reuse)
>
> Things are different on this small machine. Bypassing the PCP gives the best
> performance. I find it hard to explain this. Maybe the 256KiB L2 is so
> small that even when bypassing the PCP, the page still ends up being evicted
> from L2 by the time the allocator side reuses it? Or maybe core-2-core
> transfer is fast on this small machine?
86780 / 85813 = 1.011
So, there's almost no measurable difference among the configurations
except the last one. I would rather say the test isn't sensitive to L2
size, but sensitive to LLC size on this machine.
Best Regards,
Huang, Ying
> P.S. I've blindly set pcp->high to the above values, ignoring the zone's
> low watermark cap for testing purposes.
On Wed, May 11, 2022 at 09:58:23AM +0800, [email protected] wrote:
> On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
> > [ Adding locking people in case they have any input ]
> >
> > On Mon, May 9, 2022 at 11:23 PM [email protected]
> > <[email protected]> wrote:
> > >
> > > >
> > > > Can you point me to the regression report? I would like to take a look,
> > > > thanks.
> > >
> > > https://lore.kernel.org/all/[email protected]/
> >
> > Hmm.
> >
> > That explanation looks believable, except that our qspinlocks
> > shouldn't be spinning on the lock itself, but spinning on the mcs node
> > it inserts into the lock.
>
> The referenced regression report is very old (in Feb 2015 for 3.16-
> 3.17). The ticket spinlock was still used at that time. I believe that
> things became much better after we switched to qspinlock. We can test that.
'will-it-scale/page_fault1 process mode' can greatly stress both the zone
lock and the LRU lock when nr_process = nr_cpu with thp disabled, so I ran
it to see if it still makes a difference with qspinlock.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
The result on an Icelake 2 sockets server with a total of 48cores/96cpus:
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/thp_enabled/cpufreq_governor/ucode:
lkp-icl-2sp4/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-11/100%/process/page_fault1/never/performance/0xd000331
commit:
v5.18-rc4
731a704c0d8760cfd641af4bf57167d8c68f9b99
v5.18-rc4 731a704c0d8760cfd641af4bf57
---------------- ---------------------------
%stddev %change %stddev
\ | \
12323894 -26.0% 9125299 will-it-scale.128.processes
22.33 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu
9.80 -9.2 0.57 ? 3% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru
36.25 +6.7 42.94 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
4.28 ? 10% +34.6 38.93 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages
75.05 +7.8 82.83 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
Commit 731a704c0d8760cfd641af4bf57 moves the zone lock back above
free_area by reverting commit a368ab67aa55 ("mm: move zone lock to a
different cache line than order-0 free page lists") on top of v5.18-rc4.
The interpretation of the above result is: after the revert, performance
dropped 26%, time spent on the zone lock increased by 41 percentage points
(from about 40% to 81%), and overall lock contention increased by 7.8
percentage points (from 75% to 82.83%). So it appears the lock's cache line
placement still makes a difference with qspinlock.
------
Commit 731a704c0d8760cfd641af4bf57:
From 731a704c0d8760cfd641af4bf57167d8c68f9b99 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 11 May 2022 10:32:53 +0800
Subject: [PATCH] Revert "mm: move zone lock to a different cache line than
order-0 free page lists"
This reverts commit a368ab67aa55615a03b2c9c00fb965bee3ebeaa4.
---
include/linux/mmzone.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46ffab808f03..f5534f42c693 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,15 +628,15 @@ struct zone {
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
+ /* Primarily protects free_area */
+ spinlock_t lock;
+
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER];
/* zone flags, see below */
unsigned long flags;
- /* Primarily protects free_area */
- spinlock_t lock;
-
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
--
2.35.3
The entire diff between the two kernels:
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/thp_enabled/cpufreq_governor/ucode:
lkp-icl-2sp4/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-11/100%/process/page_fault1/never/performance/0xd000331
commit:
v5.18-rc4
731a704c0d8760cfd641af4bf57167d8c68f9b99
v5.18-rc4 731a704c0d8760cfd641af4bf57
---------------- ---------------------------
%stddev %change %stddev
\ | \
12323894 -26.0% 9125299 will-it-scale.128.processes
0.05 ? 8% +37.5% 0.07 ? 17% will-it-scale.128.processes_idle
96279 -26.0% 71290 will-it-scale.per_process_ops
12323894 -26.0% 9125299 will-it-scale.workload
0.33 ?141% +800.0% 3.00 ? 54% time.major_page_faults
0.66 -0.1 0.60 mpstat.cpu.all.irq%
1.49 -0.3 1.23 mpstat.cpu.all.usr%
747.00 ? 54% -83.8% 121.33 ? 62% numa-meminfo.node0.Active(file)
4063469 -11.0% 3617426 ? 2% numa-meminfo.node0.AnonPages
1634 -3.9% 1571 vmstat.system.cs
250770 ? 5% -24.4% 189542 vmstat.system.in
7234686 ? 2% +13.9% 8241057 meminfo.Inactive
7231508 ? 2% +13.9% 8239382 meminfo.Inactive(anon)
101436 -19.5% 81700 meminfo.Mapped
592.33 ?141% +201.2% 1784 meminfo.Mlocked
1.873e+09 -23.7% 1.429e+09 numa-numastat.node0.local_node
1.872e+09 -23.7% 1.429e+09 numa-numastat.node0.numa_hit
1.853e+09 -28.2% 1.33e+09 numa-numastat.node1.local_node
1.852e+09 -28.2% 1.329e+09 numa-numastat.node1.numa_hit
52056 ? 65% +53.8% 80068 ? 34% numa-numastat.node1.other_node
0.06 -16.7% 0.05 turbostat.IPC
75911699 ? 4% -24.2% 57562839 turbostat.IRQ
27.73 -23.4 4.29 ? 6% turbostat.PKG_%
77.67 -1.7% 76.33 turbostat.PkgTmp
486.01 -2.8% 472.42 turbostat.PkgWatt
94.08 -13.3% 81.55 turbostat.RAMWatt
186.67 ? 54% -84.1% 29.67 ? 63% numa-vmstat.node0.nr_active_file
1031719 -10.8% 920591 ? 2% numa-vmstat.node0.nr_anon_pages
186.67 ? 54% -84.1% 29.67 ? 63% numa-vmstat.node0.nr_zone_active_file
1.872e+09 -23.7% 1.429e+09 numa-vmstat.node0.numa_hit
1.873e+09 -23.7% 1.429e+09 numa-vmstat.node0.numa_local
1030546 ? 2% -9.2% 935582 numa-vmstat.node1.nr_anon_pages
1.852e+09 -28.2% 1.329e+09 numa-vmstat.node1.numa_hit
1.853e+09 -28.2% 1.33e+09 numa-vmstat.node1.numa_local
52056 ? 65% +53.8% 80068 ? 34% numa-vmstat.node1.numa_other
34.48 ? 33% +59.4% 54.95 ? 16% sched_debug.cfs_rq:/.load_avg.avg
227417 +10.5% 251193 ? 3% sched_debug.cfs_rq:/.min_vruntime.stddev
59485 ? 84% -144.1% -26247 sched_debug.cfs_rq:/.spread0.avg
-1687153 +8.2% -1825127 sched_debug.cfs_rq:/.spread0.min
227479 +10.4% 251123 ? 3% sched_debug.cfs_rq:/.spread0.stddev
8.05 ? 21% +59.2% 12.82 ? 27% sched_debug.cpu.clock.stddev
0.55 ? 7% +61.5% 0.88 ? 14% sched_debug.rt_rq:/.rt_time.avg
68.39 ? 10% +65.2% 113.01 ? 14% sched_debug.rt_rq:/.rt_time.max
6.02 ? 10% +65.3% 9.95 ? 14% sched_debug.rt_rq:/.rt_time.stddev
51614 +6.2% 54828 ? 2% proc-vmstat.nr_active_anon
1762215 ? 3% +5.3% 1855523 proc-vmstat.nr_anon_pages
1855872 ? 3% +9.5% 2032582 proc-vmstat.nr_inactive_anon
25600 -19.4% 20637 proc-vmstat.nr_mapped
8779 +3.6% 9100 proc-vmstat.nr_page_table_pages
51614 +6.2% 54828 ? 2% proc-vmstat.nr_zone_active_anon
1855870 ? 3% +9.5% 2032581 proc-vmstat.nr_zone_inactive_anon
3.725e+09 -26.0% 2.758e+09 proc-vmstat.numa_hit
3.726e+09 -25.9% 2.759e+09 proc-vmstat.numa_local
140034 ? 3% -15.7% 118073 ? 4% proc-vmstat.numa_pte_updates
164530 -6.5% 153823 ? 2% proc-vmstat.pgactivate
3.722e+09 -25.9% 2.756e+09 proc-vmstat.pgalloc_normal
3.712e+09 -25.9% 2.749e+09 proc-vmstat.pgfault
3.722e+09 -25.9% 2.756e+09 proc-vmstat.pgfree
92383 -2.0% 90497 proc-vmstat.pgreuse
14.36 -11.1% 12.77 perf-stat.i.MPKI
1.493e+10 -11.2% 1.326e+10 perf-stat.i.branch-instructions
0.12 -0.0 0.09 perf-stat.i.branch-miss-rate%
16850271 -30.3% 11746955 perf-stat.i.branch-misses
53.64 -9.1 44.57 perf-stat.i.cache-miss-rate%
5.43e+08 -36.0% 3.473e+08 perf-stat.i.cache-misses
1.012e+09 -23.1% 7.788e+08 perf-stat.i.cache-references
1550 -3.2% 1500 perf-stat.i.context-switches
5.92 +16.4% 6.89 perf-stat.i.cpi
4.178e+11 +1.0% 4.219e+11 perf-stat.i.cpu-cycles
150.89 -2.3% 147.36 perf-stat.i.cpu-migrations
769.17 +57.8% 1213 perf-stat.i.cycles-between-cache-misses
0.01 -0.0 0.01 ? 3% perf-stat.i.dTLB-load-miss-rate%
1363413 ? 3% -41.4% 799244 ? 4% perf-stat.i.dTLB-load-misses
1.855e+10 -13.9% 1.597e+10 perf-stat.i.dTLB-loads
1.87 -0.0 1.83 perf-stat.i.dTLB-store-miss-rate%
1.45e+08 -27.1% 1.057e+08 perf-stat.i.dTLB-store-misses
7.586e+09 -25.1% 5.682e+09 perf-stat.i.dTLB-stores
7.051e+10 -13.3% 6.114e+10 perf-stat.i.instructions
0.17 -14.0% 0.15 perf-stat.i.ipc
333.69 +209.4% 1032 perf-stat.i.metric.K/sec
332.10 -16.3% 278.07 perf-stat.i.metric.M/sec
12265683 -25.6% 9119612 perf-stat.i.minor-faults
8.89 +4.2 13.06 perf-stat.i.node-load-miss-rate%
1327995 -33.6% 882417 perf-stat.i.node-load-misses
14189574 -57.0% 6101421 perf-stat.i.node-loads
0.63 +0.0 0.68 perf-stat.i.node-store-miss-rate%
2654944 -33.3% 1769896 perf-stat.i.node-store-misses
4.223e+08 -38.3% 2.606e+08 perf-stat.i.node-stores
12265684 -25.6% 9119613 perf-stat.i.page-faults
14.35 -11.1% 12.76 perf-stat.overall.MPKI
0.11 -0.0 0.09 perf-stat.overall.branch-miss-rate%
53.62 -9.0 44.59 perf-stat.overall.cache-miss-rate%
5.93 +16.4% 6.90 perf-stat.overall.cpi
770.18 +57.5% 1213 perf-stat.overall.cycles-between-cache-misses
0.01 ? 2% -0.0 0.01 ? 4% perf-stat.overall.dTLB-load-miss-rate%
1.87 -0.0 1.83 perf-stat.overall.dTLB-store-miss-rate%
0.17 -14.1% 0.14 perf-stat.overall.ipc
8.47 +3.9 12.38 perf-stat.overall.node-load-miss-rate%
0.62 +0.0 0.67 perf-stat.overall.node-store-miss-rate%
1728530 +16.5% 2012907 perf-stat.overall.path-length
1.483e+10 -11.8% 1.309e+10 perf-stat.ps.branch-instructions
16689442 -30.9% 11532682 perf-stat.ps.branch-misses
5.392e+08 -36.3% 3.433e+08 perf-stat.ps.cache-misses
1.006e+09 -23.4% 7.698e+08 perf-stat.ps.cache-references
1534 -4.1% 1472 perf-stat.ps.context-switches
148.92 -2.9% 144.56 perf-stat.ps.cpu-migrations
1379865 ? 2% -39.8% 830956 ? 4% perf-stat.ps.dTLB-load-misses
1.843e+10 -14.5% 1.576e+10 perf-stat.ps.dTLB-loads
1.44e+08 -27.5% 1.045e+08 perf-stat.ps.dTLB-store-misses
7.537e+09 -25.6% 5.611e+09 perf-stat.ps.dTLB-stores
7.006e+10 -13.9% 6.035e+10 perf-stat.ps.instructions
0.97 -7.8% 0.89 perf-stat.ps.major-faults
12184666 -26.0% 9015678 perf-stat.ps.minor-faults
1314901 -34.3% 864119 perf-stat.ps.node-load-misses
14202713 -56.9% 6114798 perf-stat.ps.node-loads
2633146 -34.0% 1737950 perf-stat.ps.node-store-misses
4.191e+08 -38.6% 2.575e+08 perf-stat.ps.node-stores
12184667 -26.0% 9015679 perf-stat.ps.page-faults
2.13e+13 -13.8% 1.837e+13 perf-stat.total.instructions
22.34 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
22.34 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range
22.33 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu
10.82 -9.6 1.26 perf-profile.calltrace.cycles-pp.folio_add_lru.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
10.74 -9.5 1.20 ? 2% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.folio_add_lru.do_anonymous_page.__handle_mm_fault.handle_mm_fault
67.12 -9.3 57.77 perf-profile.calltrace.cycles-pp.testcase
67.39 -9.3 58.05 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
9.85 -9.3 0.60 ? 3% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru.do_anonymous_page
9.85 -9.2 0.61 ? 3% perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru.do_anonymous_page.__handle_mm_fault
9.80 -9.2 0.57 ? 3% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru
63.37 -8.3 55.10 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
63.30 -8.3 55.04 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
62.65 -8.1 54.57 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
62.18 -7.9 54.23 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
61.61 -7.8 53.84 perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
6.69 -2.5 4.15 ? 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
5.36 -2.0 3.35 ? 2% perf-profile.calltrace.cycles-pp.charge_memcg.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault.handle_mm_fault
4.76 -1.7 3.05 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page
2.25 -0.8 1.41 ? 2% perf-profile.calltrace.cycles-pp.try_charge_memcg.charge_memcg.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault
1.16 -0.3 0.84 perf-profile.calltrace.cycles-pp.error_entry.testcase
1.08 -0.3 0.78 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.testcase
0.00 +0.6 0.58 perf-profile.calltrace.cycles-pp.__free_one_page.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu
3.31 +0.8 4.16 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
3.29 +0.8 4.14 perf-profile.calltrace.cycles-pp.release_pages.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap
0.87 ? 7% +3.1 3.96 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_finish_mmu
0.98 ? 6% +3.1 4.08 perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_finish_mmu.unmap_region.__do_munmap
0.95 ? 7% +3.1 4.05 perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_finish_mmu.unmap_region
43.13 +4.6 47.75 perf-profile.calltrace.cycles-pp.alloc_pages_vma.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
42.94 +4.7 47.60 perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_vma.do_anonymous_page.__handle_mm_fault.handle_mm_fault
42.69 +4.7 47.42 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page.__handle_mm_fault
37.53 +6.6 44.09 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page
37.08 +6.7 43.76 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_pages_vma
36.25 +6.7 42.94 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
36.26 +6.7 42.96 perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
28.35 +8.7 37.03 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
27.31 +9.1 36.40 perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
27.16 +9.2 36.32 perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.__munmap
3.40 ? 14% +31.6 34.97 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu
4.15 ? 12% +31.7 35.82 perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
3.94 ? 12% +31.7 35.61 perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range
4.28 ? 10% +34.6 38.93 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages
34.49 ? 2% -33.7 0.74 ? 5% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
34.49 ? 2% -33.7 0.74 ? 5% perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
10.86 -9.6 1.27 perf-profile.children.cycles-pp.folio_add_lru
10.80 -9.6 1.21 ? 2% perf-profile.children.cycles-pp.__pagevec_lru_add
68.04 -9.6 58.47 perf-profile.children.cycles-pp.testcase
65.56 -8.9 56.70 perf-profile.children.cycles-pp.asm_exc_page_fault
63.41 -8.3 55.13 perf-profile.children.cycles-pp.exc_page_fault
63.35 -8.3 55.09 perf-profile.children.cycles-pp.do_user_addr_fault
62.69 -8.1 54.60 perf-profile.children.cycles-pp.handle_mm_fault
62.20 -7.9 54.26 perf-profile.children.cycles-pp.__handle_mm_fault
61.84 -7.8 54.00 perf-profile.children.cycles-pp.do_anonymous_page
6.74 -2.6 4.18 ? 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
5.56 -2.1 3.47 ? 2% perf-profile.children.cycles-pp.charge_memcg
4.82 -1.7 3.09 perf-profile.children.cycles-pp.clear_page_erms
2.26 -0.8 1.42 ? 2% perf-profile.children.cycles-pp.try_charge_memcg
1.21 -0.3 0.88 perf-profile.children.cycles-pp.error_entry
1.08 -0.3 0.79 perf-profile.children.cycles-pp.sync_regs
1.01 -0.3 0.73 perf-profile.children.cycles-pp.native_irq_return_iret
0.66 -0.3 0.39 ± 3% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.68 -0.2 0.44 perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.66 -0.2 0.48 perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.50 -0.2 0.32 ± 2% perf-profile.children.cycles-pp.page_add_new_anon_rmap
0.53 -0.2 0.37 perf-profile.children.cycles-pp.__list_del_entry_valid
0.47 -0.1 0.32 perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.41 -0.1 0.27 perf-profile.children.cycles-pp.page_remove_rmap
0.39 -0.1 0.28 ± 3% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.38 -0.1 0.26 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.37 -0.1 0.26 ± 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.37 -0.1 0.26 ± 3% perf-profile.children.cycles-pp.hrtimer_interrupt
0.26 ± 5% -0.1 0.15 ± 3% perf-profile.children.cycles-pp.page_counter_try_charge
0.32 -0.1 0.21 perf-profile.children.cycles-pp.__mod_lruvec_state
0.33 -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.32 -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__perf_sw_event
0.30 -0.1 0.21 ± 3% perf-profile.children.cycles-pp.tick_sched_timer
0.29 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.tick_sched_handle
0.29 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.update_process_times
0.23 ± 2% -0.1 0.15 perf-profile.children.cycles-pp.__mod_node_page_state
0.27 -0.1 0.19 ± 2% perf-profile.children.cycles-pp.scheduler_tick
0.14 ± 3% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.25 -0.1 0.17 ± 2% perf-profile.children.cycles-pp.task_tick_fair
0.13 ± 3% -0.1 0.06 perf-profile.children.cycles-pp.free_swap_cache
0.22 -0.1 0.15 ± 5% perf-profile.children.cycles-pp.update_curr
0.24 -0.1 0.17 perf-profile.children.cycles-pp.___perf_sw_event
0.16 ± 3% -0.1 0.09 ± 5% perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
0.12 ± 3% -0.1 0.06 perf-profile.children.cycles-pp.task_numa_work
0.15 ± 5% -0.1 0.09 ± 5% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.20 -0.1 0.14 ± 5% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.13 ± 3% -0.1 0.07 ± 7% perf-profile.children.cycles-pp.exit_to_user_mode_loop
0.12 ± 3% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.task_work_run
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_prot_numa
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_protection_range
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_pmd_range
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_pte_range
0.20 ± 2% -0.1 0.14 ± 5% perf-profile.children.cycles-pp.perf_tp_event
0.19 ± 2% -0.1 0.13 ± 6% perf-profile.children.cycles-pp.__perf_event_overflow
0.19 ± 2% -0.1 0.13 ± 6% perf-profile.children.cycles-pp.perf_event_output_forward
0.16 -0.0 0.11 ± 4% perf-profile.children.cycles-pp.perf_callchain
0.16 ± 3% -0.0 0.11 ± 7% perf-profile.children.cycles-pp.get_perf_callchain
0.16 -0.0 0.12 ± 4% perf-profile.children.cycles-pp.perf_prepare_sample
0.13 ± 3% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.09 -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__irqentry_text_end
0.09 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.__cgroup_throttle_swaprate
0.12 ± 4% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.perf_callchain_kernel
0.11 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.free_unref_page_commit
0.11 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.__count_memcg_events
0.09 -0.0 0.06 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
0.12 ± 3% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.09 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.06 -0.0 0.03 ± 70% perf-profile.children.cycles-pp.handle_pte_fault
0.12 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.__might_resched
0.08 -0.0 0.06 ± 8% perf-profile.children.cycles-pp.up_read
0.10 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.__mod_zone_page_state
0.08 -0.0 0.06 perf-profile.children.cycles-pp.down_read_trylock
0.07 -0.0 0.05 perf-profile.children.cycles-pp.folio_mapping
0.07 -0.0 0.05 perf-profile.children.cycles-pp.find_vma
0.09 -0.0 0.07 ? 6% perf-profile.children.cycles-pp.unwind_next_frame
0.06 -0.0 0.05 perf-profile.children.cycles-pp.__cond_resched
0.16 +0.0 0.17 ? 2% perf-profile.children.cycles-pp.__list_add_valid
0.06 ± 8% +0.0 0.10 ± 4% perf-profile.children.cycles-pp.__tlb_remove_page_size
0.06 ± 7% +0.1 0.13 ± 18% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
0.06 ± 7% +0.1 0.13 ± 18% perf-profile.children.cycles-pp.shmem_alloc_page
0.00 +0.1 0.09 ± 5% perf-profile.children.cycles-pp.__get_free_pages
0.55 ± 3% +0.1 0.68 perf-profile.children.cycles-pp.__free_one_page
3.31 +0.8 4.16 perf-profile.children.cycles-pp.tlb_finish_mmu
43.22 +4.7 47.91 perf-profile.children.cycles-pp.alloc_pages_vma
43.13 +4.8 47.90 perf-profile.children.cycles-pp.__alloc_pages
42.83 +4.9 47.69 perf-profile.children.cycles-pp.get_page_from_freelist
37.68 +6.7 44.37 perf-profile.children.cycles-pp.rmqueue
37.20 +6.8 44.02 perf-profile.children.cycles-pp.rmqueue_bulk
75.05 +7.8 82.83 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
28.37 +8.7 37.04 perf-profile.children.cycles-pp.unmap_vmas
28.37 +8.7 37.04 perf-profile.children.cycles-pp.unmap_page_range
28.37 +8.7 37.04 perf-profile.children.cycles-pp.zap_pmd_range
28.37 +8.7 37.04 perf-profile.children.cycles-pp.zap_pte_range
27.31 +9.1 36.40 perf-profile.children.cycles-pp.tlb_flush_mmu
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__do_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__vm_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__x64_sys_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.unmap_region
31.70 +9.5 41.21 perf-profile.children.cycles-pp.__munmap
31.91 +9.5 41.44 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
31.91 +9.5 41.44 perf-profile.children.cycles-pp.do_syscall_64
30.60 +10.0 40.57 perf-profile.children.cycles-pp.release_pages
5.15 ± 8% +34.8 39.91 perf-profile.children.cycles-pp.free_unref_page_list
4.90 ± 9% +34.8 39.69 perf-profile.children.cycles-pp.free_pcppages_bulk
40.73 ± 2% +41.5 82.22 perf-profile.children.cycles-pp._raw_spin_lock
4.79 -1.7 3.08 ± 2% perf-profile.self.cycles-pp.clear_page_erms
3.27 -1.2 2.02 ± 2% perf-profile.self.cycles-pp.charge_memcg
1.80 -0.7 1.15 ± 2% perf-profile.self.cycles-pp.try_charge_memcg
2.15 -0.6 1.59 perf-profile.self.cycles-pp.testcase
0.57 ± 3% -0.3 0.27 ± 3% perf-profile.self.cycles-pp.zap_pte_range
1.06 -0.3 0.77 perf-profile.self.cycles-pp.sync_regs
1.01 -0.3 0.73 perf-profile.self.cycles-pp.native_irq_return_iret
0.63 -0.3 0.38 ± 3% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.51 -0.2 0.34 perf-profile.self.cycles-pp.__list_del_entry_valid
0.55 ± 2% -0.2 0.39 perf-profile.self.cycles-pp.do_anonymous_page
0.39 ± 2% -0.2 0.24 ± 3% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.30 ± 2% -0.1 0.18 ± 4% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.38 -0.1 0.27 perf-profile.self.cycles-pp.release_pages
0.34 -0.1 0.23 ± 2% perf-profile.self.cycles-pp.get_page_from_freelist
0.25 -0.1 0.14 ± 3% perf-profile.self.cycles-pp.rmqueue
0.30 -0.1 0.21 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.22 ± 6% -0.1 0.13 ± 3% perf-profile.self.cycles-pp.page_counter_try_charge
0.33 -0.1 0.24 perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.22 ± 2% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.__mod_node_page_state
0.28 -0.1 0.21 ± 2% perf-profile.self.cycles-pp.__handle_mm_fault
0.13 ± 3% -0.1 0.06 ± 8% perf-profile.self.cycles-pp.free_swap_cache
0.11 ± 4% -0.1 0.06 ± 8% perf-profile.self.cycles-pp.change_pte_range
0.19 ± 2% -0.1 0.14 perf-profile.self.cycles-pp.handle_mm_fault
0.16 ± 2% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__alloc_pages
0.17 ± 2% -0.0 0.12 perf-profile.self.cycles-pp.___perf_sw_event
0.14 ± 3% -0.0 0.10 perf-profile.self.cycles-pp.page_remove_rmap
0.16 -0.0 0.12 perf-profile.self.cycles-pp.do_user_addr_fault
0.07 -0.0 0.03 ± 70% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.09 -0.0 0.06 perf-profile.self.cycles-pp.__perf_sw_event
0.09 -0.0 0.06 perf-profile.self.cycles-pp.__count_memcg_events
0.11 -0.0 0.08 perf-profile.self.cycles-pp.cgroup_rstat_updated
0.11 -0.0 0.08 perf-profile.self.cycles-pp.__might_resched
0.08 -0.0 0.05 perf-profile.self.cycles-pp.__irqentry_text_end
0.13 -0.0 0.10 perf-profile.self.cycles-pp.error_entry
0.12 -0.0 0.09 perf-profile.self.cycles-pp.alloc_pages_vma
0.09 ± 5% -0.0 0.06 perf-profile.self.cycles-pp.free_unref_page_commit
0.08 -0.0 0.06 ± 8% perf-profile.self.cycles-pp.folio_add_lru
0.07 ± 6% -0.0 0.05 perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 4% -0.0 0.07 ± 6% perf-profile.self.cycles-pp.mem_cgroup_charge_statistics
0.10 -0.0 0.08 perf-profile.self.cycles-pp._raw_spin_lock
0.07 ± 6% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.down_read_trylock
0.09 -0.0 0.07 perf-profile.self.cycles-pp.__mod_zone_page_state
0.08 -0.0 0.06 perf-profile.self.cycles-pp.__mod_lruvec_state
0.07 ± 6% -0.0 0.06 ± 8% perf-profile.self.cycles-pp.asm_exc_page_fault
0.07 -0.0 0.05 ± 8% perf-profile.self.cycles-pp.up_read
0.06 ± 7% -0.0 0.05 perf-profile.self.cycles-pp.folio_mapping
0.08 ± 5% -0.0 0.07 perf-profile.self.cycles-pp.free_unref_page_list
0.13 +0.0 0.15 ± 3% perf-profile.self.cycles-pp.__list_add_valid
0.48 +0.1 0.58 perf-profile.self.cycles-pp.rmqueue_bulk
0.46 ± 3% +0.1 0.56 perf-profile.self.cycles-pp.__free_one_page
75.05 +7.8 82.83 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath