On 5/7/2022 3:11 PM, [email protected] wrote:
> On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
>> On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
>>> On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
>>>> On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
>>>>> On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
>>>>>> Hi Mel,
>>>>>>
>>>>>> On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
>>>>>>>
>>>>>>> (please note that we reported
>>>>>>> "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
>>>>>>> on
>>>>>>> https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
>>>>>>> while the commit was on a branch.
>>>>>>> Now we still observe a similar regression when it is on mainline, and we also
>>>>>>> observe a 13.2% improvement on another netperf subtest,
>>>>>>> so we report again for information.)
>>>>>>>
>>>>>>> Greeting,
>>>>>>>
>>>>>>> FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
>>>>>>>
>>>>>>>
>>>>>>> commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>
>>>>>>
>>>>>> So what this commit did is: if a CPU is always doing free (pcp->free_factor > 0)
>>>>>
>>>>> IMHO, this means the consumer and producer are running on different
>>>>> CPUs.
>>>>>
>>>>
>>>> Right.
>>>>
>>>>>> and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
>>>>>> then do not use the PCP but free the page directly to buddy.
>>>>>>
>>>>>> The rationale as explained in the commit's changelog is:
>>>>>> "
>>>>>> Netperf running on localhost exhibits this pattern and while it does not
>>>>>> matter for some machines, it does matter for others with smaller caches
>>>>>> where cache misses cause problems due to reduced page reuse. Pages
>>>>>> freed directly to the buddy list may be reused quickly while still cache
>>>>>> hot where as storing on the PCP lists may be cold by the time
>>>>>> free_pcppages_bulk() is called.
>>>>>> "
>>>>>>
>>>>>> This regression occurred on a machine that has large caches, so this
>>>>>> optimization brings it no value, only overhead (the skipped PCP); I
>>>>>> guess this is the reason why there is a regression.
>>>>>
>>>>> Per my understanding, not only is the cache size larger, but the L2
>>>>> cache (1MB) is also per-core on this machine. So if the consumer and
>>>>> producer are running on different cores, the cache-hot page may cause
>>>>> more core-to-core cache transfer. This may hurt performance too.
>>>>>
>>>>
>>>> Client side allocates skb(page) and server side recvfrom() it.
>>>> recvfrom() copies the page data to server's own buffer and then releases
>>>> the page associated with the skb. Client does all the allocation and
>>>> server does all the free, page reuse happens at client side.
>>>> So I think core-2-core cache transfer due to page reuse can occur when
>>>> client task migrates.
>>>
>>> The core-to-core cache transferring can be cross-socket or cross-L2 within
>>> one socket. I mean the latter.
>>>
>>>> I have modified the job to have the client and server bound to a
>>>> specific CPU of different cores on the same node, and testing it on the
>>>> same Icelake 2 sockets server, the result is
>>>>
>>>> kernel throughput
>>>> 8b10b465d0e1 125168
>>>> f26b3fa04611 102039 -18%
>>>>
>>>> It's also an 18% drop. I think this means c2c is not a factor?
>>>
>>> Can you test with client and server bound to 2 hardware threads
>>> (hyperthread) of one core? The two hardware threads of one core will
>>> share the L2 cache.
>>>
>>
>> 8b10b465d0e1: 89702
>> f26b3fa04611: 95823 +6.8%
>>
>> When binding client and server on the 2 threads of the same core, the
>> bisected commit is an improvement now on this 2 sockets Icelake server.
>
> Good. I guess cache-hot works now.
>
Yes, it can't be more hot now :-)
>>>>>> I have also tested this case on a small machine: a skylake desktop and
>>>>>> this commit shows improvement:
>>>>>> 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
>>>>>> f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
>>>>>>
>>>>>> So this means those directly freed pages get reused by allocator side
>>>>>> and that brings performance improvement for machines with smaller cache.
>>>>>
>>>>> Per my understanding, the L2 cache on this desktop machine is shared
>>>>> among cores.
>>>>>
>>>>
>>>> The said CPU is i7-6700 and according to this wikipedia page,
>>>> L2 is per core:
>>>> https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
>>>
>>> Sorry, my memory was wrong. The Skylake and later servers have a much
>>> larger private L2 cache (1MB vs 256KB on the client); this may increase the
>>> possibility of core-2-core transferring.
>>>
>>
>> I'm trying to understand where the core-2-core cache transfer happens.
>>
>> When the server needs to do the copy in recvfrom(), there is core-2-core
>> cache transfer from the client cpu to the server cpu. But this is the same no
>> matter whether the page gets reused or not, i.e. the bisected commit and its
>> parent don't differ in this step.
>
> Yes.
>
>> Then when the page gets reused on
>> the client side, there is no core-2-core cache transfer as the server
>> side didn't write to the page's data.
>
> The "reused" pages were read by the server side, so their cache lines
> are in "shared" state, some inter-core traffic is needed to shoot down
> these cache lines before the client side writes them. This will incur
> some overhead.
>
I thought the overhead of changing the cache line from "shared" to
"own"/"modify" is pretty cheap.
Also, this is the same case as the Skylake desktop machine, why it is a
gain there but a loss here? Is it that this "overhead" is much greater
in server machine to the extent that it is even better to use a totally
cold page than a hot one? If so, it seems to suggest we should avoid
cache reuse in server machine unless the two CPUs happens to be two
hyperthreads of the same core.
Thanks,
Aaron
>> So page reuse or not, it
>> shouldn't cause any difference regarding core-2-core cache transfer.
>> Is this correct?
>>
>>>>>> I wonder if we should still use the PCP a little bit under the condition
>>>>>> described above, for the purpose of:
>>>>>> 1) reducing overhead in the free path for machines with large caches;
>>>>>> 2) still keeping the benefit of reused pages for machines with smaller caches.
>>>>>>
>>>>>> For this reason, I tested increasing nr_pcp_high() from returning 0 to
>>>>>> either returning pcp->batch or (pcp->batch << 2):
>>>>>> machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
>>>>>> skylake desktop: 72288 90784 92219 91528
>>>>>> icelake 2sockets: 120956 99177 98251 116108
>>>>>>
>>>>>> note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
>>>>>> parent; returning 0 is the behaviour of this commit.
>>>>>>
>>>>>> The result shows that if we effectively use a PCP high of (pcp->batch << 2)
>>>>>> for the described condition, then this workload's performance on the
>>>>>> small machine can be kept while the regression on large machines can be
>>>>>> greatly reduced (from -18% to -4%).
>>>>>>
>>>>>
>>>>> Can we use cache size and topology information directly?
>>>>
>>>> It can be complicated by the fact that the system can have multiple
>>>> producers (cpus that are doing free) running at the same time, and getting
>>>> the perfect number can be a difficult job.
>>>
>>> We can discuss this after verifying whether it's core-2-core transferring
>>> related.
>>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>
>
>
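
For reference, below is a minimal sketch of the nr_pcp_high() tweak described
in the quoted discussion above. It is an illustrative reconstruction, not the
exact patch that was benchmarked: the free_high parameter and the surrounding
structure are assumed to follow the bisected commit f26b3fa04611, and the
(pcp->batch << 2) return value is the variant reported above as reducing the
regression from -18% to -4% while keeping the gain on the small machine.

/*
 * Illustrative sketch only: how many pages may stay on the per-cpu
 * list before the remainder is bulk-freed to buddy.  free_high is the
 * "this CPU is mostly freeing high-order pages" condition added by the
 * bisected commit, which made this function return 0 (bypass the PCP
 * completely) in that case.
 */
static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
		       bool free_high)
{
	int high = READ_ONCE(pcp->high);

	if (unlikely(!high))
		return 0;

	/*
	 * Variant measured above: instead of returning 0, keep a few
	 * batches on the PCP so small-cache machines still reuse
	 * cache-hot pages while large-cache machines avoid most of the
	 * overhead of freeing every page straight to buddy.
	 */
	if (free_high)
		return READ_ONCE(pcp->batch) << 2;

	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
		return high;

	/* If reclaim is active, limit the pages kept on the PCP lists. */
	return min(READ_ONCE(pcp->batch) << 2, high);
}
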
On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> On 5/7/2022 3:11 PM, [email protected] wrote:
> > On Sat, 2022-05-07 at 11:27 +0800, Aaron Lu wrote:
> > > On Sat, May 07, 2022 at 08:54:35AM +0800, [email protected] wrote:
> > > > On Fri, 2022-05-06 at 20:17 +0800, Aaron Lu wrote:
> > > > > On Fri, May 06, 2022 at 04:40:45PM +0800, [email protected] wrote:
> > > > > > On Fri, 2022-04-29 at 19:29 +0800, Aaron Lu wrote:
> > > > > > > Hi Mel,
> > > > > > >
> > > > > > > On Wed, Apr 20, 2022 at 09:35:26AM +0800, kernel test robot wrote:
> > > > > > > >
> > > > > > > > (please note that we reported
> > > > > > > > "[mm/page_alloc] 39907a939a: netperf.Throughput_Mbps -18.1% regression"
> > > > > > > > on
> > > > > > > > https://lore.kernel.org/all/20220228155733.GF1643@xsang-OptiPlex-9020/
> > > > > > > > while the commit was on a branch.
> > > > > > > > Now we still observe a similar regression when it is on mainline, and we also
> > > > > > > > observe a 13.2% improvement on another netperf subtest,
> > > > > > > > so we report again for information.)
> > > > > > > >
> > > > > > > > Greeting,
> > > > > > > >
> > > > > > > > FYI, we noticed a -18.0% regression of netperf.Throughput_Mbps due to commit:
> > > > > > > >
> > > > > > > >
> > > > > > > > commit: f26b3fa046116a7dedcaafe30083402113941451 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free")
> > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > > > >
> > > > > > >
> > > > > > > So what this commit did is: if a CPU is always doing free (pcp->free_factor > 0)
> > > > > >
> > > > > > IMHO, this means the consumer and producer are running on different
> > > > > > CPUs.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > > and if the order of the high-order page being freed is <= PAGE_ALLOC_COSTLY_ORDER,
> > > > > > > then do not use the PCP but free the page directly to buddy.
> > > > > > >
> > > > > > > The rationale as explained in the commit's changelog is:
> > > > > > > "
> > > > > > > Netperf running on localhost exhibits this pattern and while it does not
> > > > > > > matter for some machines, it does matter for others with smaller caches
> > > > > > > where cache misses cause problems due to reduced page reuse. Pages
> > > > > > > freed directly to the buddy list may be reused quickly while still cache
> > > > > > > hot where as storing on the PCP lists may be cold by the time
> > > > > > > free_pcppages_bulk() is called.
> > > > > > > "
> > > > > > >
> > > > > > > This regression occurred on a machine that has large caches, so this
> > > > > > > optimization brings it no value, only overhead (the skipped PCP); I
> > > > > > > guess this is the reason why there is a regression.
> > > > > >
> > > > > > Per my understanding, not only is the cache size larger, but the L2
> > > > > > cache (1MB) is also per-core on this machine. So if the consumer and
> > > > > > producer are running on different cores, the cache-hot page may cause
> > > > > > more core-to-core cache transfer. This may hurt performance too.
> > > > > >
> > > > >
> > > > > Client side allocates skb(page) and server side recvfrom() it.
> > > > > recvfrom() copies the page data to server's own buffer and then releases
> > > > > the page associated with the skb. Client does all the allocation and
> > > > > server does all the free, page reuse happens at client side.
> > > > > So I think core-2-core cache transfer due to page reuse can occur when
> > > > > client task migrates.
> > > >
> > > > The core-to-core cache transferring can be cross-socket or cross-L2 within
> > > > one socket. I mean the latter.
> > > >
> > > > > I have modified the job to have the client and server bound to a
> > > > > specific CPU of different cores on the same node, and testing it on the
> > > > > same Icelake 2 sockets server, the result is
> > > > >
> > > > > kernel throughput
> > > > > 8b10b465d0e1 125168
> > > > > f26b3fa04611 102039 -18%
> > > > >
> > > > > It's also an 18% drop. I think this means c2c is not a factor?
> > > >
> > > > Can you test with client and server bound to 2 hardware threads
> > > > (hyperthread) of one core? The two hardware threads of one core will
> > > > share the L2 cache.
> > > >
> > >
> > > 8b10b465d0e1: 89702
> > > f26b3fa04611: 95823 +6.8%
> > >
> > > When binding client and server on the 2 threads of the same core, the
> > > bisected commit is an improvement now on this 2 sockets Icelake server.
> >
> > Good. I guess cache-hot works now.
> >
>
> Yes, it can't be more hot now :-)
>
> > > > > > > I have also tested this case on a small machine: a skylake desktop and
> > > > > > > this commit shows improvement:
> > > > > > > 8b10b465d0e1: "netperf.Throughput_Mbps": 72288.76,
> > > > > > > f26b3fa04611: "netperf.Throughput_Mbps": 90784.4, +25.6%
> > > > > > >
> > > > > > > So this means those directly freed pages get reused by allocator side
> > > > > > > and that brings performance improvement for machines with smaller cache.
> > > > > >
> > > > > > Per my understanding, the L2 cache on this desktop machine is shared
> > > > > > among cores.
> > > > > >
> > > > >
> > > > > The said CPU is i7-6700 and according to this wikipedia page,
> > > > > L2 is per core:
> > > > > https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Mainstream_desktop_processors
> > > >
> > > > Sorry, my memory was wrong. The Skylake and later servers have a much
> > > > larger private L2 cache (1MB vs 256KB on the client); this may increase the
> > > > possibility of core-2-core transferring.
> > > >
> > >
> > > I'm trying to understand where the core-2-core cache transfer happens.
> > >
> > > When the server needs to do the copy in recvfrom(), there is core-2-core
> > > cache transfer from the client cpu to the server cpu. But this is the same no
> > > matter whether the page gets reused or not, i.e. the bisected commit and its
> > > parent don't differ in this step.
> >
> > Yes.
> >
> > > Then when the page gets reused on
> > > the client side, there is no core-2-core cache transfer as the server
> > > side didn't write to the page's data.
> >
> > The "reused" pages were read by the server side, so their cache lines
> > are in "shared" state, some inter-core traffic is needed to shoot down
> > these cache lines before the client side writes them. This will incur
> > some overhead.
> >
>
> I thought the overhead of changing the cache line from "shared" to
> "own"/"modify" is pretty cheap.
This is the read/write pattern of cache ping-pong. Although it should
be cheaper than the write/write pattern of cache ping-pong in theory, we
have seen serious regressions from that before.
> Also, this is the same case as the Skylake desktop machine, why it is a
> gain there but a loss here?
I guess the reason is the private cache size. The size of the private
L2 cache of SKL server is much larger than that of SKL client (1MB vs.
256KB). So there's much more core-2-core traffic on SKL server.
> Is it that this "overhead" is much greater
> in server machine to the extent that it is even better to use a totally
> cold page than a hot one?
Yes. And I think the private cache size matters here. And after being
evicted from the private cache (L1/L2), the cache lines of the reused
pages will go to the shared cache (L3), which will help performance.
> If so, it seems to suggest we should avoid
> cache reuse in server machine unless the two CPUs happens to be two
> hyperthreads of the same core.
Yes. I think so.
Best Regards,
Huang, Ying
> > > So page reuse or not, it
> > > shouldn't cause any difference regarding core-2-core cache transfer.
> > > Is this correct?
> > >
> > > > > > > I wonder if we should still use the PCP a little bit under the condition
> > > > > > > described above, for the purpose of:
> > > > > > > 1) reducing overhead in the free path for machines with large caches;
> > > > > > > 2) still keeping the benefit of reused pages for machines with smaller caches.
> > > > > > >
> > > > > > > For this reason, I tested increasing nr_pcp_high() from returning 0 to
> > > > > > > either returning pcp->batch or (pcp->batch << 2):
> > > > > > > machine\nr_pcp_high() ret: pcp->high 0 pcp->batch (pcp->batch << 2)
> > > > > > > skylake desktop: 72288 90784 92219 91528
> > > > > > > icelake 2sockets: 120956 99177 98251 116108
> > > > > > >
> > > > > > > note: nr_pcp_high() returning pcp->high is the behaviour of this commit's
> > > > > > > parent; returning 0 is the behaviour of this commit.
> > > > > > >
> > > > > > > The result shows that if we effectively use a PCP high of (pcp->batch << 2)
> > > > > > > for the described condition, then this workload's performance on the
> > > > > > > small machine can be kept while the regression on large machines can be
> > > > > > > greatly reduced (from -18% to -4%).
> > > > > > >
> > > > > >
> > > > > > Can we use cache size and topology information directly?
> > > > >
> > > > > It can be complicated by the fact that the system can have multiple
> > > > > producers (cpus that are doing free) running at the same time, and getting
> > > > > the perfect number can be a difficult job.
> > > >
> > > > We can discuss this after verifying whether it's core-2-core transferring
> > > > related.
> > > >
> > > > Best Regards,
> > > > Huang, Ying
> > > >
> > > >
> >
> >
On 5/7/2022 3:44 PM, [email protected] wrote:
> On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
... ...
>>
>> I thought the overhead of changing the cache line from "shared" to
>> "own"/"modify" is pretty cheap.
>
> This is the read/write pattern of cache ping-pong. Although it should
> be cheaper than the write/write pattern of cache ping-pong in theory, we
> have seen serious regressions from that before.
>
Can you point me to the regression report? I would like to take a look,
thanks.
>> Also, this is the same case as the Skylake desktop machine, why it is a
>> gain there but a loss here?
>
> I guess the reason is the private cache size. The size of the private
> L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> 256KB). So there's much more core-2-core traffic on SKL server.
>
It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
pages and that means the allocator side may have a higher chance of
reusing a page that is evicted from the free cpu's L2 cache than the
server machine, whose L2 can store 40 order-3 pages.
I can do more tests using different high for the two machines:
1) high=0, this is the case when page reuse is the extreme. core-2-core
transfer should be the most. This is the behavior of this bisected commit.
2) high=L2_size, this is the case when page reuse is fewer compared to
the above case, core-2-core should still be the majority.
3) high=2 times of L2_size and smaller than llc size, this is the case
when cache reuse is further reduced, and when the page is indeed reused,
it shouldn't cause core-2-core transfer but can benefit from llc.
4) high>llc_size, this is the case when page reuse is the least and when
page is indeed reused, it is likely not in the entire cache hierarchy.
This is the behavior of this bisected commit's parent commit for the
Skylake desktop machine.
I expect case 3) should give us the best performance and 1) or 4) is the
worst for this testcase.
case 4) is difficult to test on the server machine due to the cap on
pcp->high, which is affected by the low watermark of the zone. The server
machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
while the llc size is 48MiB, which translates to 12288 pages.
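
For reference, the page counts used in this discussion follow directly from
the cache sizes (a worked sketch, assuming 4KiB base pages so that an order-3
page is 8 * 4KiB = 32KiB; the 1.25MiB L2 and 48MiB LLC figures for the
Icelake server are the ones used later in this thread):

  order-3 page:        8 * 4KiB        = 32KiB
  Skylake desktop L2:  256KiB / 32KiB  = 8 order-3 pages
  Icelake server L2:   1.25MiB / 32KiB = 40 order-3 pages
  Icelake server LLC:  48MiB / 4KiB    = 12288 base pages
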
>> Is it that this "overhead" is much greater
>> in server machine to the extent that it is even better to use a totally
>> cold page than a hot one?
>
> Yes. And I think the private cache size matters here. And after being
> evicted from the private cache (L1/L2), the cache lines of the reused
> pages will go to the shared cache (L3), which will help performance.
>
Sounds reasonable.
>> If so, it seems to suggest we should avoid
>> cache reuse in server machine unless the two CPUs happens to be two
>> hyperthreads of the same core.
>
> Yes. I think so.
On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> On 5/7/2022 3:44 PM, [email protected] wrote:
> > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
>
> ... ...
>
> > >
> > > I thought the overhead of changing the cache line from "shared" to
> > > "own"/"modify" is pretty cheap.
> >
> > This is the read/write pattern of cache ping-pong. Although it should
> > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > have seen serious regressions from that before.
> >
>
> Can you point me to the regression report? I would like to take a look,
> thanks.
Sure.
https://lore.kernel.org/all/[email protected]/
> > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > gain there but a loss here?
> >
> > I guess the reason is the private cache size. The size of the private
> > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > 256KB). So there's much more core-2-core traffic on SKL server.
> >
>
> It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> pages and that means the allocator side may have a higher chance of
> reusing a page that is evicted from the free cpu's L2 cache than the
> server machine, whose L2 can store 40 order-3 pages.
>
> I can do more tests using different high for the two machines:
> 1) high=0, this is the case when page reuse is the extreme. core-2-core
> transfer should be the most. This is the behavior of this bisected commit.
> 2) high=L2_size, this is the case when page reuse is fewer compared to
> the above case, core-2-core should still be the majority.
> 3) high=2 times of L2_size and smaller than llc size, this is the case
> when cache reuse is further reduced, and when the page is indeed reused,
> it shouldn't cause core-2-core transfer but can benefit from llc.
> 4) high>llc_size, this is the case when page reuse is the least and when
> page is indeed reused, it is likely not in the entire cache hierarchy.
> This is the behavior of this bisected commit's parent commit for the
> Skylake desktop machine.
>
> I expect case 3) should give us the best performance and 1) or 4) is the
> worst for this testcase.
>
> case 4) is difficult to test on the server machine due to the cap on
> pcp->high, which is affected by the low watermark of the zone. The server
> machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> while the llc size is 48MiB, which translates to 12288 pages.
> >
Sounds good to me.
Best Regards,
Huang, Ying
> > > Is it that this "overhead" is much greater
> > > in server machine to the extent that it is even better to use a totally
> > > cold page than a hot one?
> >
> > Yes. And I think the private cache size matters here. And after being
> > evicted from the private cache (L1/L2), the cache lines of the reused
> > pages will go to the shared cache (L3), which will help performance.
> >
>
> Sounds reasonable.
>
> > > If so, it seems to suggest we should avoid
> > > cache reuse in server machine unless the two CPUs happens to be two
> > > hyperthreads of the same core.
> >
> > Yes. I think so.
[ Adding locking people in case they have any input ]
On Mon, May 9, 2022 at 11:23 PM [email protected]
<[email protected]> wrote:
>
> >
> > Can you point me to the regression report? I would like to take a look,
> > thanks.
>
> https://lore.kernel.org/all/[email protected]/
Hmm.
That explanation looks believable, except that our qspinlocks
shouldn't be spinning on the lock itself, but spinning on the mcs node
it inserts into the lock.
Or so I believed before I looked closer at the code again (it's been years).
It turns out we spin on the lock itself if we're the "head waiter". So
somebody is always spinning.
That's a bit unfortunate for this workload, I guess.
I think from a pure lock standpoint, it's the right thing to do (no
unnecessary bouncing, with the lock releaser doing just one write, and
the head waiter spinning on it is doing the right thing).
But I think this is an example of where you end up having that
spinning on the lock possibly then being a disturbance on the other
fields around the lock.
I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
that "spin on the lock itself" is just fundamentally the only correct
thing, but since my initial reaction was "no, we're spinning on the
mcs node", maybe that would be _possible_?
We do have a lot of those spinlocks embedded in other data structures
cases. And if "somebody else is waiting for the lock" contends badly
with "the lock holder is doing a lot of writes close to the lock",
then that's not great.
Linus
On 5/10/22 14:05, Linus Torvalds wrote:
> [ Adding locking people in case they have any input ]
>
> On Mon, May 9, 2022 at 11:23 PM [email protected]
> <[email protected]> wrote:
>>> Can you point me to the regression report? I would like to take a look,
>>> thanks.
>> https://lore.kernel.org/all/[email protected]/
> Hmm.
>
> That explanation looks believable, except that our qspinlocks
> shouldn't be spinning on the lock itself, but spinning on the mcs node
> it inserts into the lock.
>
> Or so I believed before I looked closer at the code again (it's been years).
>
> It turns out we spin on the lock itself if we're the "head waiter". So
> somebody is always spinning.
>
> That's a bit unfortunate for this workload, I guess.
>
> I think from a pure lock standpoint, it's the right thing to do (no
> unnecessary bouncing, with the lock releaser doing just one write, and
> the head waiter spinning on it is doing the right thing).
>
> But I think this is an example of where you end up having that
> spinning on the lock possibly then being a disturbance on the other
> fields around the lock.
>
> I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
> that "spin on the lock itself" is just fundamentally the only correct
> thing, but since my initial reaction was "no, we're spinning on the
> mcs node", maybe that would be _possible_?
>
> We do have a lot of those spinlocks embedded in other data structures
> cases. And if "somebody else is waiting for the lock" contends badly
> with "the lock holder is doing a lot of writes close to the lock",
> then that's not great.
Qspinlock still has one head waiter spinning on the lock. This is much
better than the original ticket spinlock where there will be n waiters
spinning on the lock. That is the cost of a cheap unlock. There is no
way to eliminate all lock spinning unless we use MCS lock directly which
will require a change in locking API as well as more expensive unlock.
Cheers,
Longman
On Tue, May 10, 2022 at 11:05:01AM -0700, Linus Torvalds wrote:
> I think from a pure lock standpoint, it's the right thing to do (no
> unnecessary bouncing, with the lock releaser doing just one write, and
> the head waiter spinning on it is doing the right thing).
>
> But I think this is an example of where you end up having that
> spinning on the lock possibly then being a disturbance on the other
> fields around the lock.
>
> I wonder if Waiman / PeterZ / Will have any comments on that. Maybe
> that "spin on the lock itself" is just fundamentally the only correct
> thing, but since my initial reaction was "no, we're spinning on the
> mcs node", maybe that would be _possible_?
>
> We do have a lot of those spinlocks embedded in other data structures
> cases. And if "somebody else is waiting for the lock" contends badly
> with "the lock holder is doing a lot of writes close to the lock",
> then that's not great.
The immediate problem is that we don't always have a node. Notably we
only do the whole MCS queueing thing when there's more than 1 contender.
Always doing the MCS thing had a hefty performance penalty vs the
simpler spinlock implementations for the uncontended and lightly contended
lock cases (by far the most common scenario) due to the extra cache-miss
of getting an MCS node.
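
For context, the uncontended fast path Peter refers to is roughly the
following (a simplified sketch along the lines of include/asm-generic/qspinlock.h;
exact details vary by kernel version). The per-cpu MCS node is only touched
once this cmpxchg fails and the slowpath is entered, which is why
unconditionally queueing would add a cache miss to the common uncontended case.

/*
 * Sketch of the qspinlock fast path: a single cmpxchg on the lock
 * word.  Only when the lock is already held or contended does the
 * slowpath run, and only the slowpath uses a per-cpu MCS node.
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
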
On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
> [ Adding locking people in case they have any input ]
>
> On Mon, May 9, 2022 at 11:23 PM [email protected]
> <[email protected]> wrote:
> >
> > >
> > > Can you point me to the regression report? I would like to take a look,
> > > thanks.
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Hmm.
>
> That explanation looks believable, except that our qspinlocks
> shouldn't be spinning on the lock itself, but spinning on the mcs node
> it inserts into the lock.
The referenced regression report is very old (in Feb 2015 for 3.16-
3.17). The ticket spinlock was still used at that time. I believe that
things became much better after we switched to qspinlock. We can test that.
Best Regards,
Huang, Ying
On Tue, May 10, 2022 at 02:23:28PM +0800, [email protected] wrote:
> On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> > On 5/7/2022 3:44 PM, [email protected] wrote:
> > > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> >
> > ... ...
> >
> > > >
> > > > I thought the overhead of changing the cache line from "shared" to
> > > > "own"/"modify" is pretty cheap.
> > >
> > > This is the read/write pattern of cache ping-pong. Although it should
> > > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > > have seen serious regressions from that before.
> > >
> >
> > Can you point me to the regression report? I would like to take a look,
> > thanks.
>
> Sure.
>
> https://lore.kernel.org/all/[email protected]/
>
> > > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > > gain there but a loss here?
> > >
> > > I guess the reason is the private cache size. The size of the private
> > > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > > 256KB). So there's much more core-2-core traffic on SKL server.
> > >
> >
> > It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> > pages and that means the allocator side may have a higher chance of
> > reusing a page that is evicted from the free cpu's L2 cache than the
> > server machine, whose L2 can store 40 order-3 pages.
> >
> > I can do more tests using different high for the two machines:
> > 1) high=0, this is the case when page reuse is the extreme. core-2-core
> > transfer should be the most. This is the behavior of this bisected commit.
> > 2) high=L2_size, this is the case when page reuse is fewer compared to
> > the above case, core-2-core should still be the majority.
> > 3) high=2 times of L2_size and smaller than llc size, this is the case
> > when cache reuse is further reduced, and when the page is indeed reused,
> > it shouldn't cause core-2-core transfer but can benefit from llc.
> > 4) high>llc_size, this is the case when page reuse is the least and when
> > page is indeed reused, it is likely not in the entire cache hierarchy.
> > This is the behavior of this bisected commit's parent commit for the
> > Skylake desktop machine.
> >
> > I expect case 3) should give us the best performance and 1) or 4) is the
> > worst for this testcase.
> >
> > case 4) is difficult to test on the server machine due to the cap on
> > pcp->high, which is affected by the low watermark of the zone. The server
> > machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> > while the llc size is 48MiB, which translates to 12288 pages.
> > >
>
> Sounds good to me.
I've run the tests on a 2-socket Icelake server and a Skylake desktop.
On this 2-socket Icelake server (1.25MiB L2 = 320 pages, 48MiB LLC =
12288 pages):
pcp->high score
0 100662 (bypass PCP, most page reuse, most core-2-core transfer)
320 (L2) 117252
640 133149
6144 (1/2 llc) 134674
12416 (> llc) 103193 (least page reuse)
Setting pcp->high to 640 (2 times the L2 size) gives a very good result, only
slightly lower than 6144 (1/2 the llc size). Bypassing the PCP to get the most
cache reuse didn't deliver good performance, so I think Ying is right:
core-2-core transfer really hurts.
On this 4core/8cpu Skylake desktop(256KiB L2 = 64 pages, 8MiB LLC = 2048
pages):
0 86780 (bypass PCP, most page reuse, most core-2-core transfer)
64(L2) 85813
128 85521
1024(1/2 llc) 85557
2176(> llc) 74458 (least page reuse)
Things are different on this small machine. Bypassing the PCP gives the best
performance. I find it hard to explain this. Maybe the 256KiB L2 is so
small that even when bypassing the PCP, the page still ends up being evicted
from L2 by the time the allocator side reuses it? Or maybe core-2-core
transfer is fast on this small machine?
P.S. I've blindly set pcp->high to the above values, ignoring the zone's
low watermark cap for testing purposes.
On Tue, May 10, 2022 at 11:47 AM Waiman Long <[email protected]> wrote:
> Qspinlock still has one head waiter spinning on the lock. This is much
> better than the original ticket spinlock where there will be n waiters
> spinning on the lock.
Oh, absolutely. I'm not saying we should look at going back. I'm more
asking whether maybe we could go even further..
> That is the cost of a cheap unlock. There is no way to eliminate all
> lock spinning unless we use MCS lock directly which will require a
> change in locking API as well as more expensive unlock.
So there's no question that unlock would be more expensive for the
contention case, since it would always have to not only clear the lock
itself but also update the node it points to.
But does it actually require a change in the locking API?
The qspinlock slowpath already always allocates that mcs node (for
some definition of "always" - I am obviously ignoring all the trylock
cases both before and in the slowpath)
But yes, clearly the simple store-release of the current
queued_spin_unlock() wouldn't work as-is, and maybe the cost of
replacing it with something else is much more expensive than any
possible win.
I think the PV case already basically does that - replacing the
"store release" with a much more complex sequence. No?
Linus
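
For reference, the "simple store-release" unlock being discussed is
essentially the following (again a sketch along the lines of
include/asm-generic/qspinlock.h, native non-paravirt case; the paravirt
variant replaces this store with a call that may also kick a waiting vCPU,
which is the "much more complex sequence" mentioned above).

/*
 * Native qspinlock unlock: a single release store to the locked byte.
 * This is the cheap unlock that would have to grow extra work (also
 * updating the node the lock points to) if waiters only ever spun on
 * their own MCS nodes.
 */
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
}
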
On 5/10/22 21:58, [email protected] wrote:
> On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
>> [ Adding locking people in case they have any input ]
>>
>> On Mon, May 9, 2022 at 11:23 PM [email protected]
>> <[email protected]> wrote:
>>>> Can you point me to the regression report? I would like to take a look,
>>>> thanks.
>>> https://lore.kernel.org/all/[email protected]/
>> Hmm.
>>
>> That explanation looks believable, except that our qspinlocks
>> shouldn't be spinning on the lock itself, but spinning on the mcs node
>> it inserts into the lock.
> The referenced regression report is very old (in Feb 2015 for 3.16-
> 3.17). The ticket spinlock was still used at that time. I believe that
> things became much better after we switched to qspinlock. We can test that.
Thanks for the info. Qspinlock was merged into mainline in 4.2, so the
ticket spinlock was used on all v3.* kernels. I was wondering why
qspinlock would have produced such a large performance regression with
just one head waiter spinning on the lock. So this is not such a big issue
after all.
Cheers,
Longman
On Wed, 2022-05-11 at 11:40 +0800, Aaron Lu wrote:
> On Tue, May 10, 2022 at 02:23:28PM +0800, [email protected] wrote:
> > On Tue, 2022-05-10 at 11:43 +0800, Aaron Lu wrote:
> > > On 5/7/2022 3:44 PM, [email protected] wrote:
> > > > On Sat, 2022-05-07 at 15:31 +0800, Aaron Lu wrote:
> > >
> > > ... ...
> > >
> > > > >
> > > > > I thought the overhead of changing the cache line from "shared" to
> > > > > "own"/"modify" is pretty cheap.
> > > >
> > > > This is the read/write pattern of cache ping-pong. Although it should
> > > > be cheaper than the write/write pattern of cache ping-pong in theory, we
> > > > have seen serious regressions from that before.
> > > >
> > >
> > > Can you point me to the regression report? I would like to take a look,
> > > thanks.
> >
> > Sure.
> >
> > https://lore.kernel.org/all/[email protected]/
> >
> > > > > Also, this is the same case as the Skylake desktop machine, why it is a
> > > > > gain there but a loss here?
> > > >
> > > > I guess the reason is the private cache size. The size of the private
> > > > L2 cache of SKL server is much larger than that of SKL client (1MB vs.
> > > > 256KB). So there's much more core-2-core traffic on SKL server.
> > > >
> > >
> > > It could be. The 256KiB L2 in Skylake desktop can only store 8 order-3
> > > pages and that means the allocator side may have a higher chance of
> > > reusing a page that is evicted from the free cpu's L2 cache than the
> > > server machine, whose L2 can store 40 order-3 pages.
> > >
> > > I can do more tests using different high for the two machines:
> > > 1) high=0, this is the case when page reuse is the extreme. core-2-core
> > > transfer should be the most. This is the behavior of this bisected commit.
> > > 2) high=L2_size, this is the case when page reuse is fewer compared to
> > > the above case, core-2-core should still be the majority.
> > > 3) high=2 times of L2_size and smaller than llc size, this is the case
> > > when cache reuse is further reduced, and when the page is indeed reused,
> > > it shouldn't cause core-2-core transfer but can benefit from llc.
> > > 4) high>llc_size, this is the case when page reuse is the least and when
> > > page is indeed reused, it is likely not in the entire cache hierarchy.
> > > This is the behavior of this bisected commit's parent commit for the
> > > Skylake desktop machine.
> > >
> > > I expect case 3) should give us the best performance and 1) or 4) is the
> > > worst for this testcase.
> > >
> > > case 4) is difficult to test on the server machine due to the cap on
> > > pcp->high, which is affected by the low watermark of the zone. The server
> > > machine has 128 cpus but only 128G memory, which caps pcp->high at 421,
> > > while the llc size is 48MiB, which translates to 12288 pages.
> > > >
> >
> > Sounds good to me.
>
> I've run the tests on a 2-socket Icelake server and a Skylake desktop.
>
> On this 2-socket Icelake server (1.25MiB L2 = 320 pages, 48MiB LLC =
> 12288 pages):
>
> pcp->high score
> 0 100662 (bypass PCP, most page reuse, most core-2-core transfer)
> 320 (L2) 117252
> 640 133149
> 6144 (1/2 llc) 134674
> 12416 (> llc) 103193 (least page reuse)
>
> Setting pcp->high to 640 (2 times the L2 size) gives a very good result, only
> slightly lower than 6144 (1/2 the llc size). Bypassing the PCP to get the most
> cache reuse didn't deliver good performance, so I think Ying is right:
> core-2-core transfer really hurts.
>
> On this 4core/8cpu Skylake desktop(256KiB L2 = 64 pages, 8MiB LLC = 2048
> pages):
>
> 0 86780 (bypass PCP, most page reuse, most core-2-core transfer)
> 64(L2) 85813
> 128 85521
> 1024(1/2 llc) 85557
> 2176(> llc) 74458 (least page reuse)
>
> Things are different on this small machine. Bypassing the PCP gives the best
> performance. I find it hard to explain this. Maybe the 256KiB L2 is so
> small that even when bypassing the PCP, the page still ends up being evicted
> from L2 by the time the allocator side reuses it? Or maybe core-2-core
> transfer is fast on this small machine?
86780 / 85813 = 1.011
So, there's almost no measurable difference among the configurations
except the last one. I would rather say the test isn't sensitive to L2
size, but sensitive to LLC size on this machine.
Best Regards,
Huang, Ying
> P.S. I've blindly set pcp->high to the above values, ignoring the zone's
> low watermark cap for testing purposes.
On Wed, May 11, 2022 at 09:58:23AM +0800, [email protected] wrote:
> On Tue, 2022-05-10 at 11:05 -0700, Linus Torvalds wrote:
> > [ Adding locking people in case they have any input ]
> >
> > On Mon, May 9, 2022 at 11:23 PM [email protected]
> > <[email protected]> wrote:
> > >
> > > >
> > > > Can you point me to the regression report? I would like to take a look,
> > > > thanks.
> > >
> > > https://lore.kernel.org/all/[email protected]/
> >
> > Hmm.
> >
> > That explanation looks believable, except that our qspinlocks
> > shouldn't be spinning on the lock itself, but spinning on the mcs node
> > it inserts into the lock.
>
> The referenced regression report is very old (in Feb 2015 for 3.16-
> 3.17). The ticket spinlock was still used at that time. I believe that
> things became much better after we switched to qspinlock. We can test that.
'will-it-scale/page_fault1 process mode' can greatly stress both the zone
lock and the LRU lock when nr_process = nr_cpu with thp disabled, so I ran
it to see if it still makes a difference with qspinlock.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
The result on an Icelake 2 sockets server with a total of 48cores/96cpus:
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/thp_enabled/cpufreq_governor/ucode:
lkp-icl-2sp4/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-11/100%/process/page_fault1/never/performance/0xd000331
commit:
v5.18-rc4
731a704c0d8760cfd641af4bf57167d8c68f9b99
v5.18-rc4 731a704c0d8760cfd641af4bf57
---------------- ---------------------------
%stddev %change %stddev
\ | \
12323894 -26.0% 9125299 will-it-scale.128.processes
22.33 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu
9.80 -9.2 0.57 ? 3% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru
36.25 +6.7 42.94 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
4.28 ? 10% +34.6 38.93 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages
75.05 +7.8 82.83 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
Commit 731a704c0d8760cfd641af4bf57 moves the zone lock back above
free_area by reverting commit a368ab67aa55 ("mm: move zone lock to a
different cache line than order-0 free page lists") on top of v5.18-rc4.
The interpretation of the above result is: after the revert, performance
dropped 26%, time spent on the zone lock increased by 41 percentage points
(from about 40% to 81%), and overall lock contention increased by 7.8
percentage points (from 75% to 82.83%). So it appears the lock's cache line
placement still makes a difference with qspinlock.
------
Commit 731a704c0d8760cfd641af4bf57:
From 731a704c0d8760cfd641af4bf57167d8c68f9b99 Mon Sep 17 00:00:00 2001
From: Aaron Lu <[email protected]>
Date: Wed, 11 May 2022 10:32:53 +0800
Subject: [PATCH] Revert "mm: move zone lock to a different cache line than
order-0 free page lists"
This reverts commit a368ab67aa55615a03b2c9c00fb965bee3ebeaa4.
---
include/linux/mmzone.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46ffab808f03..f5534f42c693 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,15 +628,15 @@ struct zone {
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
+ /* Primarily protects free_area */
+ spinlock_t lock;
+
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER];
/* zone flags, see below */
unsigned long flags;
- /* Primarily protects free_area */
- spinlock_t lock;
-
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
--
2.35.3
The entire diff between the two kernels:
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/thp_enabled/cpufreq_governor/ucode:
lkp-icl-2sp4/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-11/100%/process/page_fault1/never/performance/0xd000331
commit:
v5.18-rc4
731a704c0d8760cfd641af4bf57167d8c68f9b99
v5.18-rc4 731a704c0d8760cfd641af4bf57
---------------- ---------------------------
%stddev %change %stddev
\ | \
12323894 -26.0% 9125299 will-it-scale.128.processes
0.05 ? 8% +37.5% 0.07 ? 17% will-it-scale.128.processes_idle
96279 -26.0% 71290 will-it-scale.per_process_ops
12323894 -26.0% 9125299 will-it-scale.workload
0.33 ?141% +800.0% 3.00 ? 54% time.major_page_faults
0.66 -0.1 0.60 mpstat.cpu.all.irq%
1.49 -0.3 1.23 mpstat.cpu.all.usr%
747.00 ? 54% -83.8% 121.33 ? 62% numa-meminfo.node0.Active(file)
4063469 -11.0% 3617426 ? 2% numa-meminfo.node0.AnonPages
1634 -3.9% 1571 vmstat.system.cs
250770 ? 5% -24.4% 189542 vmstat.system.in
7234686 ? 2% +13.9% 8241057 meminfo.Inactive
7231508 ? 2% +13.9% 8239382 meminfo.Inactive(anon)
101436 -19.5% 81700 meminfo.Mapped
592.33 ?141% +201.2% 1784 meminfo.Mlocked
1.873e+09 -23.7% 1.429e+09 numa-numastat.node0.local_node
1.872e+09 -23.7% 1.429e+09 numa-numastat.node0.numa_hit
1.853e+09 -28.2% 1.33e+09 numa-numastat.node1.local_node
1.852e+09 -28.2% 1.329e+09 numa-numastat.node1.numa_hit
52056 ? 65% +53.8% 80068 ? 34% numa-numastat.node1.other_node
0.06 -16.7% 0.05 turbostat.IPC
75911699 ? 4% -24.2% 57562839 turbostat.IRQ
27.73 -23.4 4.29 ? 6% turbostat.PKG_%
77.67 -1.7% 76.33 turbostat.PkgTmp
486.01 -2.8% 472.42 turbostat.PkgWatt
94.08 -13.3% 81.55 turbostat.RAMWatt
186.67 ? 54% -84.1% 29.67 ? 63% numa-vmstat.node0.nr_active_file
1031719 -10.8% 920591 ? 2% numa-vmstat.node0.nr_anon_pages
186.67 ? 54% -84.1% 29.67 ? 63% numa-vmstat.node0.nr_zone_active_file
1.872e+09 -23.7% 1.429e+09 numa-vmstat.node0.numa_hit
1.873e+09 -23.7% 1.429e+09 numa-vmstat.node0.numa_local
1030546 ? 2% -9.2% 935582 numa-vmstat.node1.nr_anon_pages
1.852e+09 -28.2% 1.329e+09 numa-vmstat.node1.numa_hit
1.853e+09 -28.2% 1.33e+09 numa-vmstat.node1.numa_local
52056 ? 65% +53.8% 80068 ? 34% numa-vmstat.node1.numa_other
34.48 ? 33% +59.4% 54.95 ? 16% sched_debug.cfs_rq:/.load_avg.avg
227417 +10.5% 251193 ? 3% sched_debug.cfs_rq:/.min_vruntime.stddev
59485 ? 84% -144.1% -26247 sched_debug.cfs_rq:/.spread0.avg
-1687153 +8.2% -1825127 sched_debug.cfs_rq:/.spread0.min
227479 +10.4% 251123 ? 3% sched_debug.cfs_rq:/.spread0.stddev
8.05 ? 21% +59.2% 12.82 ? 27% sched_debug.cpu.clock.stddev
0.55 ? 7% +61.5% 0.88 ? 14% sched_debug.rt_rq:/.rt_time.avg
68.39 ? 10% +65.2% 113.01 ? 14% sched_debug.rt_rq:/.rt_time.max
6.02 ? 10% +65.3% 9.95 ? 14% sched_debug.rt_rq:/.rt_time.stddev
51614 +6.2% 54828 ? 2% proc-vmstat.nr_active_anon
1762215 ? 3% +5.3% 1855523 proc-vmstat.nr_anon_pages
1855872 ? 3% +9.5% 2032582 proc-vmstat.nr_inactive_anon
25600 -19.4% 20637 proc-vmstat.nr_mapped
8779 +3.6% 9100 proc-vmstat.nr_page_table_pages
51614 +6.2% 54828 ? 2% proc-vmstat.nr_zone_active_anon
1855870 ? 3% +9.5% 2032581 proc-vmstat.nr_zone_inactive_anon
3.725e+09 -26.0% 2.758e+09 proc-vmstat.numa_hit
3.726e+09 -25.9% 2.759e+09 proc-vmstat.numa_local
140034 ? 3% -15.7% 118073 ? 4% proc-vmstat.numa_pte_updates
164530 -6.5% 153823 ? 2% proc-vmstat.pgactivate
3.722e+09 -25.9% 2.756e+09 proc-vmstat.pgalloc_normal
3.712e+09 -25.9% 2.749e+09 proc-vmstat.pgfault
3.722e+09 -25.9% 2.756e+09 proc-vmstat.pgfree
92383 -2.0% 90497 proc-vmstat.pgreuse
14.36 -11.1% 12.77 perf-stat.i.MPKI
1.493e+10 -11.2% 1.326e+10 perf-stat.i.branch-instructions
0.12 -0.0 0.09 perf-stat.i.branch-miss-rate%
16850271 -30.3% 11746955 perf-stat.i.branch-misses
53.64 -9.1 44.57 perf-stat.i.cache-miss-rate%
5.43e+08 -36.0% 3.473e+08 perf-stat.i.cache-misses
1.012e+09 -23.1% 7.788e+08 perf-stat.i.cache-references
1550 -3.2% 1500 perf-stat.i.context-switches
5.92 +16.4% 6.89 perf-stat.i.cpi
4.178e+11 +1.0% 4.219e+11 perf-stat.i.cpu-cycles
150.89 -2.3% 147.36 perf-stat.i.cpu-migrations
769.17 +57.8% 1213 perf-stat.i.cycles-between-cache-misses
0.01 -0.0 0.01 ? 3% perf-stat.i.dTLB-load-miss-rate%
1363413 ? 3% -41.4% 799244 ? 4% perf-stat.i.dTLB-load-misses
1.855e+10 -13.9% 1.597e+10 perf-stat.i.dTLB-loads
1.87 -0.0 1.83 perf-stat.i.dTLB-store-miss-rate%
1.45e+08 -27.1% 1.057e+08 perf-stat.i.dTLB-store-misses
7.586e+09 -25.1% 5.682e+09 perf-stat.i.dTLB-stores
7.051e+10 -13.3% 6.114e+10 perf-stat.i.instructions
0.17 -14.0% 0.15 perf-stat.i.ipc
333.69 +209.4% 1032 perf-stat.i.metric.K/sec
332.10 -16.3% 278.07 perf-stat.i.metric.M/sec
12265683 -25.6% 9119612 perf-stat.i.minor-faults
8.89 +4.2 13.06 perf-stat.i.node-load-miss-rate%
1327995 -33.6% 882417 perf-stat.i.node-load-misses
14189574 -57.0% 6101421 perf-stat.i.node-loads
0.63 +0.0 0.68 perf-stat.i.node-store-miss-rate%
2654944 -33.3% 1769896 perf-stat.i.node-store-misses
4.223e+08 -38.3% 2.606e+08 perf-stat.i.node-stores
12265684 -25.6% 9119613 perf-stat.i.page-faults
14.35 -11.1% 12.76 perf-stat.overall.MPKI
0.11 -0.0 0.09 perf-stat.overall.branch-miss-rate%
53.62 -9.0 44.59 perf-stat.overall.cache-miss-rate%
5.93 +16.4% 6.90 perf-stat.overall.cpi
770.18 +57.5% 1213 perf-stat.overall.cycles-between-cache-misses
0.01 ? 2% -0.0 0.01 ? 4% perf-stat.overall.dTLB-load-miss-rate%
1.87 -0.0 1.83 perf-stat.overall.dTLB-store-miss-rate%
0.17 -14.1% 0.14 perf-stat.overall.ipc
8.47 +3.9 12.38 perf-stat.overall.node-load-miss-rate%
0.62 +0.0 0.67 perf-stat.overall.node-store-miss-rate%
1728530 +16.5% 2012907 perf-stat.overall.path-length
1.483e+10 -11.8% 1.309e+10 perf-stat.ps.branch-instructions
16689442 -30.9% 11532682 perf-stat.ps.branch-misses
5.392e+08 -36.3% 3.433e+08 perf-stat.ps.cache-misses
1.006e+09 -23.4% 7.698e+08 perf-stat.ps.cache-references
1534 -4.1% 1472 perf-stat.ps.context-switches
148.92 -2.9% 144.56 perf-stat.ps.cpu-migrations
1379865 ? 2% -39.8% 830956 ? 4% perf-stat.ps.dTLB-load-misses
1.843e+10 -14.5% 1.576e+10 perf-stat.ps.dTLB-loads
1.44e+08 -27.5% 1.045e+08 perf-stat.ps.dTLB-store-misses
7.537e+09 -25.6% 5.611e+09 perf-stat.ps.dTLB-stores
7.006e+10 -13.9% 6.035e+10 perf-stat.ps.instructions
0.97 -7.8% 0.89 perf-stat.ps.major-faults
12184666 -26.0% 9015678 perf-stat.ps.minor-faults
1314901 -34.3% 864119 perf-stat.ps.node-load-misses
14202713 -56.9% 6114798 perf-stat.ps.node-loads
2633146 -34.0% 1737950 perf-stat.ps.node-store-misses
4.191e+08 -38.6% 2.575e+08 perf-stat.ps.node-stores
12184667 -26.0% 9015679 perf-stat.ps.page-faults
2.13e+13 -13.8% 1.837e+13 perf-stat.total.instructions
22.34 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
22.34 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu.zap_pte_range
22.33 ? 4% -22.3 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_flush_mmu
10.82 -9.6 1.26 perf-profile.calltrace.cycles-pp.folio_add_lru.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
10.74 -9.5 1.20 ? 2% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.folio_add_lru.do_anonymous_page.__handle_mm_fault.handle_mm_fault
67.12 -9.3 57.77 perf-profile.calltrace.cycles-pp.testcase
67.39 -9.3 58.05 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
9.85 -9.3 0.60 ? 3% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru.do_anonymous_page
9.85 -9.2 0.61 ? 3% perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru.do_anonymous_page.__handle_mm_fault
9.80 -9.2 0.57 ? 3% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.__pagevec_lru_add.folio_add_lru
63.37 -8.3 55.10 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
63.30 -8.3 55.04 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
62.65 -8.1 54.57 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
62.18 -7.9 54.23 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
61.61 -7.8 53.84 perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
6.69 -2.5 4.15 ? 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
5.36 -2.0 3.35 ? 2% perf-profile.calltrace.cycles-pp.charge_memcg.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault.handle_mm_fault
4.76 -1.7 3.05 perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page
2.25 -0.8 1.41 ? 2% perf-profile.calltrace.cycles-pp.try_charge_memcg.charge_memcg.__mem_cgroup_charge.do_anonymous_page.__handle_mm_fault
1.16 -0.3 0.84 perf-profile.calltrace.cycles-pp.error_entry.testcase
1.08 -0.3 0.78 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.testcase
0.00 +0.6 0.58 perf-profile.calltrace.cycles-pp.__free_one_page.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu
3.31 +0.8 4.16 perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
3.29 +0.8 4.14 perf-profile.calltrace.cycles-pp.release_pages.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap
0.87 ? 7% +3.1 3.96 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_finish_mmu
0.98 ? 6% +3.1 4.08 perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_finish_mmu.unmap_region.__do_munmap
0.95 ? 7% +3.1 4.05 perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_finish_mmu.unmap_region
43.13 +4.6 47.75 perf-profile.calltrace.cycles-pp.alloc_pages_vma.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
42.94 +4.7 47.60 perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_vma.do_anonymous_page.__handle_mm_fault.handle_mm_fault
42.69 +4.7 47.42 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page.__handle_mm_fault
37.53 +6.6 44.09 perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_pages_vma.do_anonymous_page
37.08 +6.7 43.76 perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_pages_vma
36.25 +6.7 42.94 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist
36.26 +6.7 42.96 perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
28.37 +8.7 37.04 perf-profile.calltrace.cycles-pp.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
28.35 +8.7 37.03 perf-profile.calltrace.cycles-pp.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas.unmap_region
27.31 +9.1 36.40 perf-profile.calltrace.cycles-pp.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range.unmap_vmas
27.16 +9.2 36.32 perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range.unmap_page_range
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.20 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
31.70 +9.5 41.21 perf-profile.calltrace.cycles-pp.__munmap
3.40 ? 14% +31.6 34.97 perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu
4.15 ? 12% +31.7 35.82 perf-profile.calltrace.cycles-pp.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range.zap_pmd_range
3.94 ? 12% +31.7 35.61 perf-profile.calltrace.cycles-pp.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_flush_mmu.zap_pte_range
4.28 ? 10% +34.6 38.93 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages
34.49 ? 2% -33.7 0.74 ? 5% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
34.49 ? 2% -33.7 0.74 ? 5% perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
10.86 -9.6 1.27 perf-profile.children.cycles-pp.folio_add_lru
10.80 -9.6 1.21 ? 2% perf-profile.children.cycles-pp.__pagevec_lru_add
68.04 -9.6 58.47 perf-profile.children.cycles-pp.testcase
65.56 -8.9 56.70 perf-profile.children.cycles-pp.asm_exc_page_fault
63.41 -8.3 55.13 perf-profile.children.cycles-pp.exc_page_fault
63.35 -8.3 55.09 perf-profile.children.cycles-pp.do_user_addr_fault
62.69 -8.1 54.60 perf-profile.children.cycles-pp.handle_mm_fault
62.20 -7.9 54.26 perf-profile.children.cycles-pp.__handle_mm_fault
61.84 -7.8 54.00 perf-profile.children.cycles-pp.do_anonymous_page
6.74 -2.6 4.18 ? 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
5.56 -2.1 3.47 ? 2% perf-profile.children.cycles-pp.charge_memcg
4.82 -1.7 3.09 perf-profile.children.cycles-pp.clear_page_erms
2.26 -0.8 1.42 ? 2% perf-profile.children.cycles-pp.try_charge_memcg
1.21 -0.3 0.88 perf-profile.children.cycles-pp.error_entry
1.08 -0.3 0.79 perf-profile.children.cycles-pp.sync_regs
1.01 -0.3 0.73 perf-profile.children.cycles-pp.native_irq_return_iret
0.66 -0.3 0.39 ± 3% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.68 -0.2 0.44 perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.66 -0.2 0.48 perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.50 -0.2 0.32 ± 2% perf-profile.children.cycles-pp.page_add_new_anon_rmap
0.53 -0.2 0.37 perf-profile.children.cycles-pp.__list_del_entry_valid
0.47 -0.1 0.32 perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.41 -0.1 0.27 perf-profile.children.cycles-pp.page_remove_rmap
0.39 -0.1 0.28 ± 3% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.38 -0.1 0.26 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.37 -0.1 0.26 ± 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.37 -0.1 0.26 ± 3% perf-profile.children.cycles-pp.hrtimer_interrupt
0.26 ± 5% -0.1 0.15 ± 3% perf-profile.children.cycles-pp.page_counter_try_charge
0.32 -0.1 0.21 perf-profile.children.cycles-pp.__mod_lruvec_state
0.33 -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.32 -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__perf_sw_event
0.30 -0.1 0.21 ± 3% perf-profile.children.cycles-pp.tick_sched_timer
0.29 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.tick_sched_handle
0.29 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.update_process_times
0.23 ± 2% -0.1 0.15 perf-profile.children.cycles-pp.__mod_node_page_state
0.27 -0.1 0.19 ± 2% perf-profile.children.cycles-pp.scheduler_tick
0.14 ± 3% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.free_pages_and_swap_cache
0.25 -0.1 0.17 ± 2% perf-profile.children.cycles-pp.task_tick_fair
0.13 ± 3% -0.1 0.06 perf-profile.children.cycles-pp.free_swap_cache
0.22 -0.1 0.15 ± 5% perf-profile.children.cycles-pp.update_curr
0.24 -0.1 0.17 perf-profile.children.cycles-pp.___perf_sw_event
0.16 ± 3% -0.1 0.09 ± 5% perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
0.12 ± 3% -0.1 0.06 perf-profile.children.cycles-pp.task_numa_work
0.15 ± 5% -0.1 0.09 ± 5% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.20 -0.1 0.14 ± 5% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.13 ± 3% -0.1 0.07 ± 7% perf-profile.children.cycles-pp.exit_to_user_mode_loop
0.12 ± 3% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.task_work_run
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_prot_numa
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_protection_range
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_pmd_range
0.12 ± 6% -0.1 0.06 perf-profile.children.cycles-pp.change_pte_range
0.20 ± 2% -0.1 0.14 ± 5% perf-profile.children.cycles-pp.perf_tp_event
0.19 ± 2% -0.1 0.13 ± 6% perf-profile.children.cycles-pp.__perf_event_overflow
0.19 ± 2% -0.1 0.13 ± 6% perf-profile.children.cycles-pp.perf_event_output_forward
0.16 -0.0 0.11 ± 4% perf-profile.children.cycles-pp.perf_callchain
0.16 ± 3% -0.0 0.11 ± 7% perf-profile.children.cycles-pp.get_perf_callchain
0.16 -0.0 0.12 ± 4% perf-profile.children.cycles-pp.perf_prepare_sample
0.13 ± 3% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.09 -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__irqentry_text_end
0.09 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.__cgroup_throttle_swaprate
0.12 ± 4% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.perf_callchain_kernel
0.11 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.free_unref_page_commit
0.11 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.__count_memcg_events
0.09 -0.0 0.06 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
0.12 ± 3% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.mem_cgroup_charge_statistics
0.09 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.06 -0.0 0.03 ± 70% perf-profile.children.cycles-pp.handle_pte_fault
0.12 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.__might_resched
0.08 -0.0 0.06 ± 8% perf-profile.children.cycles-pp.up_read
0.10 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.__mod_zone_page_state
0.08 -0.0 0.06 perf-profile.children.cycles-pp.down_read_trylock
0.07 -0.0 0.05 perf-profile.children.cycles-pp.folio_mapping
0.07 -0.0 0.05 perf-profile.children.cycles-pp.find_vma
0.09 -0.0 0.07 ? 6% perf-profile.children.cycles-pp.unwind_next_frame
0.06 -0.0 0.05 perf-profile.children.cycles-pp.__cond_resched
0.16 +0.0 0.17 ? 2% perf-profile.children.cycles-pp.__list_add_valid
0.06 ± 8% +0.0 0.10 ± 4% perf-profile.children.cycles-pp.__tlb_remove_page_size
0.06 ± 7% +0.1 0.13 ± 18% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
0.06 ± 7% +0.1 0.13 ± 18% perf-profile.children.cycles-pp.shmem_alloc_page
0.00 +0.1 0.09 ± 5% perf-profile.children.cycles-pp.__get_free_pages
0.55 ± 3% +0.1 0.68 perf-profile.children.cycles-pp.__free_one_page
3.31 +0.8 4.16 perf-profile.children.cycles-pp.tlb_finish_mmu
43.22 +4.7 47.91 perf-profile.children.cycles-pp.alloc_pages_vma
43.13 +4.8 47.90 perf-profile.children.cycles-pp.__alloc_pages
42.83 +4.9 47.69 perf-profile.children.cycles-pp.get_page_from_freelist
37.68 +6.7 44.37 perf-profile.children.cycles-pp.rmqueue
37.20 +6.8 44.02 perf-profile.children.cycles-pp.rmqueue_bulk
75.05 +7.8 82.83 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
28.37 +8.7 37.04 perf-profile.children.cycles-pp.unmap_vmas
28.37 +8.7 37.04 perf-profile.children.cycles-pp.unmap_page_range
28.37 +8.7 37.04 perf-profile.children.cycles-pp.zap_pmd_range
28.37 +8.7 37.04 perf-profile.children.cycles-pp.zap_pte_range
27.31 +9.1 36.40 perf-profile.children.cycles-pp.tlb_flush_mmu
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__do_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__vm_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.__x64_sys_munmap
31.70 +9.5 41.20 perf-profile.children.cycles-pp.unmap_region
31.70 +9.5 41.21 perf-profile.children.cycles-pp.__munmap
31.91 +9.5 41.44 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
31.91 +9.5 41.44 perf-profile.children.cycles-pp.do_syscall_64
30.60 +10.0 40.57 perf-profile.children.cycles-pp.release_pages
5.15 ± 8% +34.8 39.91 perf-profile.children.cycles-pp.free_unref_page_list
4.90 ± 9% +34.8 39.69 perf-profile.children.cycles-pp.free_pcppages_bulk
40.73 ± 2% +41.5 82.22 perf-profile.children.cycles-pp._raw_spin_lock
4.79 -1.7 3.08 ± 2% perf-profile.self.cycles-pp.clear_page_erms
3.27 -1.2 2.02 ± 2% perf-profile.self.cycles-pp.charge_memcg
1.80 -0.7 1.15 ± 2% perf-profile.self.cycles-pp.try_charge_memcg
2.15 -0.6 1.59 perf-profile.self.cycles-pp.testcase
0.57 ± 3% -0.3 0.27 ± 3% perf-profile.self.cycles-pp.zap_pte_range
1.06 -0.3 0.77 perf-profile.self.cycles-pp.sync_regs
1.01 -0.3 0.73 perf-profile.self.cycles-pp.native_irq_return_iret
0.63 -0.3 0.38 ± 3% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.51 -0.2 0.34 perf-profile.self.cycles-pp.__list_del_entry_valid
0.55 ± 2% -0.2 0.39 perf-profile.self.cycles-pp.do_anonymous_page
0.39 ± 2% -0.2 0.24 ± 3% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.30 ± 2% -0.1 0.18 ± 4% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.38 -0.1 0.27 perf-profile.self.cycles-pp.release_pages
0.34 -0.1 0.23 ± 2% perf-profile.self.cycles-pp.get_page_from_freelist
0.25 -0.1 0.14 ± 3% perf-profile.self.cycles-pp.rmqueue
0.30 -0.1 0.21 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.22 ± 6% -0.1 0.13 ± 3% perf-profile.self.cycles-pp.page_counter_try_charge
0.33 -0.1 0.24 perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.22 ± 2% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.__mod_node_page_state
0.28 -0.1 0.21 ± 2% perf-profile.self.cycles-pp.__handle_mm_fault
0.13 ± 3% -0.1 0.06 ± 8% perf-profile.self.cycles-pp.free_swap_cache
0.11 ± 4% -0.1 0.06 ± 8% perf-profile.self.cycles-pp.change_pte_range
0.19 ± 2% -0.1 0.14 perf-profile.self.cycles-pp.handle_mm_fault
0.16 ± 2% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__alloc_pages
0.17 ± 2% -0.0 0.12 perf-profile.self.cycles-pp.___perf_sw_event
0.14 ± 3% -0.0 0.10 perf-profile.self.cycles-pp.page_remove_rmap
0.16 -0.0 0.12 perf-profile.self.cycles-pp.do_user_addr_fault
0.07 -0.0 0.03 ± 70% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.09 -0.0 0.06 perf-profile.self.cycles-pp.__perf_sw_event
0.09 -0.0 0.06 perf-profile.self.cycles-pp.__count_memcg_events
0.11 -0.0 0.08 perf-profile.self.cycles-pp.cgroup_rstat_updated
0.11 -0.0 0.08 perf-profile.self.cycles-pp.__might_resched
0.08 -0.0 0.05 perf-profile.self.cycles-pp.__irqentry_text_end
0.13 -0.0 0.10 perf-profile.self.cycles-pp.error_entry
0.12 -0.0 0.09 perf-profile.self.cycles-pp.alloc_pages_vma
0.09 ± 5% -0.0 0.06 perf-profile.self.cycles-pp.free_unref_page_commit
0.08 -0.0 0.06 ± 8% perf-profile.self.cycles-pp.folio_add_lru
0.07 ± 6% -0.0 0.05 perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 4% -0.0 0.07 ± 6% perf-profile.self.cycles-pp.mem_cgroup_charge_statistics
0.10 -0.0 0.08 perf-profile.self.cycles-pp._raw_spin_lock
0.07 ± 6% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.down_read_trylock
0.09 -0.0 0.07 perf-profile.self.cycles-pp.__mod_zone_page_state
0.08 -0.0 0.06 perf-profile.self.cycles-pp.__mod_lruvec_state
0.07 ± 6% -0.0 0.06 ± 8% perf-profile.self.cycles-pp.asm_exc_page_fault
0.07 -0.0 0.05 ± 8% perf-profile.self.cycles-pp.up_read
0.06 ± 7% -0.0 0.05 perf-profile.self.cycles-pp.folio_mapping
0.08 ± 5% -0.0 0.07 perf-profile.self.cycles-pp.free_unref_page_list
0.13 +0.0 0.15 ± 3% perf-profile.self.cycles-pp.__list_add_valid
0.48 +0.1 0.58 perf-profile.self.cycles-pp.rmqueue_bulk
0.46 ± 3% +0.1 0.56 perf-profile.self.cycles-pp.__free_one_page
75.05 +7.8 82.83 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath