2023-11-23 16:06:30

by Yin, Fengwei

Subject: Re: [linus:master] [filemap] c8be038067: vm-scalability.throughput -27.3% regression


On 11/20/2023 9:48 PM, kernel test robot wrote:
>
> hi, Fengwei,
>
> we noticed c8be038067 is the fix commit for
> de74976eb65151a2f568e477fc2e0032df5b22b4 ("filemap: add filemap_map_folio_range()")
>
> and we captured a number of improvements for this commit
> (refer to below
> "In addition to that, the commit also has significant impact on the following tests"
> part which includes several examples)
>
> however, recently, we found a regression, as the title mentions.
I can reproduce the regression on an Ice Lake platform with 256G memory + 72C/144T.

>
> the extra information we want to share is we also tested on mainline tip when
> this bisect was done, and noticed the regression disappears:
I can also reproduce the "regression disappears on latest mainline" behavior, and
I found the related commit:

commit f5617ffeb450f84c57f7eba1a3524a29955d42b7
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Mon Jul 24 19:54:08 2023 +0100

mm: run the fault-around code under the VMA lock

With this commit, map_pages() can be called twice: the first time with the VMA lock
held, the second time with mmap_lock held (and the second call sets no PTEs because
of the !pte_none() check).

With this commit reverted, the lost throughput is recovered to some extent.
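
As a rough sketch of the do_read_fault() flow after f5617ffeb450 (simplified and
reconstructed from mm/memory.c of that era, so treat the details as approximate):

static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
	vm_fault_t ret = 0;

	/* First ->map_pages() call, possibly with only the VMA lock held. */
	if (should_fault_around(vmf)) {
		ret = do_fault_around(vmf);
		if (ret)
			return ret;
	}

	/* Before 12214eba1992: a VMA-locked fault bails out here and is
	 * retried under mmap_lock, repeating the whole sequence including
	 * fault-around, i.e. a second ->map_pages() call. */
	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
		vma_end_read(vmf->vma);
		return VM_FAULT_RETRY;
	}

	/* ... __do_fault() and finish_fault() follow ... */
}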


And the reason the regression disappears on the latest mainline is related to:

commit 12214eba1992642eee5813a9cc9f626e5b2d1815 (test6)
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Fri Oct 6 20:53:17 2023 +0100

mm: handle read faults under the VMA lock


This commit eliminates the second call to map_pages(), and the lost throughput is
recovered to some extent.

We may still need to move the following code block before the fault-around code:
	ret = vmf_can_call_fault(vmf);
	if (ret)
		return ret;
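
That is, roughly this ordering in do_read_fault() (a sketch of the suggestion, not
a tested patch):

static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
	vm_fault_t ret;

	/* Check first whether ->fault() may be called at all, so a
	 * VMA-locked fault that must retry anyway does not spend time
	 * on fault-around work that is then repeated under mmap_lock. */
	ret = vmf_can_call_fault(vmf);
	if (ret)
		return ret;

	if (should_fault_around(vmf)) {
		ret = do_fault_around(vmf);
		if (ret)
			return ret;
	}

	/* ... __do_fault() and finish_fault() follow ... */
}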

>
> # extra tests on head commit of linus/master
> # good: [9bacdd8996c77c42ca004440be610692275ff9d0] Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
>
> the data is even better than parent:
>
> "vm-scalability.throughput": [
> 54122867,
> 58465096,
> 55888729,
> 56960948,
> 56385702,
> 56392203
> ],
>
> and we also reverted c8be038067 on mainline tip, but found no big diff:
>
> # extra tests on revert first bad commit
> # good: [f13a82be4c3252dabd1328a437c309d84ec7a872] Revert "filemap: add filemap_map_order0_folio() to handle order0 folio"
>
> "vm-scalability.throughput": [
> 56434337,
> 56199754,
> 56214041,
> 55308070,
> 55401115,
> 55709753
> ],
>
>
> commit:
> 578d7699e5 ("proc: nommu: /proc/<pid>/maps: release mmap read lock")
> c8be038067 ("filemap: add filemap_map_order0_folio() to handle order0 folio")
>
> 578d7699e5c2add8 c8be03806738c86521dbf1e0503
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 146.95 ± 8% -83.0% 24.99 ± 3% vm-scalability.free_time
> 233050 -28.1% 167484 vm-scalability.median
> 590.30 ± 12% -590.2 0.06 ± 45% vm-scalability.stddev%
> 51589606 -27.3% 37516397 vm-scalability.throughput

I found some very interesting behavior:
1. I am sure filemap_map_order0_folio() is faster than filemap_map_folio_range()
when the folio has order 0 (always true for shmem for now).
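
As a rough illustration of why: an order-0 folio has exactly one page, so the
per-page loop of filemap_map_folio_range() collapses into a single-page fast path.
A simplified sketch, not the literal code of c8be038067:

static vm_fault_t map_order0_folio_sketch(struct vm_fault *vmf,
		struct folio *folio, unsigned long addr)
{
	struct page *page = &folio->page;	/* the folio's only page */

	if (PageHWPoison(page))
		return 0;
	if (!pte_none(ptep_get(vmf->pte)))	/* already mapped */
		return 0;

	set_pte_range(vmf, folio, page, 1, addr);
	folio_ref_inc(folio);

	return (vmf->address == addr) ? VM_FAULT_NOPAGE : 0;
}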

2. If I use the ebpf_trace tool to count how many times filemap_map_order0_folio()
is called while vm-scalability is running, the test result becomes better. Normally,
adding tracing overhead should make the result worse.

It looks like a slower filemap_map_order0_folio() can produce a better vm-scalability
result. In another test I added a 2us delay to filemap_map_order0_folio(), and the
vm-scalability result improved a little.
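
For reference, the delay experiment was the obvious kind of hack, something along
these lines (illustrative only, reconstructed from memory; needs <linux/delay.h>):

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ ... @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
+	/* experiment only: artificially slow down the order-0 fast path */
+	udelay(2);

The call count from point 2 can be gathered with an eBPF one-liner such as
bpftrace -e 'kprobe:filemap_map_order0_folio { @calls = count(); }', assuming the
symbol is not inlined and is visible to kprobes.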

3. Using perf with 578d7699e5 and c8be038067:
For do_read_fault() with 578d7699e5:
- 48.58% 0.04% usemem [kernel.vmlinux] [k] do_read_fault
- 48.54% do_read_fault
- 44.34% filemap_map_pages
19.45% filemap_map_folio_range
3.22% next_uptodate_folio
+ 1.72% folio_wake_bit
+ 3.29% __do_fault
0.65% folio_wake_bit

For do_read_fault() with c8be038067:
- 72.98% 0.09% usemem [kernel.vmlinux] [k] do_read_fault
- 72.89% do_read_fault
- 52.70% filemap_map_pages
32.10% next_uptodate_folio <----- much higher than 578d7699e5
+ 12.35% folio_wake_bit
+ 1.53% set_pte_range
+ 12.43% __do_fault
+ 6.36% folio_wake_bit <----- higher than 578d7699e5
+ 0.97% finish_fault

My theory is that a faster filemap_map_order0_folio() brings more contention in
next_uptodate_folio(), presumably because each fault finishes its mapping work
sooner, so more CPUs end up in next_uptodate_folio() concurrently, hammering the
shared page-cache XArray and folio reference counts.

4. I finally located the part of the code that brought the higher contention in
next_uptodate_folio(). It's related to the following change:
diff --git a/mm/filemap.c b/mm/filemap.c
index 582f5317ff71..056a2d2e2428 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3481,7 +3481,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct vm_area_struct *vma = vmf->vma;
struct file *file = vma->vm_file;
struct page *page = folio_page(folio, start);
- unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
unsigned int count = 0;
pte_t *old_ptep = vmf->pte;

@@ -3489,9 +3488,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
if (PageHWPoison(page + count))
goto skip;

- if (mmap_miss > 0)
- mmap_miss--;
-
/*
* NOTE: If there're PTE markers, we'll leave them to be
* handled in the specific fault path, and it'll prohibit the
@@ -3525,7 +3521,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
}

vmf->pte = old_ptep;
- WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);

return ret;
}

If I apply the above change (dropping the read-modify-write of file->f_ra.mmap_miss)
to 578d7699e5, the next_uptodate_folio() overhead rises there as well. perf shows:
- 68.83% 0.08% usemem [kernel.vmlinux] [k] do_read_fault
- 68.75% do_read_fault
- 49.34% filemap_map_pages
29.71% next_uptodate_folio
+ 11.82% folio_wake_bit
+ 2.34% filemap_map_folio_range
- 11.97% __do_fault
+ 11.93% shmem_fault
+ 6.12% folio_wake_bit
+ 0.92% finish_fault

And vm-scalability throughput dropped to the same level as the latest mainline.

IMHO, we should slow down filemap_map_order0_folio(), because other benchmarks can
benefit from a faster filemap_map_order0_folio(). Thanks.


Regards
Yin, Fengwei


2023-11-24 06:44:23

by Yin, Fengwei

Subject: Re: [linus:master] [filemap] c8be038067: vm-scalability.throughput -27.3% regression



On 11/24/2023 12:05 AM, Yin, Fengwei wrote:
> [...]
> IMHO, we should slow down filemap_map_order0_folio(), because other benchmarks can
Sorry for the typo. I meant "we should not slow down filemap_map_order0_folio() ...".

> benefit from a faster filemap_map_order0_folio(). Thanks.
>
>
> Regards
> Yin, Fengwei