The global swap-in readahead policy takes into account previous access
patterns, using a scaling heuristic to determine the optimal readahead
chunk size dynamically.
This works pretty well in most cases, but like any heuristic there are
specific cases when this approach is not ideal, for example the swapoff
scenario.
During swapoff we just want to load back into memory all the swapped-out
pages and for this specific use case a fixed-size readahead is more
efficient.
The specific use case this patch is addressing is to improve swapoff
performance when a VM has been hibernated, resumed and all memory needs
to be forced back to RAM by disabling swap (see the test case below).
But this is not the only case where a fixed-size readahead can show its
benefits. More generally, the fixed-size approach can be beneficial in
all the cases where a task that is using a large part of the swapped-out
pages needs to load them back into memory as fast as possible.
Testing environment
===================
- Host:
CPU: 1.8GHz Intel Core i7-8565U (quad-core, 8MB cache)
HDD: PC401 NVMe SK hynix 512GB
MEM: 16GB
- Guest (kvm):
8GB of RAM
virtio block driver
16GB swap file on ext4 (/swapfile)
Test case
=========
- allocate 85% of memory (see the allocation sketch after this list)
- `systemctl hibernate` to force all the pages to be swapped-out to the
swap file
- resume the system
- measure the time that swapoff takes to complete:
# /usr/bin/time swapoff /swapfile
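For reference, the "allocate 85% of memory" step can be reproduced with
a trivial allocator along the following lines; this is only a sketch of
the kind of tool used, the exact program is not part of this thread:

/*
 * Hypothetical helper: dirty ~85% of physical RAM so that hibernation
 * has to swap it all out. Sketch only, not the tool actually used.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t page = sysconf(_SC_PAGESIZE);
	size_t pages = sysconf(_SC_PHYS_PAGES);
	size_t target = pages * page / 100 * 85;
	char *buf = malloc(target);

	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 0xaa, target);	/* fault in every page */
	printf("allocated %zu bytes, sleeping\n", target);
	pause();	/* keep the memory mapped until hibernation */
	return 0;
}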
Result
======
5.6 vanilla 5.6 w/ this patch
----------- -----------------
page-cluster=1 26.77s 21.25s
page-cluster=2 28.29s 12.66s
page-cluster=3 22.09s 8.77s
page-cluster=4 21.50s 7.60s
page-cluster=5 25.35s 7.75s
page-cluster=6 23.19s 8.32s
page-cluster=7 22.25s 9.40s
page-cluster=8 22.09s 8.93s
Signed-off-by: Andrea Righi <[email protected]>
---
Changes in v2:
- avoid introducing a new ABI to select the fixed-size readahead
NOTE: after running some tests with this new patch I don't see any
difference in terms of performance, so I'm reporting the same test
results as in the previous version.
mm/swap_state.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ebed37bbf7a3..c71abc8df304 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -20,6 +20,7 @@
#include <linux/migrate.h>
#include <linux/vmalloc.h>
#include <linux/swap_slots.h>
+#include <linux/oom.h>
#include <linux/huge_mm.h>
#include <asm/pgtable.h>
@@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
max_pages = 1 << READ_ONCE(page_cluster);
if (max_pages <= 1)
return 1;
+ /*
+ * If the current task is using too much memory or swapoff is running,
+ * simply use the max readahead size. Since we likely want to load a
+ * lot of pages back into memory, using a fixed-size max readahead can
+ * give better performance in this case.
+ */
+ if (oom_task_origin(current))
+ return max_pages;
hits = atomic_xchg(&swapin_readahead_hits, 0);
pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
--
2.25.1
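For context, the reason oom_task_origin(current) identifies the swapoff
path is that the swapoff(2) syscall marks the calling task as an OOM
origin around try_to_unuse(). Roughly (paraphrased from mm/swapfile.c in
5.6, not standalone code):

/*
 * Paraphrase of the relevant part of the swapoff(2) syscall: the whole
 * swap-in loop runs with the OOM-origin flag set on current, so
 * oom_task_origin(current) is true exactly while swapoff is reloading
 * pages, which is what the check in the patch above keys off.
 */
static int swapoff_unuse_sketch(unsigned int type)
{
	int err;

	set_current_oom_origin();
	err = try_to_unuse(type, false, 0);	/* bring every page back in */
	clear_current_oom_origin();

	return err;
}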
Andrea Righi <[email protected]> writes:
[snip]
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ebed37bbf7a3..c71abc8df304 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -20,6 +20,7 @@
> #include <linux/migrate.h>
> #include <linux/vmalloc.h>
> #include <linux/swap_slots.h>
> +#include <linux/oom.h>
> #include <linux/huge_mm.h>
>
> #include <asm/pgtable.h>
> @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> max_pages = 1 << READ_ONCE(page_cluster);
> if (max_pages <= 1)
> return 1;
> + /*
> + * If current task is using too much memory or swapoff is running
> + * simply use the max readahead size. Since we likely want to load a
> + * lot of pages back into memory, using a fixed-size max readhaead can
> + * give better performance in this case.
> + */
> + if (oom_task_origin(current))
> + return max_pages;
>
> hits = atomic_xchg(&swapin_readahead_hits, 0);
> pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
Think about this again. If my understanding is correct, the access
pattern during swapoff is sequential, so why doesn't swap readahead
work? If so, can you root cause that first?
Best Regards,
Huang, Ying
On Mon, Apr 13, 2020 at 09:13:33PM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> > The global swap-in readahead policy takes in account the previous access
> > patterns, using a scaling heuristic to determine the optimal readahead
> > chunk dynamically.
> >
> > This works pretty well in most cases, but like any heuristic there are
> > specific cases when this approach is not ideal, for example the swapoff
> > scenario.
> >
> > During swapoff we just want to load back into memory all the swapped-out
> > pages and for this specific use case a fixed-size readahead is more
> > efficient.
> >
> > The specific use case this patch is addressing is to improve swapoff
> > performance when a VM has been hibernated, resumed and all memory needs
> > to be forced back to RAM by disabling swap (see the test case below).
>
> Why do you need to swapoff after resuming? The swap device isn't used
> except hibernation? I guess the process is,
>
> 1) add swap device to VM
> 2) hibernate
> 3) resume
> 4) swapoff
Correct, the swap device is used only for hibernation; when the system
is resumed, swap is disabled (swapoff).
>
> Some pages are swapped out in step 2? If os, can we just set
> /proc/sys/vm/swappiness to 0 to avoid swapping in step 2?
Sorry, can you elaborate more on this? All anonymous pages are swapped
out during step 2; it doesn't matter if we set swappiness to 0, they are
swapped out anyway, because we need to save them somewhere in order to
hibernate and shut down the system.
Thanks,
-Andrea
On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> [snip]
>
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ebed37bbf7a3..c71abc8df304 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -20,6 +20,7 @@
> > #include <linux/migrate.h>
> > #include <linux/vmalloc.h>
> > #include <linux/swap_slots.h>
> > +#include <linux/oom.h>
> > #include <linux/huge_mm.h>
> >
> > #include <asm/pgtable.h>
> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> > max_pages = 1 << READ_ONCE(page_cluster);
> > if (max_pages <= 1)
> > return 1;
> > + /*
> > + * If current task is using too much memory or swapoff is running
> > + * simply use the max readahead size. Since we likely want to load a
> > + * lot of pages back into memory, using a fixed-size max readhaead can
> > + * give better performance in this case.
> > + */
> > + if (oom_task_origin(current))
> > + return max_pages;
> >
> > hits = atomic_xchg(&swapin_readahead_hits, 0);
> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
>
> Thinks this again. If my understanding were correct, the accessing
> pattern during swapoff is sequential, why swap readahead doesn't work?
> If so, can you root cause that firstly?
Theoretically if the pattern is sequential the current heuristic should
already select a big readahead size, but apparently it's not doing that.
I'll repeat my tests tracing the readahead size during swapoff to see
exactly what's going on here.
Thanks,
-Andrea
Andrea Righi <[email protected]> writes:
> The global swap-in readahead policy takes in account the previous access
> patterns, using a scaling heuristic to determine the optimal readahead
> chunk dynamically.
>
> This works pretty well in most cases, but like any heuristic there are
> specific cases when this approach is not ideal, for example the swapoff
> scenario.
>
> During swapoff we just want to load back into memory all the swapped-out
> pages and for this specific use case a fixed-size readahead is more
> efficient.
>
> The specific use case this patch is addressing is to improve swapoff
> performance when a VM has been hibernated, resumed and all memory needs
> to be forced back to RAM by disabling swap (see the test case below).
Why do you need to swapoff after resuming? Is the swap device used only
for hibernation? I guess the process is,
1) add swap device to VM
2) hibernate
3) resume
4) swapoff
Are some pages swapped out in step 2? If so, can we just set
/proc/sys/vm/swappiness to 0 to avoid swapping in step 2?
Best Regards,
Huang, Ying
Andrea Righi <[email protected]> writes:
> On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
>> Andrea Righi <[email protected]> writes:
>>
>> [snip]
>>
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index ebed37bbf7a3..c71abc8df304 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -20,6 +20,7 @@
>> > #include <linux/migrate.h>
>> > #include <linux/vmalloc.h>
>> > #include <linux/swap_slots.h>
>> > +#include <linux/oom.h>
>> > #include <linux/huge_mm.h>
>> >
>> > #include <asm/pgtable.h>
>> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>> > max_pages = 1 << READ_ONCE(page_cluster);
>> > if (max_pages <= 1)
>> > return 1;
>> > + /*
>> > + * If current task is using too much memory or swapoff is running
>> > + * simply use the max readahead size. Since we likely want to load a
>> > + * lot of pages back into memory, using a fixed-size max readhaead can
>> > + * give better performance in this case.
>> > + */
>> > + if (oom_task_origin(current))
>> > + return max_pages;
>> >
>> > hits = atomic_xchg(&swapin_readahead_hits, 0);
>> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
>>
>> Thinks this again. If my understanding were correct, the accessing
>> pattern during swapoff is sequential, why swap readahead doesn't work?
>> If so, can you root cause that firstly?
>
> Theoretically if the pattern is sequential the current heuristic should
> already select a big readahead size, but apparently it's not doing that.
>
> I'll repeat my tests tracing the readahead size during swapoff to see
> exactly what's going on here.
I haven't verified it. It may be helpful to call lookup_swap_cache()
before swapin_readahead() in unuse_pte_range(). The theory behind it is
to update the swap readahead statistics via lookup_swap_cache().
Best Regards,
Huang, Ying
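For reference, the way lookup_swap_cache() updates those statistics is
roughly the following (simplified paraphrase of mm/swap_state.c in 5.6,
not standalone code): on a swap-cache hit for a page that was brought in
by readahead, the cluster-readahead hit counter is incremented, and that
counter is what drives swapin_nr_pages().

struct page *lookup_swap_cache_sketch(swp_entry_t entry,
				      struct vm_area_struct *vma,
				      unsigned long addr)
{
	struct page *page;

	page = find_get_page(swap_address_space(entry), swp_offset(entry));
	if (page && !PageTransCompound(page) &&
	    TestClearPageReadahead(page)) {
		count_vm_event(SWAP_RA_HIT);
		/* the cluster readahead heuristic feeds on this counter */
		if (!swap_use_vma_readahead())
			atomic_inc(&swapin_readahead_hits);
	}
	return page;
}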
Andrea Righi <[email protected]> writes:
> On Mon, Apr 13, 2020 at 09:13:33PM +0800, Huang, Ying wrote:
>> Andrea Righi <[email protected]> writes:
>>
>> > The global swap-in readahead policy takes in account the previous access
>> > patterns, using a scaling heuristic to determine the optimal readahead
>> > chunk dynamically.
>> >
>> > This works pretty well in most cases, but like any heuristic there are
>> > specific cases when this approach is not ideal, for example the swapoff
>> > scenario.
>> >
>> > During swapoff we just want to load back into memory all the swapped-out
>> > pages and for this specific use case a fixed-size readahead is more
>> > efficient.
>> >
>> > The specific use case this patch is addressing is to improve swapoff
>> > performance when a VM has been hibernated, resumed and all memory needs
>> > to be forced back to RAM by disabling swap (see the test case below).
>>
>> Why do you need to swapoff after resuming? The swap device isn't used
>> except hibernation? I guess the process is,
>>
>> 1) add swap device to VM
>> 2) hibernate
>> 3) resume
>> 4) swapoff
>
> Correct, the swap device is used only for hibernation, when the system
> is resumed the swap is disabled (swapoff).
>
>>
>> Some pages are swapped out in step 2? If os, can we just set
>> /proc/sys/vm/swappiness to 0 to avoid swapping in step 2?
>
> Sorry, can you elaborate more on this? All anonymous pages are swapped
> out during step 2, it doesn't matter if we set swappiness to 0, they are
> swapped out anyway, because we need save them somewhere in order to
> hibernate, shutting down the system.
All pages will be written to the swap device in step 2, but the normal
swapout path isn't used, so these pages will be restored in step 3
instead of step 4. However, at the beginning of step 2 the system may
try to reclaim some pages, and those reclaimed anonymous pages will be
restored in step 4. This may be avoided by setting
/proc/sys/vm/swappiness to 0 before step 2.
Best Regards,
Huang, Ying
> Thanks,
> -Andrea
On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> >> Andrea Righi <[email protected]> writes:
> >>
> >> [snip]
> >>
> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> > index ebed37bbf7a3..c71abc8df304 100644
> >> > --- a/mm/swap_state.c
> >> > +++ b/mm/swap_state.c
> >> > @@ -20,6 +20,7 @@
> >> > #include <linux/migrate.h>
> >> > #include <linux/vmalloc.h>
> >> > #include <linux/swap_slots.h>
> >> > +#include <linux/oom.h>
> >> > #include <linux/huge_mm.h>
> >> >
> >> > #include <asm/pgtable.h>
> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> >> > max_pages = 1 << READ_ONCE(page_cluster);
> >> > if (max_pages <= 1)
> >> > return 1;
> >> > + /*
> >> > + * If current task is using too much memory or swapoff is running
> >> > + * simply use the max readahead size. Since we likely want to load a
> >> > + * lot of pages back into memory, using a fixed-size max readhaead can
> >> > + * give better performance in this case.
> >> > + */
> >> > + if (oom_task_origin(current))
> >> > + return max_pages;
> >> >
> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> >>
> >> Thinks this again. If my understanding were correct, the accessing
> >> pattern during swapoff is sequential, why swap readahead doesn't work?
> >> If so, can you root cause that firstly?
> >
> > Theoretically if the pattern is sequential the current heuristic should
> > already select a big readahead size, but apparently it's not doing that.
> >
> > I'll repeat my tests tracing the readahead size during swapoff to see
> > exactly what's going on here.
>
> I haven't verify it. It may be helpful to call lookup_swap_cache()
> before swapin_readahead() in unuse_pte_range(). The theory behind it is
> to update the swap readahead statistics via lookup_swap_cache().
I did more tests trying to collect some useful information.
In particular I've been focusing on tracing the distribution of the
values returned by swapin_nr_pages() in different scenarios.
To do so I made swapin_nr_pages() traceable and used the following
bcc command to measure the distribution of the returned values:
# argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
I've collected this metric in the following scenarios:
- 5.6 vanilla
- 5.6 + lookup_swap_cache() before swapin_readahead() in
unuse_pte_range()
- 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
in unuse_pte_range()
- 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
before swapin_readahead() in unuse_pte_range()
Each kernel has been tested both with swappiness=0 and swappiness=60.
Results are pretty much identical changing the swappiness, so I'm just
reporting the default case here (w/ swappiness=60).
Result
======
= swapoff performance (elapsed time) =
vanilla 22.09s
lookup_swap_cache() 23.87s
hits++ 16.10s
hits=last_ra_pages 8.81s
= swapin_nr_pages() $retval distribution =
5.6 vanilla:
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
36948 $retval = 8
44151 $retval = 4
49290 $retval = 1
527771 $retval = 2
5.6 lookup_swap_cache() before swapin_readahead():
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
13093 $retval = 1
56703 $retval = 8
123067 $retval = 2
366118 $retval = 4
5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
2589 $retval = 1
8016 $retval = 2
40021 $retval = 8
566038 $retval = 4
5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
785 $retval = 2
1072 $retval = 1
21844 $retval = 4
644168 $retval = 8
In the vanilla case, the readahead heuristic seems to choose 2 pages
most of the time. This is because we are not properly considering the
hits (hits are always 0 in the swapoff code path) and, as you correctly
pointed out, we can fix this by calling lookup_swap_cache() in
unuse_pte_range() before calling swapin_readahead().
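For reference, the scaling logic used by swapin_nr_pages() /
__swapin_nr_pages() explains these numbers; paraphrased and simplified
below (this is a sketch, not the verbatim 5.6 code):

static unsigned long scaling_sketch(unsigned long prev_offset,
				    unsigned long offset,
				    unsigned int hits,
				    unsigned int max_pages,
				    unsigned int prev_win)
{
	unsigned int pages = hits + 2;

	if (pages == 2) {
		/* No hits: keep 2 only for an adjacent (sequential) fault. */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
	} else {
		/* With hits, round up to the next power of two (>= 4). */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;
	/* Don't shrink the window too fast. */
	if (pages < prev_win / 2)
		pages = prev_win / 2;

	return pages;
}

With hits always 0 this returns 2 for adjacent offsets (1 otherwise),
which matches the vanilla distribution above; a single hit already
rounds up to 4, and hits equal to the previous window saturate at
max_pages.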
With this change the distribution of the readahead size moves more
toward 4 pages, but we still have some 2s. That looks good, but it
doesn't seem to speed up swapoff very much... maybe because calling
lookup_swap_cache() introduces a small overhead? (I still need to
investigate this theory.)
In the next test I tried to always increment hits by 1 before calling
swapin_readahead() in unuse_pte_range(). This is basically cheating,
because I'm faking the hit ratio and forcing the heuristic to use a
larger readahead size; in fact, the readahead size moves even more
toward 4 pages and swapoff performance is a little better now.
Pushing the "cheating" even further, I can pretend that the previous
readahead was all hits (swapin_readahead_hits=last_readahead_pages),
forcing the heuristic to move toward the max size and keep using it.
The result here is pretty much identical to my fixed-size patch, because
swapin_nr_pages() returns the max readahead size pretty much all the
time during swapoff (8 pages here or, more generally, 1 << vm.page-cluster).
Personally I don't like forcing the heuristic in this way very much; it
would be nice if it just worked by accounting for the proper hit ratio
(so just by adding lookup_swap_cache() as you correctly suggested), but
that solution doesn't seem to improve performance in reality. For this
reason I still think we should treat the swapoff scenario as a special
case and somehow bypass the readahead heuristic, always returning
the max readahead size.
Looking at the hits of the previous step in the swapoff case just
doesn't work, because we may have some misses, but they will become hits
very soon, since we are reading all the swapped out pages back into
memory. This is why using the max readahead size gives better
swapoff performance.
What do you think?
Thanks,
-Andrea
Andrea Righi <[email protected]> writes:
> mm/swapfile.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9fd47e6f7a86..cb9eb517178d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> vmf.pmd = pmd;
> last_ra = atomic_read(&last_readahead_pages);
> atomic_set(&swapin_readahead_hits, last_ra);
You need to remove the above two lines first.
Best Regards,
Huang, Ying
> - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
> + page = lookup_swap_cache(entry, vma, addr);
> + if (!page)
> + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
> if (!page) {
> if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
> goto try_next;
>
> Thanks!
> -Andrea
On Wed, Apr 15, 2020 at 03:44:08PM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> > mm/swapfile.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9fd47e6f7a86..cb9eb517178d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > vmf.pmd = pmd;
> > last_ra = atomic_read(&last_readahead_pages);
> > atomic_set(&swapin_readahead_hits, last_ra);
>
> You need to remove the above 2 lines firstly.
Meh... too much enthusiasm, and I definitely need more coffee this
morning. Here's the right patch applied:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..8b38441b66fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1940,7 +1940,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
vmf.vma = vma;
vmf.address = addr;
vmf.pmd = pmd;
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
+ page = lookup_swap_cache(entry, vma, addr);
+ if (!page)
+ page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
if (!page) {
if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
goto try_next;
And following the right results:
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
1618 $retval = 1
4960 $retval = 2
41315 $retval = 4
103521 $retval = 8
swapoff time: 12.19s
So, not as good as the fixed-size readahead, but it's definitely an
improvement, considering that the swapoff time is ~22s without this
patch applied.
I think this change can be a simple and reasonable compromise.
Thanks again and sorry for the noise,
-Andrea
On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote:
> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
> > Andrea Righi <[email protected]> writes:
> >
> > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> > >> Andrea Righi <[email protected]> writes:
> > >>
> > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> > >> >> Andrea Righi <[email protected]> writes:
> > >> >>
> > >> >> [snip]
> > >> >>
> > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > >> >> > index ebed37bbf7a3..c71abc8df304 100644
> > >> >> > --- a/mm/swap_state.c
> > >> >> > +++ b/mm/swap_state.c
> > >> >> > @@ -20,6 +20,7 @@
> > >> >> > #include <linux/migrate.h>
> > >> >> > #include <linux/vmalloc.h>
> > >> >> > #include <linux/swap_slots.h>
> > >> >> > +#include <linux/oom.h>
> > >> >> > #include <linux/huge_mm.h>
> > >> >> >
> > >> >> > #include <asm/pgtable.h>
> > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> > >> >> > max_pages = 1 << READ_ONCE(page_cluster);
> > >> >> > if (max_pages <= 1)
> > >> >> > return 1;
> > >> >> > + /*
> > >> >> > + * If current task is using too much memory or swapoff is running
> > >> >> > + * simply use the max readahead size. Since we likely want to load a
> > >> >> > + * lot of pages back into memory, using a fixed-size max readhaead can
> > >> >> > + * give better performance in this case.
> > >> >> > + */
> > >> >> > + if (oom_task_origin(current))
> > >> >> > + return max_pages;
> > >> >> >
> > >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
> > >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> > >> >>
> > >> >> Thinks this again. If my understanding were correct, the accessing
> > >> >> pattern during swapoff is sequential, why swap readahead doesn't work?
> > >> >> If so, can you root cause that firstly?
> > >> >
> > >> > Theoretically if the pattern is sequential the current heuristic should
> > >> > already select a big readahead size, but apparently it's not doing that.
> > >> >
> > >> > I'll repeat my tests tracing the readahead size during swapoff to see
> > >> > exactly what's going on here.
> > >>
> > >> I haven't verify it. It may be helpful to call lookup_swap_cache()
> > >> before swapin_readahead() in unuse_pte_range(). The theory behind it is
> > >> to update the swap readahead statistics via lookup_swap_cache().
> > >
> > > I did more tests trying to collect some useful information.
> > >
> > > In particular I've been focusing at tracing the distribution of the
> > > values returned by swapin_nr_pages() in different scenarios.
> > >
> > > To do so I made swapin_nr_pages() trace-able and I used the following
> > > bcc command to measure the distrubution of the returned values:
> > >
> > > # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
> > >
> > > I've collected this metric in the following scenarios:
> > > - 5.6 vanilla
> > > - 5.6 + lookup_swap_cache() before swapin_readahead() in
> > > unuse_pte_range()
> > > - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> > > in unuse_pte_range()
> > > - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
> > > before swapin_readahead() in unuse_pte_range()
> > >
> > > Each kernel has been tested both with swappiness=0 and swappiness=60.
> > > Results are pretty much identical changing the swappiness, so I'm just
> > > reporting the default case here (w/ swappiness=60).
> > >
> > > Result
> > > ======
> > >
> > > = swapoff performance (elapsed time) =
> > >
> > > vanilla 22.09s
> > > lookup_swap_cache() 23.87s
> > > hits++ 16.10s
> > > hits=last_ra_pages 8.81s
> > >
> > > = swapin_nr_pages() $retval distribution =
> > >
> > > 5.6 vanilla:
> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > > COUNT EVENT
> > > 36948 $retval = 8
> > > 44151 $retval = 4
> > > 49290 $retval = 1
> > > 527771 $retval = 2
> > >
> > > 5.6 lookup_swap_cache() before swapin_readahead():
> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > > COUNT EVENT
> > > 13093 $retval = 1
> > > 56703 $retval = 8
> > > 123067 $retval = 2
> > > 366118 $retval = 4
> > >
> > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > > COUNT EVENT
> > > 2589 $retval = 1
> > > 8016 $retval = 2
> > > 40021 $retval = 8
> > > 566038 $retval = 4
> > >
> > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > > COUNT EVENT
> > > 785 $retval = 2
> > > 1072 $retval = 1
> > > 21844 $retval = 4
> > > 644168 $retval = 8
> > >
> > > In the vanilla case, the readahead heuristic seems to choose 2 pages
> > > most of the time. This is because we are not properly considering the
> > > hits (hits are always 0 in the swapoff code path) and, as you correctly
> > > pointed out, we can fix this by calling lookup_swap_cache() in
> > > unuse_pte_range() before calling swapin_readahead().
> > >
> > > With this change the distribution of the readahead size moves more
> > > toward 4 pages, but we still have some 2s. That looks good, however it
> > > doesn't seem to speed up swapoff very much... maybe because calling
> > > lookup_swap_cache() introduces a small overhead? (still need to
> > > investigate about this theory).
> > >
> > > In the next test I've tried to always increment hits by 1 before calling
> > > swapin_readahead() in unuse_pte_range(). This is basically cheating,
> > > because I'm faking the hit ratio, forcing the heuristic to use a larger
> > > readahead size; in fact, the readahead size moves even more toward 4
> > > pages and swapoff performance are a little better now.
> > >
> > > Pushing even more the "cheating" I can pretend that the previous
> > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> > > I'm forcing the heuristic to move toward the max size and keep using it.
> > > The result here is pretty much identical to my fixed-size patch, because
> > > swapin_nr_pages() returns the max readahead size pretty much all the
> > > time during swapoff (8 pages or, more in general, vm.page-cluster).
> > >
> > > Personally I don't like very much forcing the heuristic in this way,
> > > it'd be nice if it would just work by accounting the proper hit ratio
> > > (so just by adding lookup_swap_cache() as you correctly suggested), but
> > > this solution doesn't seem to improve performance in reality. For this
> > > reason I still think we should consider the swapoff scenario like a
> > > special one and somehow bypass the readahead heuristic and always return
> > > the max readahead size.
> > >
> > > Looking at the hits of the previous step in the swapoff case just
> > > doesn't work, because we may have some misses, but they will become hits
> > > very soon, since we are reading all the swapped out pages back into
> > > memory. This is why using the max readahead size gives better
> > > swapoff performance.
> > >
> > > What do you think?
> >
> > From your description, it appears that you are using cluster readahead
> > instead of vma readahead. Can you verify this via,
> >
> > # cat /sys/kernel/mm/swap/vma_ra_enabled
>
> # cat /sys/kernel/mm/swap/vma_ra_enabled
> true
>
> However, it's still using the cluster readahead because I'm using a
> swap file and nr_rotate_swap=1, so, according to the following, it's
> never using the vma readahead:
>
> static inline bool swap_use_vma_readahead(void)
> {
> return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> }
>
> I'll investigate more on this, I think there's no reason to prevent the
> usage of vma readahead if the underlying device is non-rotational.
A few more details about this.
Even though I have vma_ra_enabled set (in
/sys/kernel/mm/swap/vma_ra_enabled), it looks like my virtio_blk device
is considered rotational by default:
$ cat /sys/block/vda/queue/rotational
1
Therefore the vma readahead is not used. If I mark the device as
non-rotational (rotational=0) then the vma readahead is used; this is
confirmed by the fact that swapin_nr_pages() is hardly called anymore:
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
13 $retval = 1
18 $retval = 2
23 $retval = 4
29 $retval = 8
swapoff time: 16.44s
In terms of swapoff performance vma readahead works better than the
vanilla cluster readahead (~21s), but the "improved" cluster readahead
(with lookup_swap_cache() in unuse_pte_range()) still gives slightly
better results (~12s).
-Andrea
Andrea Righi <[email protected]> writes:
> On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
>> Andrea Righi <[email protected]> writes:
>>
>> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
>> >> Andrea Righi <[email protected]> writes:
>> >>
>> >> [snip]
>> >>
>> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> >> > index ebed37bbf7a3..c71abc8df304 100644
>> >> > --- a/mm/swap_state.c
>> >> > +++ b/mm/swap_state.c
>> >> > @@ -20,6 +20,7 @@
>> >> > #include <linux/migrate.h>
>> >> > #include <linux/vmalloc.h>
>> >> > #include <linux/swap_slots.h>
>> >> > +#include <linux/oom.h>
>> >> > #include <linux/huge_mm.h>
>> >> >
>> >> > #include <asm/pgtable.h>
>> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>> >> > max_pages = 1 << READ_ONCE(page_cluster);
>> >> > if (max_pages <= 1)
>> >> > return 1;
>> >> > + /*
>> >> > + * If current task is using too much memory or swapoff is running
>> >> > + * simply use the max readahead size. Since we likely want to load a
>> >> > + * lot of pages back into memory, using a fixed-size max readhaead can
>> >> > + * give better performance in this case.
>> >> > + */
>> >> > + if (oom_task_origin(current))
>> >> > + return max_pages;
>> >> >
>> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
>> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
>> >>
>> >> Thinks this again. If my understanding were correct, the accessing
>> >> pattern during swapoff is sequential, why swap readahead doesn't work?
>> >> If so, can you root cause that firstly?
>> >
>> > Theoretically if the pattern is sequential the current heuristic should
>> > already select a big readahead size, but apparently it's not doing that.
>> >
>> > I'll repeat my tests tracing the readahead size during swapoff to see
>> > exactly what's going on here.
>>
>> I haven't verify it. It may be helpful to call lookup_swap_cache()
>> before swapin_readahead() in unuse_pte_range(). The theory behind it is
>> to update the swap readahead statistics via lookup_swap_cache().
>
> I did more tests trying to collect some useful information.
>
> In particular I've been focusing at tracing the distribution of the
> values returned by swapin_nr_pages() in different scenarios.
>
> To do so I made swapin_nr_pages() trace-able and I used the following
> bcc command to measure the distrubution of the returned values:
>
> # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
>
> I've collected this metric in the following scenarios:
> - 5.6 vanilla
> - 5.6 + lookup_swap_cache() before swapin_readahead() in
> unuse_pte_range()
> - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> in unuse_pte_range()
> - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
> before swapin_readahead() in unuse_pte_range()
>
> Each kernel has been tested both with swappiness=0 and swappiness=60.
> Results are pretty much identical changing the swappiness, so I'm just
> reporting the default case here (w/ swappiness=60).
>
> Result
> ======
>
> = swapoff performance (elapsed time) =
>
> vanilla 22.09s
> lookup_swap_cache() 23.87s
> hits++ 16.10s
> hits=last_ra_pages 8.81s
>
> = swapin_nr_pages() $retval distribution =
>
> 5.6 vanilla:
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 36948 $retval = 8
> 44151 $retval = 4
> 49290 $retval = 1
> 527771 $retval = 2
>
> 5.6 lookup_swap_cache() before swapin_readahead():
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 13093 $retval = 1
> 56703 $retval = 8
> 123067 $retval = 2
> 366118 $retval = 4
>
> 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 2589 $retval = 1
> 8016 $retval = 2
> 40021 $retval = 8
> 566038 $retval = 4
>
> 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 785 $retval = 2
> 1072 $retval = 1
> 21844 $retval = 4
> 644168 $retval = 8
>
> In the vanilla case, the readahead heuristic seems to choose 2 pages
> most of the time. This is because we are not properly considering the
> hits (hits are always 0 in the swapoff code path) and, as you correctly
> pointed out, we can fix this by calling lookup_swap_cache() in
> unuse_pte_range() before calling swapin_readahead().
>
> With this change the distribution of the readahead size moves more
> toward 4 pages, but we still have some 2s. That looks good, however it
> doesn't seem to speed up swapoff very much... maybe because calling
> lookup_swap_cache() introduces a small overhead? (still need to
> investigate about this theory).
>
> In the next test I've tried to always increment hits by 1 before calling
> swapin_readahead() in unuse_pte_range(). This is basically cheating,
> because I'm faking the hit ratio, forcing the heuristic to use a larger
> readahead size; in fact, the readahead size moves even more toward 4
> pages and swapoff performance are a little better now.
>
> Pushing even more the "cheating" I can pretend that the previous
> readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> I'm forcing the heuristic to move toward the max size and keep using it.
> The result here is pretty much identical to my fixed-size patch, because
> swapin_nr_pages() returns the max readahead size pretty much all the
> time during swapoff (8 pages or, more in general, vm.page-cluster).
>
> Personally I don't like very much forcing the heuristic in this way,
> it'd be nice if it would just work by accounting the proper hit ratio
> (so just by adding lookup_swap_cache() as you correctly suggested), but
> this solution doesn't seem to improve performance in reality. For this
> reason I still think we should consider the swapoff scenario like a
> special one and somehow bypass the readahead heuristic and always return
> the max readahead size.
>
> Looking at the hits of the previous step in the swapoff case just
> doesn't work, because we may have some misses, but they will become hits
> very soon, since we are reading all the swapped out pages back into
> memory. This is why using the max readahead size gives better
> swapoff performance.
>
> What do you think?
From your description, it appears that you are using cluster readahead
instead of vma readahead. Can you verify this via,
# cat /sys/kernel/mm/swap/vma_ra_enabled
And if it returns false, you can enable it via,
# echo 1 > /sys/kernel/mm/swap/vma_ra_enabled
Because the swapoff code now swaps in pages in page table order instead
of swap entry order. But this will also turn sequential disk reads into
random disk reads. Let's see the performance results.
And please make sure that in unuse_pte_range(), after
lookup_swap_cache() returns a non-NULL page, it's unnecessary to call
swapin_readahead().
Best Regards,
Huang, Ying
> Thanks,
> -Andrea
On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> >> Andrea Righi <[email protected]> writes:
> >>
> >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> >> >> Andrea Righi <[email protected]> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> >> > index ebed37bbf7a3..c71abc8df304 100644
> >> >> > --- a/mm/swap_state.c
> >> >> > +++ b/mm/swap_state.c
> >> >> > @@ -20,6 +20,7 @@
> >> >> > #include <linux/migrate.h>
> >> >> > #include <linux/vmalloc.h>
> >> >> > #include <linux/swap_slots.h>
> >> >> > +#include <linux/oom.h>
> >> >> > #include <linux/huge_mm.h>
> >> >> >
> >> >> > #include <asm/pgtable.h>
> >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> >> >> > max_pages = 1 << READ_ONCE(page_cluster);
> >> >> > if (max_pages <= 1)
> >> >> > return 1;
> >> >> > + /*
> >> >> > + * If current task is using too much memory or swapoff is running
> >> >> > + * simply use the max readahead size. Since we likely want to load a
> >> >> > + * lot of pages back into memory, using a fixed-size max readhaead can
> >> >> > + * give better performance in this case.
> >> >> > + */
> >> >> > + if (oom_task_origin(current))
> >> >> > + return max_pages;
> >> >> >
> >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
> >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> >> >>
> >> >> Thinks this again. If my understanding were correct, the accessing
> >> >> pattern during swapoff is sequential, why swap readahead doesn't work?
> >> >> If so, can you root cause that firstly?
> >> >
> >> > Theoretically if the pattern is sequential the current heuristic should
> >> > already select a big readahead size, but apparently it's not doing that.
> >> >
> >> > I'll repeat my tests tracing the readahead size during swapoff to see
> >> > exactly what's going on here.
> >>
> >> I haven't verify it. It may be helpful to call lookup_swap_cache()
> >> before swapin_readahead() in unuse_pte_range(). The theory behind it is
> >> to update the swap readahead statistics via lookup_swap_cache().
> >
> > I did more tests trying to collect some useful information.
> >
> > In particular I've been focusing at tracing the distribution of the
> > values returned by swapin_nr_pages() in different scenarios.
> >
> > To do so I made swapin_nr_pages() trace-able and I used the following
> > bcc command to measure the distrubution of the returned values:
> >
> > # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
> >
> > I've collected this metric in the following scenarios:
> > - 5.6 vanilla
> > - 5.6 + lookup_swap_cache() before swapin_readahead() in
> > unuse_pte_range()
> > - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> > in unuse_pte_range()
> > - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
> > before swapin_readahead() in unuse_pte_range()
> >
> > Each kernel has been tested both with swappiness=0 and swappiness=60.
> > Results are pretty much identical changing the swappiness, so I'm just
> > reporting the default case here (w/ swappiness=60).
> >
> > Result
> > ======
> >
> > = swapoff performance (elapsed time) =
> >
> > vanilla 22.09s
> > lookup_swap_cache() 23.87s
> > hits++ 16.10s
> > hits=last_ra_pages 8.81s
> >
> > = swapin_nr_pages() $retval distribution =
> >
> > 5.6 vanilla:
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > COUNT EVENT
> > 36948 $retval = 8
> > 44151 $retval = 4
> > 49290 $retval = 1
> > 527771 $retval = 2
> >
> > 5.6 lookup_swap_cache() before swapin_readahead():
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > COUNT EVENT
> > 13093 $retval = 1
> > 56703 $retval = 8
> > 123067 $retval = 2
> > 366118 $retval = 4
> >
> > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > COUNT EVENT
> > 2589 $retval = 1
> > 8016 $retval = 2
> > 40021 $retval = 8
> > 566038 $retval = 4
> >
> > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > COUNT EVENT
> > 785 $retval = 2
> > 1072 $retval = 1
> > 21844 $retval = 4
> > 644168 $retval = 8
> >
> > In the vanilla case, the readahead heuristic seems to choose 2 pages
> > most of the time. This is because we are not properly considering the
> > hits (hits are always 0 in the swapoff code path) and, as you correctly
> > pointed out, we can fix this by calling lookup_swap_cache() in
> > unuse_pte_range() before calling swapin_readahead().
> >
> > With this change the distribution of the readahead size moves more
> > toward 4 pages, but we still have some 2s. That looks good, however it
> > doesn't seem to speed up swapoff very much... maybe because calling
> > lookup_swap_cache() introduces a small overhead? (still need to
> > investigate about this theory).
> >
> > In the next test I've tried to always increment hits by 1 before calling
> > swapin_readahead() in unuse_pte_range(). This is basically cheating,
> > because I'm faking the hit ratio, forcing the heuristic to use a larger
> > readahead size; in fact, the readahead size moves even more toward 4
> > pages and swapoff performance are a little better now.
> >
> > Pushing even more the "cheating" I can pretend that the previous
> > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> > I'm forcing the heuristic to move toward the max size and keep using it.
> > The result here is pretty much identical to my fixed-size patch, because
> > swapin_nr_pages() returns the max readahead size pretty much all the
> > time during swapoff (8 pages or, more in general, vm.page-cluster).
> >
> > Personally I don't like very much forcing the heuristic in this way,
> > it'd be nice if it would just work by accounting the proper hit ratio
> > (so just by adding lookup_swap_cache() as you correctly suggested), but
> > this solution doesn't seem to improve performance in reality. For this
> > reason I still think we should consider the swapoff scenario like a
> > special one and somehow bypass the readahead heuristic and always return
> > the max readahead size.
> >
> > Looking at the hits of the previous step in the swapoff case just
> > doesn't work, because we may have some misses, but they will become hits
> > very soon, since we are reading all the swapped out pages back into
> > memory. This is why using the max readahead size gives better
> > swapoff performance.
> >
> > What do you think?
>
> From your description, it appears that you are using cluster readahead
> instead of vma readahead. Can you verify this via,
>
> # cat /sys/kernel/mm/swap/vma_ra_enabled
# cat /sys/kernel/mm/swap/vma_ra_enabled
true
However, it's still using the cluster readahead because I'm using a
swap file and nr_rotate_swap=1, so, according to the following, it's
never using the vma readahead:
static inline bool swap_use_vma_readahead(void)
{
return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
}
I'll investigate this more; I think there's no reason to prevent the
use of vma readahead if the underlying device is non-rotational.
>
> And if it returns false, you can enable it via,
>
> # echo 1 > /sys/kernel/mm/swap/vma_ra_enabled
>
> Because now swapoff code swapin pages in the page table order instead of
> the swap entry order. But this will turn the sequential disk read to
> random disk read too. Let's see the performance results.
>
> And please make sure that in unuse_pte_range(), after
> lookup_swap_cache() returns non-NULL page, it's unnecessary to call
> swapin_readahead().
This is what I was missing! Sorry about that: I was still calling
swapin_readahead() even if the page was already in the swap cache; that
was unnecessary overhead and it was messing up the hit ratio!
With lookup_swap_cache() applied properly everything looks way better:
r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
COUNT EVENT
378 $retval = 1
2138 $retval = 2
25365 $retval = 4
105314 $retval = 8
swapoff time: 9.87s
So, basically it gives the same performance result as my fixed-size
approach, but it's definitely a better and cleaner solution.
Just to make it clear, here's the patch that I applied (if it looks good
to you I can send another version with a better description):
mm/swapfile.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9fd47e6f7a86..cb9eb517178d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
vmf.pmd = pmd;
last_ra = atomic_read(&last_readahead_pages);
atomic_set(&swapin_readahead_hits, last_ra);
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
+ page = lookup_swap_cache(entry, vma, addr);
+ if (!page)
+ page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
if (!page) {
if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
goto try_next;
Thanks!
-Andrea
Andrea Righi <[email protected]> writes:
> On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote:
>> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
>> > Andrea Righi <[email protected]> writes:
>> >
>> > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
>> > >> Andrea Righi <[email protected]> writes:
>> > >>
>> > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
>> > >> >> Andrea Righi <[email protected]> writes:
>> > >> >>
>> > >> >> [snip]
>> > >> >>
>> > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > >> >> > index ebed37bbf7a3..c71abc8df304 100644
>> > >> >> > --- a/mm/swap_state.c
>> > >> >> > +++ b/mm/swap_state.c
>> > >> >> > @@ -20,6 +20,7 @@
>> > >> >> > #include <linux/migrate.h>
>> > >> >> > #include <linux/vmalloc.h>
>> > >> >> > #include <linux/swap_slots.h>
>> > >> >> > +#include <linux/oom.h>
>> > >> >> > #include <linux/huge_mm.h>
>> > >> >> >
>> > >> >> > #include <asm/pgtable.h>
>> > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
>> > >> >> > max_pages = 1 << READ_ONCE(page_cluster);
>> > >> >> > if (max_pages <= 1)
>> > >> >> > return 1;
>> > >> >> > + /*
>> > >> >> > + * If current task is using too much memory or swapoff is running
>> > >> >> > + * simply use the max readahead size. Since we likely want to load a
>> > >> >> > + * lot of pages back into memory, using a fixed-size max readhaead can
>> > >> >> > + * give better performance in this case.
>> > >> >> > + */
>> > >> >> > + if (oom_task_origin(current))
>> > >> >> > + return max_pages;
>> > >> >> >
>> > >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
>> > >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
>> > >> >>
>> > >> >> Thinks this again. If my understanding were correct, the accessing
>> > >> >> pattern during swapoff is sequential, why swap readahead doesn't work?
>> > >> >> If so, can you root cause that firstly?
>> > >> >
>> > >> > Theoretically if the pattern is sequential the current heuristic should
>> > >> > already select a big readahead size, but apparently it's not doing that.
>> > >> >
>> > >> > I'll repeat my tests tracing the readahead size during swapoff to see
>> > >> > exactly what's going on here.
>> > >>
>> > >> I haven't verify it. It may be helpful to call lookup_swap_cache()
>> > >> before swapin_readahead() in unuse_pte_range(). The theory behind it is
>> > >> to update the swap readahead statistics via lookup_swap_cache().
>> > >
>> > > I did more tests trying to collect some useful information.
>> > >
>> > > In particular I've been focusing at tracing the distribution of the
>> > > values returned by swapin_nr_pages() in different scenarios.
>> > >
>> > > To do so I made swapin_nr_pages() trace-able and I used the following
>> > > bcc command to measure the distrubution of the returned values:
>> > >
>> > > # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
>> > >
>> > > I've collected this metric in the following scenarios:
>> > > - 5.6 vanilla
>> > > - 5.6 + lookup_swap_cache() before swapin_readahead() in
>> > > unuse_pte_range()
>> > > - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
>> > > in unuse_pte_range()
>> > > - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
>> > > before swapin_readahead() in unuse_pte_range()
>> > >
>> > > Each kernel has been tested both with swappiness=0 and swappiness=60.
>> > > Results are pretty much identical changing the swappiness, so I'm just
>> > > reporting the default case here (w/ swappiness=60).
>> > >
>> > > Result
>> > > ======
>> > >
>> > > = swapoff performance (elapsed time) =
>> > >
>> > > vanilla 22.09s
>> > > lookup_swap_cache() 23.87s
>> > > hits++ 16.10s
>> > > hits=last_ra_pages 8.81s
>> > >
>> > > = swapin_nr_pages() $retval distribution =
>> > >
>> > > 5.6 vanilla:
>> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
>> > > COUNT EVENT
>> > > 36948 $retval = 8
>> > > 44151 $retval = 4
>> > > 49290 $retval = 1
>> > > 527771 $retval = 2
>> > >
>> > > 5.6 lookup_swap_cache() before swapin_readahead():
>> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
>> > > COUNT EVENT
>> > > 13093 $retval = 1
>> > > 56703 $retval = 8
>> > > 123067 $retval = 2
>> > > 366118 $retval = 4
>> > >
>> > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
>> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
>> > > COUNT EVENT
>> > > 2589 $retval = 1
>> > > 8016 $retval = 2
>> > > 40021 $retval = 8
>> > > 566038 $retval = 4
>> > >
>> > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
>> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
>> > > COUNT EVENT
>> > > 785 $retval = 2
>> > > 1072 $retval = 1
>> > > 21844 $retval = 4
>> > > 644168 $retval = 8
>> > >
>> > > In the vanilla case, the readahead heuristic seems to choose 2 pages
>> > > most of the time. This is because we are not properly considering the
>> > > hits (hits are always 0 in the swapoff code path) and, as you correctly
>> > > pointed out, we can fix this by calling lookup_swap_cache() in
>> > > unuse_pte_range() before calling swapin_readahead().
>> > >
>> > > With this change the distribution of the readahead size moves more
>> > > toward 4 pages, but we still have some 2s. That looks good, however it
>> > > doesn't seem to speed up swapoff very much... maybe because calling
>> > > lookup_swap_cache() introduces a small overhead? (still need to
>> > > investigate about this theory).
>> > >
>> > > In the next test I've tried to always increment hits by 1 before calling
>> > > swapin_readahead() in unuse_pte_range(). This is basically cheating,
>> > > because I'm faking the hit ratio, forcing the heuristic to use a larger
>> > > readahead size; in fact, the readahead size moves even more toward 4
>> > > pages and swapoff performance are a little better now.
>> > >
>> > > Pushing even more the "cheating" I can pretend that the previous
>> > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
>> > > I'm forcing the heuristic to move toward the max size and keep using it.
>> > > The result here is pretty much identical to my fixed-size patch, because
>> > > swapin_nr_pages() returns the max readahead size pretty much all the
>> > > time during swapoff (8 pages or, more in general, vm.page-cluster).
>> > >
>> > > Personally I don't like very much forcing the heuristic in this way,
>> > > it'd be nice if it would just work by accounting the proper hit ratio
>> > > (so just by adding lookup_swap_cache() as you correctly suggested), but
>> > > this solution doesn't seem to improve performance in reality. For this
>> > > reason I still think we should consider the swapoff scenario like a
>> > > special one and somehow bypass the readahead heuristic and always return
>> > > the max readahead size.
>> > >
>> > > Looking at the hits of the previous step in the swapoff case just
>> > > doesn't work, because we may have some misses, but they will become hits
>> > > very soon, since we are reading all the swapped out pages back into
>> > > memory. This is why using the max readahead size gives better
>> > > swapoff performance.
>> > >
>> > > What do you think?
>> >
>> > From your description, it appears that you are using cluster readahead
>> > instead of vma readahead. Can you verify this via,
>> >
>> > # cat /sys/kernel/mm/swap/vma_ra_enabled
>>
>> # cat /sys/kernel/mm/swap/vma_ra_enabled
>> true
>>
>> However, it's still using the cluster readahead because I'm using a
>> swap file and nr_rotate_swap=1, so, according to the following, it's
>> never using the vma readahead:
>>
>> static inline bool swap_use_vma_readahead(void)
>> {
>> return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> }
>>
>> I'll investigate more on this, I think there's no reason to prevent the
>> usage of vma readahead if the underlying device is non-rotational.
>
> Few more details about this.
>
> Even if I have vma_ra_enabled (in /sys/kernel/mm/swap/vma_ra_enabled) it
> looks like my virtio_blk device is considered rotational by default:
>
> $ cat /sys/block/vda/queue/rotational
> 1
>
> Therefore the vma readahead is not used. If I change this to
> "rotational" then the vma readahead is used; this is confirmed by the
> fact that swapin_nr_pages isn't called anymore:
>
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 13 $retval = 1
> 18 $retval = 2
> 23 $retval = 4
> 29 $retval = 8
>
> swapoff time: 16.44s
>
> In terms of swapoff performance vma readahead works better than the
> vanilla cluster readahead (~21s), but the "improved" cluster readahead
> (with lookup_swap_cache() in unuse_pte_range()) still gives slightly
> better results (~12s).
Thanks for testing. Can you check the return value of
__swapin_nr_pages() to verify whether vma readahead works as expected?
Best Regards,
Huang, Ying
> -Andrea
Andrea Righi <[email protected]> writes:
> On Wed, Apr 15, 2020 at 03:44:08PM +0800, Huang, Ying wrote:
>> Andrea Righi <[email protected]> writes:
>>
>> > mm/swapfile.c | 4 +++-
>> > 1 file changed, 3 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 9fd47e6f7a86..cb9eb517178d 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -1944,7 +1944,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> > vmf.pmd = pmd;
>> > last_ra = atomic_read(&last_readahead_pages);
>> > atomic_set(&swapin_readahead_hits, last_ra);
>>
>> You need to remove the above 2 lines firstly.
>
> Meh... too much enthusiasm, and I definitely need more coffee this
> morning. Here's the right patch applied:
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5871a2aa86a5..8b38441b66fa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1940,7 +1940,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> vmf.vma = vma;
> vmf.address = addr;
> vmf.pmd = pmd;
> - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
> + page = lookup_swap_cache(entry, vma, addr);
> + if (!page)
> + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
The vmf assignment can be moved inside the "if" block. Otherwise the
patch looks good to me.
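i.e. something along these lines (a sketch of the suggested cleanup,
not a diff that was actually posted in this thread):

	/* Only set up vmf when we actually fall back to readahead. */
	page = lookup_swap_cache(entry, vma, addr);
	if (!page) {
		vmf.vma = vma;
		vmf.address = addr;
		vmf.pmd = pmd;
		page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf);
	}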
> if (!page) {
> if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD)
> goto try_next;
>
> And following the right results:
>
> r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> COUNT EVENT
> 1618 $retval = 1
> 4960 $retval = 2
> 41315 $retval = 4
> 103521 $retval = 8
>
> swapoff time: 12.19s
>
> So, not as good as the fixed-size readahead, but it's definitely an
> improvement, considering that the swapoff time is ~22s without this
> patch applied.
>
> I think this change can be a simple and reasonable compromise.
Yes. I think so too.
Best Regards,
Huang, Ying
> Thanks again and sorry for the noise,
> -Andrea
On Thu, Apr 16, 2020 at 08:41:39AM +0800, Huang, Ying wrote:
> Andrea Righi <[email protected]> writes:
>
> > On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote:
> >> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
> >> > Andrea Righi <[email protected]> writes:
> >> >
> >> > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> >> > >> Andrea Righi <[email protected]> writes:
> >> > >>
> >> > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> >> > >> >> Andrea Righi <[email protected]> writes:
> >> > >> >>
> >> > >> >> [snip]
> >> > >> >>
> >> > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> > >> >> > index ebed37bbf7a3..c71abc8df304 100644
> >> > >> >> > --- a/mm/swap_state.c
> >> > >> >> > +++ b/mm/swap_state.c
> >> > >> >> > @@ -20,6 +20,7 @@
> >> > >> >> > #include <linux/migrate.h>
> >> > >> >> > #include <linux/vmalloc.h>
> >> > >> >> > #include <linux/swap_slots.h>
> >> > >> >> > +#include <linux/oom.h>
> >> > >> >> > #include <linux/huge_mm.h>
> >> > >> >> >
> >> > >> >> > #include <asm/pgtable.h>
> >> > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> >> > >> >> > max_pages = 1 << READ_ONCE(page_cluster);
> >> > >> >> > if (max_pages <= 1)
> >> > >> >> > return 1;
> >> > >> >> > + /*
> >> > >> >> > +	 * If the current task is using too much memory or swapoff is running,
> >> > >> >> > + * simply use the max readahead size. Since we likely want to load a
> >> > >> >> > +	 * lot of pages back into memory, using a fixed-size max readahead can
> >> > >> >> > + * give better performance in this case.
> >> > >> >> > + */
> >> > >> >> > + if (oom_task_origin(current))
> >> > >> >> > + return max_pages;
> >> > >> >> >
> >> > >> >> > hits = atomic_xchg(&swapin_readahead_hits, 0);
> >> > >> >> > pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> >> > >> >>
> >> > >> Think about this again. If my understanding is correct, the access
> >> > >> pattern during swapoff is sequential, so why doesn't swap readahead work?
> >> > >> If so, can you root-cause that first?
> >> > >> >
> >> > >> > Theoretically if the pattern is sequential the current heuristic should
> >> > >> > already select a big readahead size, but apparently it's not doing that.
> >> > >> >
> >> > >> > I'll repeat my tests tracing the readahead size during swapoff to see
> >> > >> > exactly what's going on here.
> >> > >>
> >> > >> I haven't verified it. It may be helpful to call lookup_swap_cache()
> >> > >> before swapin_readahead() in unuse_pte_range(). The theory behind it is
> >> > >> to update the swap readahead statistics via lookup_swap_cache().
> >> > >
> >> > > I did more tests trying to collect some useful information.
> >> > >
> >> > > In particular I've been focusing at tracing the distribution of the
> >> > > values returned by swapin_nr_pages() in different scenarios.
> >> > >
> >> > > To do so I made swapin_nr_pages() traceable and I used the following
> >> > > bcc command to measure the distribution of the returned values:
> >> > >
> >> > > # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
> >> > >
> >> > > I've collected this metric in the following scenarios:
> >> > > - 5.6 vanilla
> >> > > - 5.6 + lookup_swap_cache() before swapin_readahead() in
> >> > > unuse_pte_range()
> >> > > - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> >> > > in unuse_pte_range()
> >> > > - 5.6 + swapin_readahead_hits=last_readahead_pages (set atomically)
> >> > > before swapin_readahead() in unuse_pte_range()
> >> > >
> >> > > Each kernel has been tested both with swappiness=0 and swappiness=60.
> >> > > Results are pretty much identical with either setting, so I'm just
> >> > > reporting the default case here (w/ swappiness=60).
> >> > >
> >> > > Result
> >> > > ======
> >> > >
> >> > > = swapoff performance (elapsed time) =
> >> > >
> >> > > vanilla 22.09s
> >> > > lookup_swap_cache() 23.87s
> >> > > hits++ 16.10s
> >> > > hits=last_ra_pages 8.81s
> >> > >
> >> > > = swapin_nr_pages() $retval distribution =
> >> > >
> >> > > 5.6 vanilla:
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > COUNT EVENT
> >> > > 36948 $retval = 8
> >> > > 44151 $retval = 4
> >> > > 49290 $retval = 1
> >> > > 527771 $retval = 2
> >> > >
> >> > > 5.6 lookup_swap_cache() before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > COUNT EVENT
> >> > > 13093 $retval = 1
> >> > > 56703 $retval = 8
> >> > > 123067 $retval = 2
> >> > > 366118 $retval = 4
> >> > >
> >> > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > COUNT EVENT
> >> > > 2589 $retval = 1
> >> > > 8016 $retval = 2
> >> > > 40021 $retval = 8
> >> > > 566038 $retval = 4
> >> > >
> >> > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > COUNT EVENT
> >> > > 785 $retval = 2
> >> > > 1072 $retval = 1
> >> > > 21844 $retval = 4
> >> > > 644168 $retval = 8
> >> > >
> >> > > In the vanilla case, the readahead heuristic seems to choose 2 pages
> >> > > most of the time. This is because we are not properly considering the
> >> > > hits (hits are always 0 in the swapoff code path) and, as you correctly
> >> > > pointed out, we can fix this by calling lookup_swap_cache() in
> >> > > unuse_pte_range() before calling swapin_readahead().
> >> > >
> >> > > With this change the distribution of the readahead size moves more
> >> > > toward 4 pages, but we still have some 2s. That looks good, but it
> >> > > doesn't seem to speed up swapoff very much... maybe because calling
> >> > > lookup_swap_cache() introduces a small overhead? (I still need to
> >> > > investigate this theory.)
> >> > >
> >> > > In the next test I've tried to always increment hits by 1 before calling
> >> > > swapin_readahead() in unuse_pte_range(). This is basically cheating,
> >> > > because I'm faking the hit ratio, forcing the heuristic to use a larger
> >> > > readahead size; in fact, the readahead size moves even more toward 4
> >> > > pages and swapoff performance is a little better now.
> >> > >
> >> > > Pushing the "cheating" even further, I can pretend that the previous
> >> > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> >> > > I'm forcing the heuristic to move toward the max size and keep using it.
> >> > > The result here is pretty much identical to my fixed-size patch, because
> >> > > swapin_nr_pages() returns the max readahead size pretty much all the
> >> > > time during swapoff (8 pages here or, more generally, 1 << vm.page-cluster).
> >> > >
> >> > > Personally I don't much like forcing the heuristic in this way; it'd be
> >> > > nice if it just worked by accounting for the proper hit ratio (i.e. just
> >> > > by adding lookup_swap_cache() as you correctly suggested), but that
> >> > > solution doesn't seem to improve performance much in practice. For this
> >> > > reason I still think we should treat the swapoff scenario as a special
> >> > > case and somehow bypass the readahead heuristic, always returning the
> >> > > max readahead size.
> >> > >
> >> > > Looking at the hits of the previous step in the swapoff case just
> >> > > doesn't work, because we may have some misses, but they will become hits
> >> > > very soon, since we are reading all the swapped out pages back into
> >> > > memory. This is why using the max readahead size gives better
> >> > > swapoff performance.
> >> > >
> >> > > What do you think?
> >> >
> >> > From your description, it appears that you are using cluster readahead
> >> > instead of vma readahead. Can you verify this via,
> >> >
> >> > # cat /sys/kernel/mm/swap/vma_ra_enabled
> >>
> >> # cat /sys/kernel/mm/swap/vma_ra_enabled
> >> true
> >>
> >> However, it's still using the cluster readahead because I'm using a
> >> swap file and nr_rotate_swap=1, so, according to the following, it's
> >> never using the vma readahead:
> >>
> >> static inline bool swap_use_vma_readahead(void)
> >> {
> >> return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >> }
> >>
> >> I'll investigate this further; I think there's no reason to prevent the
> >> use of vma readahead when the underlying device is actually non-rotational.
> >
> > Few more details about this.
> >
> > Even though I have vma_ra_enabled set (in /sys/kernel/mm/swap/vma_ra_enabled), it
> > looks like my virtio_blk device is considered rotational by default:
> >
> > $ cat /sys/block/vda/queue/rotational
> > 1
> >
> > Therefore the vma readahead is not used. If I change this to
> > "non-rotational" (0) then the vma readahead is used; this is confirmed by the
> > fact that swapin_nr_pages() is hardly called anymore:
> >
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > COUNT EVENT
> > 13 $retval = 1
> > 18 $retval = 2
> > 23 $retval = 4
> > 29 $retval = 8
> >
> > swapoff time: 16.44s
> >
> > In terms of swapoff performance vma readahead works better than the
> > vanilla cluster readahead (~21s), but the "improved" cluster readahead
> > (with lookup_swap_cache() in unuse_pte_range()) still gives slightly
> > better results (~12s).
>
> Thanks for testing. Can you check the return value of
> __swapin_nr_pages() to verify whether vma readahead works as expected?
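Sure. Same argdist-bpfcc approach as above, just attached to __swapin_nr_pages()
(assuming it is made traceable the same way as swapin_nr_pages()), e.g.:
 # argdist-bpfcc -c -C 'r::__swapin_nr_pages():unsigned int:$retval'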
This is the vanilla kernel w/ vma readahead enabled:
r::__swapin_nr_pages():unsigned int:$retval
COUNT EVENT
1286 $retval = 1
6516 $retval = 4
81252 $retval = 8
260815 $retval = 2
swapoff time: 18.20s
And this is the kernel with the lookup_swap_cache() patch and vma
readahead enabled:
r::__swapin_nr_pages():unsigned int:$retval
COUNT EVENT
321 $retval = 2
509 $retval = 1
3017 $retval = 4
124090 $retval = 8
swapoff time: 15.33s
So, vma readahead seems to work as expected, and with the lookup_swap_cache()
patch swapoff is now also faster when vma readahead is enabled.
-Andrea