2009-04-20 20:26:35

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/3] mm: fix pageref leak in do_swap_page()

By the time the memory cgroup code is notified about a swapin we
already hold a reference on the fault page.

If the cgroup callback fails make sure to unlock AND release the page
or we leak the reference.

Signed-off-by: Johannes Weiner <[email protected]>
Cc: Balbir Singh <[email protected]>
---
mm/memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 366dab5..db126b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2536,8 +2536,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
ret = VM_FAULT_OOM;
- unlock_page(page);
- goto out;
+ goto out_page;
}

/*
@@ -2599,6 +2598,7 @@ out:
out_nomap:
mem_cgroup_cancel_charge_swapin(ptr);
pte_unmap_unlock(page_table, ptl);
+out_page:
unlock_page(page);
page_cache_release(page);
return ret;
--
1.6.2.1.135.gde769


2009-04-20 20:26:19

by Johannes Weiner

[permalink] [raw]
Subject: [patch 3/3][rfc] vmscan: batched swap slot allocation

Every swap slot allocation tries to be subsequent to the previous one
to help keeping the LRU order of anon pages intact when they are
swapped out.

With an increasing number of concurrent reclaimers, the average
distance between two subsequent slot allocations of one reclaimer
increases as well. The contiguous LRU list chunks each reclaimer
swaps out get 'multiplexed' on the swap space as they allocate the
slots concurrently.

2 processes isolating 15 pages each and allocating swap slots
concurrently:

#0 #1

page 0 slot 0 page 15 slot 1
page 1 slot 2 page 16 slot 3
page 2 slot 4 page 17 slot 5
...

-> average slot distance of 2

All reclaimers being equally fast, this becomes a problem when the
total number of concurrent reclaimers gets so high that even equal
distribution makes the average distance between the slots of one
reclaimer too wide for optimistic swap-in to compensate.

But right now, one reclaimer can take much longer than another one
because its pages are mapped into more page tables and it has thus
more work to do and the faster reclaimer will allocate multiple swap
slots between two slot allocations of the slower one.

This patch makes shrink_page_list() allocate swap slots in batches,
collecting all the anonymous memory pages in a list without
rescheduling and actual reclaim in between. And only after all anon
pages are swap cached, unmap and write-out starts for them.

While this does not fix the fundamental issue of slot distribution
increasing with reclaimers, it mitigates the problem by balancing the
resulting fragmentation equally between the allocators.

Signed-off-by: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70092fa..b3823fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
+ LIST_HEAD(swap_pages);
struct pagevec freed_pvec;
- int pgactivate = 0;
+ int pgactivate = 0, restart = 0;
unsigned long nr_reclaimed = 0;

cond_resched();

pagevec_init(&freed_pvec, 1);
+restart:
while (!list_empty(page_list)) {
struct address_space *mapping;
struct page *page;
int may_enter_fs;
int referenced;

- cond_resched();
+ if (list_empty(&swap_pages))
+ cond_resched();

page = lru_to_page(page_list);
list_del(&page->lru);

+ if (restart) {
+ /*
+ * We are allowed to do IO when we restart for
+ * swap pages.
+ */
+ may_enter_fs = 1;
+ /*
+ * Referenced pages will be sorted out by
+ * try_to_unmap() and unmapped (anon!) pages
+ * are not to be referenced anymore.
+ */
+ referenced = 0;
+ goto reclaim;
+ }
+
if (!trylock_page(page))
goto keep;

@@ -655,14 +673,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page)) {
- if (!(sc->gfp_mask & __GFP_IO))
- goto keep_locked;
- if (!add_to_swap(page))
- goto activate_locked;
- may_enter_fs = 1;
+ if (PageAnon(page)) {
+ if (!PageSwapCache(page)) {
+ if (!(sc->gfp_mask & __GFP_IO))
+ goto keep_locked;
+ if (!add_to_swap(page))
+ goto activate_locked;
+ } else if (!may_enter_fs)
+ /*
+ * It's no use to batch when we are
+ * not allocating swap for this GFP
+ * mask.
+ */
+ goto reclaim;
+ list_add(&page->lru, &swap_pages);
+ continue;
}

+ reclaim:
mapping = page_mapping(page);

/*
@@ -794,6 +822,11 @@ keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
+ if (!list_empty(&swap_pages)) {
+ list_splice_init(&swap_pages, page_list);
+ restart = 1;
+ goto restart;
+ }
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
__pagevec_free(&freed_pvec);
--
1.6.2.1.135.gde769

2009-04-20 20:26:49

by Johannes Weiner

[permalink] [raw]
Subject: [patch 2/3][rfc] swap: try to reuse freed slots in the allocation area

A swap slot for an anonymous memory page might get freed again just
after allocating it when further steps in the eviction process fail.

But the clustered slot allocation will go ahead allocating after this
now unused slot, leaving a hole at this position. Holes waste space
and act as a boundary for optimistic swap-in.

To avoid this, check if the next page to be swapped out can sensibly
be placed at this just freed position. And if so, point the next
cluster offset to it.

The acceptable 'look-back' distance is the number of slots swap-in
clustering uses as well so that the latter continues to get related
context when reading surrounding swap slots optimistically.

Signed-off-by: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
---
mm/swapfile.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..fc88278 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -484,6 +484,15 @@ static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
p->lowest_bit = offset;
if (offset > p->highest_bit)
p->highest_bit = offset;
+ /*
+ * If the next allocation is only some slots
+ * ahead, reuse this now free slot instead of
+ * leaving a hole.
+ */
+ if (p->cluster_next - offset <= 1 << page_cluster) {
+ p->cluster_next = offset;
+ p->cluster_nr++;
+ }
if (p->prio > swap_info[swap_list.next].prio)
swap_list.next = p - swap_info;
nr_swap_pages++;
--
1.6.2.1.135.gde769

2009-04-20 20:32:52

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

A test program creates an anonymous memory mapping the size of the
system's RAM (2G). It faults all pages of it linearly, then kicks off
128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
parallel, thereby evicting the first mapping onto swap.

The time is then taken for the initial mapping to get faulted in from
swap linearly again, thus measuring how bad the 128 reclaimers
distributed the pages on the swap space.

Average over 5 runs, standard deviation in parens:

swap-in user system total

old: 74.97s (0.38s) 0.52s (0.02s) 291.07s (3.28s) 2m52.66s (0m1.32s)
new: 45.26s (0.68s) 0.53s (0.01s) 250.47s (5.17s) 2m45.93s (0m2.63s)

where old is current mmotm snapshot 2009-04-17-15-19 and new is these
three patches applied to it.

Test program attached. Kernbench didn't show any differences on my
single core x86 laptop with 256mb ram (poor thing).


Attachments:
(No filename) (929.00 B)
contswap2.c (1.48 kB)
Download all attachments

2009-04-20 21:20:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Mon, 20 Apr 2009 22:31:19 +0200
Johannes Weiner <[email protected]> wrote:

> A test program creates an anonymous memory mapping the size of the
> system's RAM (2G). It faults all pages of it linearly, then kicks off
> 128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
> parallel, thereby evicting the first mapping onto swap.
>
> The time is then taken for the initial mapping to get faulted in from
> swap linearly again, thus measuring how bad the 128 reclaimers
> distributed the pages on the swap space.
>
> Average over 5 runs, standard deviation in parens:
>
> swap-in user system total
>
> old: 74.97s (0.38s) 0.52s (0.02s) 291.07s (3.28s) 2m52.66s (0m1.32s)
> new: 45.26s (0.68s) 0.53s (0.01s) 250.47s (5.17s) 2m45.93s (0m2.63s)
>
> where old is current mmotm snapshot 2009-04-17-15-19 and new is these
> three patches applied to it.
>
> Test program attached. Kernbench didn't show any differences on my
> single core x86 laptop with 256mb ram (poor thing).

qsbench is pretty good at fragmenting swapspace. It would be vaguely
interesting to see what effect you've had on its runtime.

I've found that qsbench's runtimes are fairly chaotic when it's
operating at the transition point between all-in-core and
madly-swapping, so a bit of thought and caution is needed.

I used to run it with

./qsbench -p 4 -m 96

on a 256MB machine and it had sufficiently repeatable runtimes to be
useful.

There's a copy of qsbench in
http://userweb.kernel.org/~akpm/stuff/ext3-tools.tar.gz


I wonder what effect this patch has upon hibernate/resume performance.

2009-04-20 21:40:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Mon, Apr 20, 2009 at 01:53:03PM -0700, Andrew Morton wrote:
> On Mon, 20 Apr 2009 22:31:19 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > A test program creates an anonymous memory mapping the size of the
> > system's RAM (2G). It faults all pages of it linearly, then kicks off
> > 128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
> > parallel, thereby evicting the first mapping onto swap.
> >
> > The time is then taken for the initial mapping to get faulted in from
> > swap linearly again, thus measuring how bad the 128 reclaimers
> > distributed the pages on the swap space.
> >
> > Average over 5 runs, standard deviation in parens:
> >
> > swap-in user system total
> >
> > old: 74.97s (0.38s) 0.52s (0.02s) 291.07s (3.28s) 2m52.66s (0m1.32s)
> > new: 45.26s (0.68s) 0.53s (0.01s) 250.47s (5.17s) 2m45.93s (0m2.63s)
> >
> > where old is current mmotm snapshot 2009-04-17-15-19 and new is these
> > three patches applied to it.
> >
> > Test program attached. Kernbench didn't show any differences on my
> > single core x86 laptop with 256mb ram (poor thing).
>
> qsbench is pretty good at fragmenting swapspace. It would be vaguely
> interesting to see what effect you've had on its runtime.
>
> I've found that qsbench's runtimes are fairly chaotic when it's
> operating at the transition point between all-in-core and
> madly-swapping, so a bit of thought and caution is needed.
>
> I used to run it with
>
> ./qsbench -p 4 -m 96
>
> on a 256MB machine and it had sufficiently repeatable runtimes to be
> useful.
>
> There's a copy of qsbench in
> http://userweb.kernel.org/~akpm/stuff/ext3-tools.tar.gz

Thanks a lot.

> I wonder what effect this patch has upon hibernate/resume performance.

Good point, I will test this.

2009-04-20 23:36:59

by Minchan Kim

[permalink] [raw]
Subject: Re: [patch 1/3] mm: fix pageref leak in do_swap_page()

On Mon, 20 Apr 2009 22:24:43 +0200
Johannes Weiner <[email protected]> wrote:

Nice catch!!
Reviewed-by: Minchan Kim <[email protected]>

> By the time the memory cgroup code is notified about a swapin we
> already hold a reference on the fault page.
>
> If the cgroup callback fails make sure to unlock AND release the page
> or we leak the reference.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Balbir Singh <[email protected]>
> ---
> mm/memory.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 366dab5..db126b6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2536,8 +2536,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> ret = VM_FAULT_OOM;
> - unlock_page(page);
> - goto out;
> + goto out_page;
> }
>
> /*
> @@ -2599,6 +2598,7 @@ out:
> out_nomap:
> mem_cgroup_cancel_charge_swapin(ptr);
> pte_unmap_unlock(page_table, ptl);
> +out_page:
> unlock_page(page);
> page_cache_release(page);
> return ret;
> --
> 1.6.2.1.135.gde769
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


--
Kinds Regards
Minchan Kim

2009-04-21 01:00:41

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Mon, 20 Apr 2009 22:24:45 +0200
Johannes Weiner <[email protected]> wrote:

> Every swap slot allocation tries to be subsequent to the previous one
> to help keeping the LRU order of anon pages intact when they are
> swapped out.
>
> With an increasing number of concurrent reclaimers, the average
> distance between two subsequent slot allocations of one reclaimer
> increases as well. The contiguous LRU list chunks each reclaimer
> swaps out get 'multiplexed' on the swap space as they allocate the
> slots concurrently.
>
> 2 processes isolating 15 pages each and allocating swap slots
> concurrently:
>
> #0 #1
>
> page 0 slot 0 page 15 slot 1
> page 1 slot 2 page 16 slot 3
> page 2 slot 4 page 17 slot 5
> ...
>
> -> average slot distance of 2
>
> All reclaimers being equally fast, this becomes a problem when the
> total number of concurrent reclaimers gets so high that even equal
> distribution makes the average distance between the slots of one
> reclaimer too wide for optimistic swap-in to compensate.
>
> But right now, one reclaimer can take much longer than another one
> because its pages are mapped into more page tables and it has thus
> more work to do and the faster reclaimer will allocate multiple swap
> slots between two slot allocations of the slower one.
>
> This patch makes shrink_page_list() allocate swap slots in batches,
> collecting all the anonymous memory pages in a list without
> rescheduling and actual reclaim in between. And only after all anon
> pages are swap cached, unmap and write-out starts for them.
>
> While this does not fix the fundamental issue of slot distribution
> increasing with reclaimers, it mitigates the problem by balancing the
> resulting fragmentation equally between the allocators.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> ---
> mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70092fa..b3823fe 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> enum pageout_io sync_writeback)
> {
> LIST_HEAD(ret_pages);
> + LIST_HEAD(swap_pages);
> struct pagevec freed_pvec;
> - int pgactivate = 0;
> + int pgactivate = 0, restart = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
>
> pagevec_init(&freed_pvec, 1);
> +restart:
> while (!list_empty(page_list)) {
> struct address_space *mapping;
> struct page *page;
> int may_enter_fs;
> int referenced;
>
> - cond_resched();
> + if (list_empty(&swap_pages))
> + cond_resched();
>
Why this ?

> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> + if (restart) {
> + /*
> + * We are allowed to do IO when we restart for
> + * swap pages.
> + */
> + may_enter_fs = 1;
> + /*
> + * Referenced pages will be sorted out by
> + * try_to_unmap() and unmapped (anon!) pages
> + * are not to be referenced anymore.
> + */
> + referenced = 0;
> + goto reclaim;
> + }
> +
> if (!trylock_page(page))
> goto keep;
>
Keeping multiple pages locked while they stay on private list ?

BTW, isn't it better to add "allocate multiple swap space at once" function
like
- void get_swap_pages(nr, swp_entry_array[])
? "nr" will not be bigger than SWAP_CLUSTER_MAX.

Regards,
-Kame

2009-04-21 03:15:25

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 1/3] mm: fix pageref leak in do_swap_page()

* Johannes Weiner <[email protected]> [2009-04-20 22:24:43]:

> By the time the memory cgroup code is notified about a swapin we
> already hold a reference on the fault page.
>
> If the cgroup callback fails make sure to unlock AND release the page
> or we leak the reference.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Balbir Singh <[email protected]>

Seems reasonable to me, could you make the changelog more verbose and
mention that lookup_swap_cache() gets a reference to the page and we
need to release the extra reference.

BTW, have you had any luck reproducing the issue? How did you catch
the problem?

--
Balbir

2009-04-21 03:45:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 1/3] mm: fix pageref leak in do_swap_page()

On Mon, 20 Apr 2009 22:24:43 +0200
Johannes Weiner <[email protected]> wrote:

> By the time the memory cgroup code is notified about a swapin we
> already hold a reference on the fault page.
>
> If the cgroup callback fails make sure to unlock AND release the page
> or we leak the reference.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Balbir Singh <[email protected]>

Wow, thanks.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>


> ---
> mm/memory.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 366dab5..db126b6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2536,8 +2536,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> ret = VM_FAULT_OOM;
> - unlock_page(page);
> - goto out;
> + goto out_page;
> }
>
> /*
> @@ -2599,6 +2598,7 @@ out:
> out_nomap:
> mem_cgroup_cancel_charge_swapin(ptr);
> pte_unmap_unlock(page_table, ptl);
> +out_page:
> unlock_page(page);
> page_cache_release(page);
> return ret;
> --
> 1.6.2.1.135.gde769
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-04-21 08:20:46

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 1/3] mm: fix pageref leak in do_swap_page()

On Tue, Apr 21, 2009 at 08:44:20AM +0530, Balbir Singh wrote:
> * Johannes Weiner <[email protected]> [2009-04-20 22:24:43]:
>
> > By the time the memory cgroup code is notified about a swapin we
> > already hold a reference on the fault page.
> >
> > If the cgroup callback fails make sure to unlock AND release the page
> > or we leak the reference.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: Balbir Singh <[email protected]>
>
> Seems reasonable to me, could you make the changelog more verbose and
> mention that lookup_swap_cache() gets a reference to the page and we
> need to release the extra reference.

Okay, I will add that information.

> BTW, have you had any luck reproducing the issue? How did you catch
> the problem?

I reviewed all the exit points when I shuffled code around in there
for another series that uses a lighter version of do_wp_page() for
swap write-faults. I never triggered that problem.

2009-04-21 08:47:15

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 1/3] mm: fix pageref leak in do_swap_page()

* Johannes Weiner <[email protected]> [2009-04-21 10:19:03]:

> On Tue, Apr 21, 2009 at 08:44:20AM +0530, Balbir Singh wrote:
> > * Johannes Weiner <[email protected]> [2009-04-20 22:24:43]:
> >
> > > By the time the memory cgroup code is notified about a swapin we
> > > already hold a reference on the fault page.
> > >
> > > If the cgroup callback fails make sure to unlock AND release the page
> > > or we leak the reference.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > Cc: Balbir Singh <[email protected]>
> >
> > Seems reasonable to me, could you make the changelog more verbose and
> > mention that lookup_swap_cache() gets a reference to the page and we
> > need to release the extra reference.
>
> Okay, I will add that information.
>

Thanks

> > BTW, have you had any luck reproducing the issue? How did you catch
> > the problem?
>
> I reviewed all the exit points when I shuffled code around in there
> for another series that uses a lighter version of do_wp_page() for
> swap write-faults. I never triggered that problem.
>

Good way of catching a problem, play with the code :)

--
Balbir

2009-04-21 08:54:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Tue, Apr 21, 2009 at 09:58:57AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 20 Apr 2009 22:24:45 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > Every swap slot allocation tries to be subsequent to the previous one
> > to help keeping the LRU order of anon pages intact when they are
> > swapped out.
> >
> > With an increasing number of concurrent reclaimers, the average
> > distance between two subsequent slot allocations of one reclaimer
> > increases as well. The contiguous LRU list chunks each reclaimer
> > swaps out get 'multiplexed' on the swap space as they allocate the
> > slots concurrently.
> >
> > 2 processes isolating 15 pages each and allocating swap slots
> > concurrently:
> >
> > #0 #1
> >
> > page 0 slot 0 page 15 slot 1
> > page 1 slot 2 page 16 slot 3
> > page 2 slot 4 page 17 slot 5
> > ...
> >
> > -> average slot distance of 2
> >
> > All reclaimers being equally fast, this becomes a problem when the
> > total number of concurrent reclaimers gets so high that even equal
> > distribution makes the average distance between the slots of one
> > reclaimer too wide for optimistic swap-in to compensate.
> >
> > But right now, one reclaimer can take much longer than another one
> > because its pages are mapped into more page tables and it has thus
> > more work to do and the faster reclaimer will allocate multiple swap
> > slots between two slot allocations of the slower one.
> >
> > This patch makes shrink_page_list() allocate swap slots in batches,
> > collecting all the anonymous memory pages in a list without
> > rescheduling and actual reclaim in between. And only after all anon
> > pages are swap cached, unmap and write-out starts for them.
> >
> > While this does not fix the fundamental issue of slot distribution
> > increasing with reclaimers, it mitigates the problem by balancing the
> > resulting fragmentation equally between the allocators.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > ---
> > mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> > 1 files changed, 41 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 70092fa..b3823fe 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > enum pageout_io sync_writeback)
> > {
> > LIST_HEAD(ret_pages);
> > + LIST_HEAD(swap_pages);
> > struct pagevec freed_pvec;
> > - int pgactivate = 0;
> > + int pgactivate = 0, restart = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> >
> > pagevec_init(&freed_pvec, 1);
> > +restart:
> > while (!list_empty(page_list)) {
> > struct address_space *mapping;
> > struct page *page;
> > int may_enter_fs;
> > int referenced;
> >
> > - cond_resched();
> > + if (list_empty(&swap_pages))
> > + cond_resched();
> >
> Why this ?

It shouldn't schedule anymore when it's allocated the first swap slot.
Another reclaimer could e.g. sleep on the cond_resched() before the
loop and when we schedule while having swap slots allocated, we might
continue further allocations multiple slots ahead.

> > page = lru_to_page(page_list);
> > list_del(&page->lru);
> >
> > + if (restart) {
> > + /*
> > + * We are allowed to do IO when we restart for
> > + * swap pages.
> > + */
> > + may_enter_fs = 1;
> > + /*
> > + * Referenced pages will be sorted out by
> > + * try_to_unmap() and unmapped (anon!) pages
> > + * are not to be referenced anymore.
> > + */
> > + referenced = 0;
> > + goto reclaim;
> > + }
> > +
> > if (!trylock_page(page))
> > goto keep;
> >
> Keeping multiple pages locked while they stay on private list ?

Yeah, it's a bit suboptimal but I don't see a way around it.

> BTW, isn't it better to add "allocate multiple swap space at once" function
> like
> - void get_swap_pages(nr, swp_entry_array[])
> ? "nr" will not be bigger than SWAP_CLUSTER_MAX.

It will sometimes be, see __zone_reclaim().

I had such a function once. The interesting part is: how and when do
you call it? If you drop the page lock in between, you need to redo
the checks for unevictability and whether the page has become mapped
etc.

You also need to have the pages in swap cache as soon as possible or
optimistic swap-in will 'steal' your swap slots. See add_to_swap()
when the cache radix tree says -EEXIST.

2009-04-21 09:25:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Tue, 21 Apr 2009 10:52:31 +0200
Johannes Weiner <[email protected]> wrote:

> > Keeping multiple pages locked while they stay on private list ?
>
> Yeah, it's a bit suboptimal but I don't see a way around it.
>
Hmm, seems to increase stale swap cache dramatically under memcg ;)

> > BTW, isn't it better to add "allocate multiple swap space at once" function
> > like
> > - void get_swap_pages(nr, swp_entry_array[])
> > ? "nr" will not be bigger than SWAP_CLUSTER_MAX.
>
> It will sometimes be, see __zone_reclaim().
>
Hm ? If I read the code correctly, __zone_reclaim() just call shrink_zone() and
"nr" to shrink_page_list() is SWAP_CLUSTER_MAX, at most.

> I had such a function once. The interesting part is: how and when do
> you call it? If you drop the page lock in between, you need to redo
> the checks for unevictability and whether the page has become mapped
> etc.
>
> You also need to have the pages in swap cache as soon as possible or
> optimistic swap-in will 'steal' your swap slots. See add_to_swap()
> when the cache radix tree says -EEXIST.
>

If I was you, modify "offset" calculation of
get_swap_pages()
-> scan_swap_map()
to allow that a cpu tends to find countinous swap page cluster.
Too difficult ?

Regards,
-Kame

2009-04-21 09:27:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

> > > - cond_resched();
> > > + if (list_empty(&swap_pages))
> > > + cond_resched();
> > >
> > Why this ?
>
> It shouldn't schedule anymore when it's allocated the first swap slot.
> Another reclaimer could e.g. sleep on the cond_resched() before the
> loop and when we schedule while having swap slots allocated, we might
> continue further allocations multiple slots ahead.

Oops, It seems regression. this cond_resched() intent to

cond_resched();
pageout();
cond_resched();
pageout();
cond_resched();
pageout();


2009-04-21 09:40:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Tue, Apr 21, 2009 at 06:27:08PM +0900, KOSAKI Motohiro wrote:
> > > > - cond_resched();
> > > > + if (list_empty(&swap_pages))
> > > > + cond_resched();
> > > >
> > > Why this ?
> >
> > It shouldn't schedule anymore when it's allocated the first swap slot.
> > Another reclaimer could e.g. sleep on the cond_resched() before the
> > loop and when we schedule while having swap slots allocated, we might
> > continue further allocations multiple slots ahead.
>
> Oops, It seems regression. this cond_resched() intent to
>
> cond_resched();
> pageout();
> cond_resched();
> pageout();
> cond_resched();
> pageout();

It still does that. While it collects swap pages (swap_pages list is
non-empty), it doesn't page out. And if it restarts for unmap and
page-out, the swap_pages list is empty and cond_resched() is called.

2009-04-21 09:41:30

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

> On Tue, Apr 21, 2009 at 06:27:08PM +0900, KOSAKI Motohiro wrote:
> > > > > - cond_resched();
> > > > > + if (list_empty(&swap_pages))
> > > > > + cond_resched();
> > > > >
> > > > Why this ?
> > >
> > > It shouldn't schedule anymore when it's allocated the first swap slot.
> > > Another reclaimer could e.g. sleep on the cond_resched() before the
> > > loop and when we schedule while having swap slots allocated, we might
> > > continue further allocations multiple slots ahead.
> >
> > Oops, It seems regression. this cond_resched() intent to
> >
> > cond_resched();
> > pageout();
> > cond_resched();
> > pageout();
> > cond_resched();
> > pageout();
>
> It still does that. While it collects swap pages (swap_pages list is
> non-empty), it doesn't page out. And if it restarts for unmap and
> page-out, the swap_pages list is empty and cond_resched() is called.

Ah, ok.

2009-04-21 09:56:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Tue, Apr 21, 2009 at 06:23:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 21 Apr 2009 10:52:31 +0200
> Johannes Weiner <[email protected]> wrote:
>
> > > Keeping multiple pages locked while they stay on private list ?
> >
> > Yeah, it's a bit suboptimal but I don't see a way around it.
> >
> Hmm, seems to increase stale swap cache dramatically under memcg ;)

Hmpf, not good.

> > > BTW, isn't it better to add "allocate multiple swap space at once" function
> > > like
> > > - void get_swap_pages(nr, swp_entry_array[])
> > > ? "nr" will not be bigger than SWAP_CLUSTER_MAX.
> >
> > It will sometimes be, see __zone_reclaim().
> >
> Hm ? If I read the code correctly, __zone_reclaim() just call shrink_zone() and
> "nr" to shrink_page_list() is SWAP_CLUSTER_MAX, at most.

shrink_zone() and shrink_inactive_list() use whatever is set in
sc->swap_cluster_max and for __zone_reclaim() this is:

.swap_cluster_max = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX)

SWAP_CLUSTER_MAX is 32 (2^5), so if you have an order 6 allocation
doing reclaim, you end up with sc->swap_cluster_max == 64 already.
Not common, but it happens.

> > I had such a function once. The interesting part is: how and when do
> > you call it? If you drop the page lock in between, you need to redo
> > the checks for unevictability and whether the page has become mapped
> > etc.
> >
> > You also need to have the pages in swap cache as soon as possible or
> > optimistic swap-in will 'steal' your swap slots. See add_to_swap()
> > when the cache radix tree says -EEXIST.
> >
>
> If I was you, modify "offset" calculation of
> get_swap_pages()
> -> scan_swap_map()
> to allow that a cpu tends to find countinous swap page cluster.
> Too difficult ?

This goes in the direction of extent-based allocations. I tried that
once by providing every reclaimer with a cookie that is passed in for
swap allocations and used to find per-reclaimer offsets.

Something went wrong, I can not quite remember. Will have another
look at this.

2009-04-22 20:07:28

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 2/3][rfc] swap: try to reuse freed slots in the allocation area

On Mon, 20 Apr 2009, Johannes Weiner wrote:

> A swap slot for an anonymous memory page might get freed again just
> after allocating it when further steps in the eviction process fail.
>
> But the clustered slot allocation will go ahead allocating after this
> now unused slot, leaving a hole at this position. Holes waste space
> and act as a boundary for optimistic swap-in.
>
> To avoid this, check if the next page to be swapped out can sensibly
> be placed at this just freed position. And if so, point the next
> cluster offset to it.
>
> The acceptable 'look-back' distance is the number of slots swap-in
> clustering uses as well so that the latter continues to get related
> context when reading surrounding swap slots optimistically.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Rik van Riel <[email protected]>

I'm glad you're looking into this area, thank you.
I've a feeling that you're going to come up with something good
here, but that neither of these patches (2/3 and 3/3) is yet it.

This patch looks plausible, but I'm not persuaded by it.

I wonder what contribution it made to the impressive figures in
your testing - I suspect none, that it barely exercised this path.

I worry that by jumping back to use the slot in this way, you're
actually propagating the glitch: by which I mean, if the pages are
all as nicely linear as you're supposing, then now one of them
will get placed out of sequence, unlike with the existing code.

And note that swapin's page_cluster is used in a strictly aligned
way (unlike swap allocation's SWAPFILE_CLUSTER): if you're going
to use page_cluster to bound this, then perhaps you should be
aligning too. Perhaps, perhaps not.

If this patch is worthwhile, then don't you want also to be
removing the " && vm_swap_full()" test from vmscan.c, where
shrink_page_list() activate_locked does try_to_free_swap(page)?

But bigger And/Or: you remark that "holes act as a boundary for
optimistic swap-in". Maybe that's more worth attacking? I think
that behaviour is dictated purely by the convenience of a simple
offset:length interface between swapfile.c's valid_swaphandles()
and swap_state.c's swapin_readahead().

If swapin readahead is a good thing (I tend to be pessimistic about
it: think it's worth reading several pages while the disk head is
there, but hold no great hopes that the other pages will be useful -
though when I've experimented with removing, it's certainly proved
to be of some value), then I think you'd do better to restructure
that interface, so as not to stop at the holes.

Hugh

> ---
> mm/swapfile.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 312fafe..fc88278 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -484,6 +484,15 @@ static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
> p->lowest_bit = offset;
> if (offset > p->highest_bit)
> p->highest_bit = offset;
> + /*
> + * If the next allocation is only some slots
> + * ahead, reuse this now free slot instead of
> + * leaving a hole.
> + */
> + if (p->cluster_next - offset <= 1 << page_cluster) {
> + p->cluster_next = offset;
> + p->cluster_nr++;
> + }
> if (p->prio > swap_info[swap_list.next].prio)
> swap_list.next = p - swap_info;
> nr_swap_pages++;
> --
> 1.6.2.1.135.gde769

2009-04-22 20:38:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Mon, 20 Apr 2009, Johannes Weiner wrote:

> Every swap slot allocation tries to be subsequent to the previous one
> to help keeping the LRU order of anon pages intact when they are
> swapped out.
>
> With an increasing number of concurrent reclaimers, the average
> distance between two subsequent slot allocations of one reclaimer
> increases as well. The contiguous LRU list chunks each reclaimer
> swaps out get 'multiplexed' on the swap space as they allocate the
> slots concurrently.
>
> 2 processes isolating 15 pages each and allocating swap slots
> concurrently:
>
> #0 #1
>
> page 0 slot 0 page 15 slot 1
> page 1 slot 2 page 16 slot 3
> page 2 slot 4 page 17 slot 5
> ...
>
> -> average slot distance of 2
>
> All reclaimers being equally fast, this becomes a problem when the
> total number of concurrent reclaimers gets so high that even equal
> distribution makes the average distance between the slots of one
> reclaimer too wide for optimistic swap-in to compensate.
>
> But right now, one reclaimer can take much longer than another one
> because its pages are mapped into more page tables and it has thus
> more work to do and the faster reclaimer will allocate multiple swap
> slots between two slot allocations of the slower one.
>
> This patch makes shrink_page_list() allocate swap slots in batches,
> collecting all the anonymous memory pages in a list without
> rescheduling and actual reclaim in between. And only after all anon
> pages are swap cached, unmap and write-out starts for them.
>
> While this does not fix the fundamental issue of slot distribution
> increasing with reclaimers, it mitigates the problem by balancing the
> resulting fragmentation equally between the allocators.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>

You're right to be thinking along these lines, and probing for
improvements to be made here, but I don't think this patch is
what we want.

Its spaghetti just about defeated me. If it were what we wanted,
I think it ought to be restructured. Thanks to KAMEZAWA-san for
pointing out the issue of multiple locked pages, I'm not keen on
that either. And I don't like the
> + if (list_empty(&swap_pages))
> + cond_resched();
because that kind of thing only makes a difference on !CONFIG_PREEMPT
(which may cover most distros, but still seems regrettable).

Your testing looked good, but wasn't it precisely the test that
would be improved by these changes? Linear touching, some memory
pressure chaos, then repeated linear touching.

I think you're placing too much emphasis on the expectation that
the pages which come off the bottom of the LRU are linear and
belonging to a single object. Isn't it more realistic that
they'll come from scattered locations within independent objects
of different lifetimes? Or, the single linear without the chaos.

There may well be changes you can make here to reflect that better,
yet still keep your advantage in the exceptional case that there's
just the one linear.

An experiment I've never made, maybe you'd like to try, is to have
a level of indirection between the swap entries inserted into ptes
and the actual offsets on swap: assigning the actual offset on swap
at the last moment in swap_writepage, so the writes are in sequence
and merged at the block layer (whichever CPU they come from). Whether
swapins will be bunched together we cannot know, but we do know that
bunching the writes together should pay off (both on HDD and SSD).

Hugh

> ---
> mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70092fa..b3823fe 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> enum pageout_io sync_writeback)
> {
> LIST_HEAD(ret_pages);
> + LIST_HEAD(swap_pages);
> struct pagevec freed_pvec;
> - int pgactivate = 0;
> + int pgactivate = 0, restart = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
>
> pagevec_init(&freed_pvec, 1);
> +restart:
> while (!list_empty(page_list)) {
> struct address_space *mapping;
> struct page *page;
> int may_enter_fs;
> int referenced;
>
> - cond_resched();
> + if (list_empty(&swap_pages))
> + cond_resched();
>
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> + if (restart) {
> + /*
> + * We are allowed to do IO when we restart for
> + * swap pages.
> + */
> + may_enter_fs = 1;
> + /*
> + * Referenced pages will be sorted out by
> + * try_to_unmap() and unmapped (anon!) pages
> + * are not to be referenced anymore.
> + */
> + referenced = 0;
> + goto reclaim;
> + }
> +
> if (!trylock_page(page))
> goto keep;
>
> @@ -655,14 +673,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> * Anonymous process memory has backing store?
> * Try to allocate it some swap space here.
> */
> - if (PageAnon(page) && !PageSwapCache(page)) {
> - if (!(sc->gfp_mask & __GFP_IO))
> - goto keep_locked;
> - if (!add_to_swap(page))
> - goto activate_locked;
> - may_enter_fs = 1;
> + if (PageAnon(page)) {
> + if (!PageSwapCache(page)) {
> + if (!(sc->gfp_mask & __GFP_IO))
> + goto keep_locked;
> + if (!add_to_swap(page))
> + goto activate_locked;
> + } else if (!may_enter_fs)
> + /*
> + * It's no use to batch when we are
> + * not allocating swap for this GFP
> + * mask.
> + */
> + goto reclaim;
> + list_add(&page->lru, &swap_pages);
> + continue;
> }
>
> + reclaim:
> mapping = page_mapping(page);
>
> /*
> @@ -794,6 +822,11 @@ keep:
> list_add(&page->lru, &ret_pages);
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
> + if (!list_empty(&swap_pages)) {
> + list_splice_init(&swap_pages, page_list);
> + restart = 1;
> + goto restart;
> + }
> list_splice(&ret_pages, page_list);
> if (pagevec_count(&freed_pvec))
> __pagevec_free(&freed_pvec);
> --
> 1.6.2.1.135.gde769

2009-04-27 07:48:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/3][rfc] vmscan: batched swap slot allocation

On Wed, Apr 22, 2009 at 09:37:09PM +0100, Hugh Dickins wrote:
> On Mon, 20 Apr 2009, Johannes Weiner wrote:
>
> > Every swap slot allocation tries to be subsequent to the previous one
> > to help keeping the LRU order of anon pages intact when they are
> > swapped out.
> >
> > With an increasing number of concurrent reclaimers, the average
> > distance between two subsequent slot allocations of one reclaimer
> > increases as well. The contiguous LRU list chunks each reclaimer
> > swaps out get 'multiplexed' on the swap space as they allocate the
> > slots concurrently.
> >
> > 2 processes isolating 15 pages each and allocating swap slots
> > concurrently:
> >
> > #0 #1
> >
> > page 0 slot 0 page 15 slot 1
> > page 1 slot 2 page 16 slot 3
> > page 2 slot 4 page 17 slot 5
> > ...
> >
> > -> average slot distance of 2
> >
> > All reclaimers being equally fast, this becomes a problem when the
> > total number of concurrent reclaimers gets so high that even equal
> > distribution makes the average distance between the slots of one
> > reclaimer too wide for optimistic swap-in to compensate.
> >
> > But right now, one reclaimer can take much longer than another one
> > because its pages are mapped into more page tables and it has thus
> > more work to do and the faster reclaimer will allocate multiple swap
> > slots between two slot allocations of the slower one.
> >
> > This patch makes shrink_page_list() allocate swap slots in batches,
> > collecting all the anonymous memory pages in a list without
> > rescheduling and actual reclaim in between. And only after all anon
> > pages are swap cached, unmap and write-out starts for them.
> >
> > While this does not fix the fundamental issue of slot distribution
> > increasing with reclaimers, it mitigates the problem by balancing the
> > resulting fragmentation equally between the allocators.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
>
> You're right to be thinking along these lines, and probing for
> improvements to be made here, but I don't think this patch is
> what we want.
>
> Its spaghetti just about defeated me. If it were what we wanted,
> I think it ought to be restructured. Thanks to KAMEZAWA-san for
> pointing out the issue of multiple locked pages, I'm not keen on
> that either. And I don't like the
> > + if (list_empty(&swap_pages))
> > + cond_resched();
> because that kind of thing only makes a difference on !CONFIG_PREEMPT
> (which may cover most distros, but still seems regrettable).
>
> Your testing looked good, but wasn't it precisely the test that
> would be improved by these changes? Linear touching, some memory
> pressure chaos, then repeated linear touching.

Agreed, it was. I started to play around with qsbench per Andrew's
suggestion but stopped now and went back to the scratch pad. I agree
that these patches are not the solution.

> I think you're placing too much emphasis on the expectation that
> the pages which come off the bottom of the LRU are linear and
> belonging to a single object. Isn't it more realistic that
> they'll come from scattered locations within independent objects
> of different lifetimes? Or, the single linear without the chaos.

Oh that is certainly great input, thank you!

> There may well be changes you can make here to reflect that better,
> yet still keep your advantage in the exceptional case that there's
> just the one linear.
>
> An experiment I've never made, maybe you'd like to try, is to have
> a level of indirection between the swap entries inserted into ptes
> and the actual offsets on swap: assigning the actual offset on swap
> at the last moment in swap_writepage, so the writes are in sequence
> and merged at the block layer (whichever CPU they come from). Whether
> swapins will be bunched together we cannot know, but we do know that
> bunching the writes together should pay off (both on HDD and SSD).

I thought about indirect ptes as well but wasn't sure about it and
hoped we could get away with less invasive changes. It might not be
the case. Thanks for poking in that direction, I will see what I can
come up with.

Hannes

2009-04-27 08:04:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/3][rfc] swap: try to reuse freed slots in the allocation area

On Wed, Apr 22, 2009 at 08:59:06PM +0100, Hugh Dickins wrote:
> On Mon, 20 Apr 2009, Johannes Weiner wrote:
>
> > A swap slot for an anonymous memory page might get freed again just
> > after allocating it when further steps in the eviction process fail.
> >
> > But the clustered slot allocation will go ahead allocating after this
> > now unused slot, leaving a hole at this position. Holes waste space
> > and act as a boundary for optimistic swap-in.
> >
> > To avoid this, check if the next page to be swapped out can sensibly
> > be placed at this just freed position. And if so, point the next
> > cluster offset to it.
> >
> > The acceptable 'look-back' distance is the number of slots swap-in
> > clustering uses as well so that the latter continues to get related
> > context when reading surrounding swap slots optimistically.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Rik van Riel <[email protected]>
>
> I'm glad you're looking into this area, thank you.
> I've a feeling that you're going to come up with something good
> here, but that neither of these patches (2/3 and 3/3) is yet it.
>
> This patch looks plausible, but I'm not persuaded by it.
>
> I wonder what contribution it made to the impressive figures in
> your testing - I suspect none, that it barely exercised this path.
>
> I worry that by jumping back to use the slot in this way, you're
> actually propagating the glitch: by which I mean, if the pages are
> all as nicely linear as you're supposing, then now one of them
> will get placed out of sequence, unlike with the existing code.
>
> And note that swapin's page_cluster is used in a strictly aligned
> way (unlike swap allocation's SWAPFILE_CLUSTER): if you're going
> to use page_cluster to bound this, then perhaps you should be
> aligning too. Perhaps, perhaps not.

Thank you, will think about that.

> If this patch is worthwhile, then don't you want also to be
> removing the " && vm_swap_full()" test from vmscan.c, where
> shrink_page_list() activate_locked does try_to_free_swap(page)?

I fear this swap releasing there can fail quite easily anyway. At
least that is what my testing patches suggest - we hit quite a lot of
already swap cached pages in shrink_page_list() and I think that is
where they come from. It's a different issue, though.

> But bigger And/Or: you remark that "holes act as a boundary for
> optimistic swap-in". Maybe that's more worth attacking? I think
> that behaviour is dictated purely by the convenience of a simple
> offset:length interface between swapfile.c's valid_swaphandles()
> and swap_state.c's swapin_readahead().
>
> If swapin readahead is a good thing (I tend to be pessimistic about
> it: think it's worth reading several pages while the disk head is
> there, but hold no great hopes that the other pages will be useful -
> though when I've experimented with removing, it's certainly proved
> to be of some value), then I think you'd do better to restructure
> that interface, so as not to stop at the holes.

Hm, let's try that. I am thinking of extending valid_swap_handles()
to return exactly those through a bitmap that can represent holes.

I think the read-in makes sense but not when the system is already
thrashing. Then it will just use memory for data we are not sure of
being needed at all. Perhaps it should be throttled or disabled at
some point.

Hugh, thanks a lot for your great feedback.

Hannes