2013-03-27 02:22:18

by Minchan Kim

Subject: [RFC] mm: remove swapcache page early

The swap subsystem frees swap slots lazily, expecting that the page
will be swapped out again, so we can avoid an unnecessary write.

But the problem with in-memory swap is that the slot keeps consuming memory
until the vm_swap_full() condition (i.e., half of all swap space is in use)
is met. That can be bad if we use multiple swap devices (a small in-memory
swap plus a big storage swap) or in-memory swap alone.

This patch changes the vm_swap_full() logic slightly so it can free
the swap slot early if the backing device is really fast.
For that I used SWP_SOLIDSTATE, but it might be controversial,
so I am Ccing Shaohua and Hugh.
If it's a problem for SSDs, I'd like to create a new type, SWP_INMEMORY
or something similar, for the z* family.
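
A minimal userspace sketch of that check, not the kernel code: the `model_` names, the flag values, and `MODEL_SWP_INMEMORY` are assumptions made for illustration. A device carrying a "fast" flag is always treated as full; otherwise the existing half-used rule applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model flags; the values are arbitrary and do not match the kernel's. */
#define MODEL_SWP_SOLIDSTATE (1u << 0) /* stands in for SWP_SOLIDSTATE */
#define MODEL_SWP_INMEMORY   (1u << 1) /* hypothetical flag from this mail */

struct model_swap_info {
	unsigned int flags;
};

/* Stand-ins for the global counters nr_swap_pages and total_swap_pages. */
static long model_nr_swap_pages;    /* free swap slots */
static long model_total_swap_pages; /* all swap slots */

/*
 * True if the device carries one of the flags in fast_mask, or if more
 * than half of all swap slots are in use (the existing rule).
 */
static bool model_vm_swap_full(const struct model_swap_info *si,
			       unsigned int fast_mask)
{
	assert(si != NULL);
	if (si->flags & fast_mask)
		return true;
	return model_nr_swap_pages * 2 < model_total_swap_pages;
}
```

Passing `MODEL_SWP_SOLIDSTATE` as `fast_mask` models this patch; passing `MODEL_SWP_INMEMORY` models the alternative of a dedicated flag for the z* family.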

Another problem is that zram is a block device, so it can set SWP_INMEMORY
or SWP_SOLIDSTATE easily (in fact, zram already does this), but
I have no idea how to do that for frontswap.

Any idea?

Another possible optimization is to remove the swapcache page
unconditionally when we find it is exclusive at swap-in time.
That could help the frontswap family, too.
What do you think about it?

Cc: Hugh Dickins <[email protected]>
Cc: Dan Magenheimer <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Shaohua Li <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 11 ++++++++---
mm/memory.c | 3 ++-
mm/swapfile.c | 11 +++++++----
mm/vmscan.c | 2 +-
4 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..1f4df66 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;

-/* Swap 50% full? Release swapcache more aggressively.. */
-static inline bool vm_swap_full(void)
+/*
+ * Swap 50% full or fast backed device?
+ * Release swapcache more aggressively.
+ */
+static inline bool vm_swap_full(struct swap_info_struct *si)
{
+ if (si->flags & SWP_SOLIDSTATE)
+ return true;
return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
}

@@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
#define get_nr_swap_pages() 0L
#define total_swap_pages 0L
#define total_swapcache_pages() 0UL
-#define vm_swap_full() 0
+#define vm_swap_full(si) 0

#define si_swapinfo(val) \
do { (val)->freeswap = (val)->totalswap = 0; } while (0)
diff --git a/mm/memory.c b/mm/memory.c
index 705473a..1ca21a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
mem_cgroup_commit_charge_swapin(page, ptr);

swap_free(entry);
- if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+ if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page))
+ || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1bee6fa..f9cc701 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -293,7 +293,7 @@ checks:
scan_base = offset = si->lowest_bit;

/* reuse swap entry of cache-only swap if not busy. */
- if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
spin_unlock(&si->lock);
swap_was_freed = __try_to_reclaim_swap(si, offset);
@@ -382,7 +382,8 @@ scan:
spin_lock(&si->lock);
goto checks;
}
- if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ if (vm_swap_full(si) &&
+ si->swap_map[offset] == SWAP_HAS_CACHE) {
spin_lock(&si->lock);
goto checks;
}
@@ -397,7 +398,8 @@ scan:
spin_lock(&si->lock);
goto checks;
}
- if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ if (vm_swap_full(si) &&
+ si->swap_map[offset] == SWAP_HAS_CACHE) {
spin_lock(&si->lock);
goto checks;
}
@@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
* Also recheck PageSwapCache now page is locked (above).
*/
if (PageSwapCache(page) && !PageWriteback(page) &&
- (!page_mapped(page) || vm_swap_full())) {
+ (!page_mapped(page) ||
+ vm_swap_full(page_swap_info(page)))) {
delete_from_swap_cache(page);
SetPageDirty(page);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df78d17..145c59c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -933,7 +933,7 @@ cull_mlocked:

activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
- if (PageSwapCache(page) && vm_swap_full())
+ if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
try_to_free_swap(page);
VM_BUG_ON(PageActive(page));
SetPageActive(page);
--
1.8.2


2013-03-27 05:03:29

by Kyungmin Park

Subject: Re: [RFC] mm: remove swapcache page early

Hi,

On Wed, Mar 27, 2013 at 11:22 AM, Minchan Kim <[email protected]> wrote:
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.
> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
I prefer adding a new SWP_INMEMORY for the z* family. As you know, SSD and
memory have different characteristics,
and if a new type is added, it doesn't require modifying a lot of code.

Do you have any data for it? Do you get a meaningful performance gain or
better efficiency for the z* family? If yes, please share it.

Thank you,
Kyungmin Park

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2013-03-27 05:16:18

by Kamezawa Hiroyuki

Subject: Re: [RFC] mm: remove swapcache page early

(2013/03/27 11:22), Minchan Kim wrote:
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.
> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
>
> Other problem is zram is block device so that it can set SWP_INMEMORY
> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> I have no idea to use it for frontswap.
>
> Any idea?
>
Another thought... in what case, in what system configuration,
should vm_swap_full() return false and delay swp_entry freeing?

Thanks,
-Kame

2013-03-27 07:05:52

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hi Kame,

On Wed, Mar 27, 2013 at 02:15:41PM +0900, Kamezawa Hiroyuki wrote:
> (2013/03/27 11:22), Minchan Kim wrote:
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
> >
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
> >
> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
> > So let's add Ccing Shaohua and Hugh.
> > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> > or something for z* family.
> >
> > Other problem is zram is block device so that it can set SWP_INMEMORY
> > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> > I have no idea to use it for frontswap.
> >
> > Any idea?
> >
> Another thinking....in what case, in what system configuration,
> vm_swap_full() should return false and delay swp_entry freeing ?

It's a really good question that I have had in mind for a long time.
If I catch your point properly, your question is "Couldn't we remove
the vm_swap_full logic?"

If so, the answer is "I have no idea and would like to ask
Hugh".

Academically, it does make sense that a swapped-out page is unlikely to be
in the working set, so it could be swapped out again, and I believe the
logic was merged because we had a workload that it improved
at that time.

And I think it's not easy to prove it's useless these days because
I can't cover all the recent workloads in the world, so I'd like to
avoid such an adventure. :)

Thanks.

>
> Thanks,
> -Kame
>
>

--
Kind regards,
Minchan Kim

2013-03-27 17:21:10

by Seth Jennings

Subject: Re: [RFC] mm: remove swapcache page early

On 03/26/2013 09:22 PM, Minchan Kim wrote:
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.

Great idea!

> For it, I used SWP_SOLIDSTATE but It might be controversial.

The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just
because seeks are cheap doesn't mean the read itself is also cheap.
For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of
them can be pretty slow.

> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.

Afaict, setting SWP_SOLIDSTATE depends on characteristics of the
underlying block device (i.e. blk_queue_nonrot()). zram is a block
device but zcache and zswap are not.
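
To illustrate, here is a rough userspace model of that dependency. The struct and function names are made up for this sketch; only the rule "flag follows the block device's nonrot property" comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SWP_SOLIDSTATE (1u << 0) /* stands in for SWP_SOLIDSTATE */

/* Hypothetical stand-in for a block device's request queue. */
struct model_bdev {
	bool nonrot; /* what blk_queue_nonrot() would report */
};

struct model_swap_info {
	unsigned int flags;
};

/*
 * Derive the flag purely from the block device property (assumed here
 * to happen once, when the swap area is enabled).
 */
static void model_set_solidstate(struct model_swap_info *si,
				 const struct model_bdev *bdev)
{
	assert(si != NULL);
	si->flags = 0;
	if (bdev != NULL && bdev->nonrot)
		si->flags |= MODEL_SWP_SOLIDSTATE;
}
```

In this model zram and a slow mmc device end up with the same flag, while a rotating disk with zcache or zswap in front of it gets no flag at all.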

Any idea by what criteria SWP_INMEMORY would be set?

Also, frontswap backends (zcache and zswap) are a caching layer on top
of the real swap device, which might actually be rotating media. So
you have the issue of two different characteristics, in-memory caching
on top of rotating media, present in a single swap device.

Thanks,
Seth

2013-03-27 21:20:47

by Dan Magenheimer

Subject: RE: [RFC] mm: remove swapcache page early

> From: Minchan Kim [mailto:[email protected]]
> Subject: [RFC] mm: remove swapcache page early
>
> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.
>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.
> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
>
> Other problem is zram is block device so that it can set SWP_INMEMORY
> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> I have no idea to use it for frontswap.
>
> Any idea?
>
> Other optimize point is we remove it unconditionally when we
> found it's exclusive when swap in happen.
> It could help frontswap family, too.

By passing a struct page * to vm_swap_full() you can then call
frontswap_test()... if it returns true, then vm_swap_full
can return true. Note that this precisely checks whether
the page is in zcache/zswap or not, so Seth's concern that
some pages may be in-memory and some may be in rotating
storage is no longer an issue.

> What do you think about it?

By removing the page from swapcache, you are now increasing
the risk that pages will "thrash" between uncompressed state
(in swapcache) and compressed state (in z*). I think this is
a better tradeoff though than keeping a copy of both the
compressed page AND the uncompressed page in memory.

You should probably rename vm_swap_full() because you are
now overloading it with other meanings. Maybe
vm_swap_reclaimable()?
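
A rough userspace sketch of what I mean, under stated assumptions: the per-slot table stands in for whatever frontswap_test() consults, the counters are kept per device only to make the model self-contained, and all `model_` names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 64

struct model_swap_info {
	bool in_frontswap[MODEL_SLOTS]; /* one entry per swap slot */
	long free_slots;
	long total_slots;
};

/* A swapcache page, reduced to the swap slot it is tied to. */
struct model_page {
	const struct model_swap_info *si;
	size_t offset;
};

/* Table lookup: is this slot's data held by the in-memory backend? */
static bool model_frontswap_test(const struct model_swap_info *si,
				 size_t offset)
{
	assert(si != NULL && offset < MODEL_SLOTS);
	return si->in_frontswap[offset];
}

/* Per-page decision: in-memory duplicate, or swap more than half used. */
static bool model_vm_swap_reclaimable(const struct model_page *page)
{
	assert(page != NULL);
	if (model_frontswap_test(page->si, page->offset))
		return true;
	return page->si->free_slots * 2 < page->si->total_slots;
}
```

Because the decision is made per slot, two pages on the same swap device can get different answers, which is the point of testing the page rather than the device.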

Do you have any measurements? I think you are correct
that it may help a LOT.

Thanks,
Dan

2013-03-27 21:41:30

by Hugh Dickins

Subject: Re: [RFC] mm: remove swapcache page early

On Wed, 27 Mar 2013, Minchan Kim wrote:

> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.
so we can avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.

That is a very good realization: it's surprising that none of us
thought of it before - no disrespect to you, well done, thank you.

And I guess swap readahead is utterly unhelpful in this case too.

>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.

But I strongly disagree with almost everything in your patch :)
I disagree with addressing it in vm_swap_full(), I disagree that
it can be addressed by device, I disagree that it has anything to
do with SWP_SOLIDSTATE.

This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
is it? In those cases, a fixed amount of memory has been set aside
for swap, and it works out just like with disk block devices. The
memory set aside may be wasted, but that is accepted upfront.

Similarly, this is not a problem with swapping to SSD. There might
or might not be other reasons for adjusting the vm_swap_full() logic
for SSD or generally, but those have nothing to do with this issue.

The problem here is peculiar to frontswap, and the variably sized
memory behind it, isn't it? We are accustomed to using swap to free
up memory by transferring its data to some other, cheaper but slower
resource.

But in the case of frontswap and zmem (I'll say that to avoid thinking
through which backends are actually involved), it is not a cheaper and
slower resource, but the very same memory we are trying to save: swap
is stolen from the memory under reclaim, so any duplication becomes
counter-productive (if we ignore cpu compression/decompression costs:
I have no idea how fair it is to do so, but anyone who chooses zmem
is prepared to pay some cpu price for that).

And because it's a frontswap thing, we cannot decide this by device:
frontswap may or may not stand in front of each device. There is no
problem with swapcache duplicated on disk (until that area approaches
being full or fragmented), but at the higher level we cannot see what
is in zmem and what is on disk: we only want to free up the zmem dup.

I believe the answer is for frontswap/zmem to invalidate the frontswap
copy of the page (to free up the compressed memory when possible) and
SetPageDirty on the PageUptodate PageSwapCache page when swapping in
(setting page dirty so nothing will later go to read it from the
unfreed location on backing swap disk, which was never written).
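
As a sketch only, in a userspace model with invented `model_` names and page flags reduced to booleans, that sequence would look something like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where the data for one swap slot currently lives (model only). */
enum model_where { MODEL_NOWHERE, MODEL_ON_DISK, MODEL_IN_ZMEM };

struct model_slot {
	enum model_where where;
};

/* The three page flags that matter here, as plain booleans. */
struct model_page {
	bool uptodate;
	bool swapcache;
	bool dirty;
};

/* Swap-in: fill the page, then drop the zmem duplicate if there is one. */
static void model_swapin(struct model_slot *slot, struct model_page *page)
{
	assert(slot != NULL && page != NULL);
	assert(slot->where != MODEL_NOWHERE);

	page->uptodate = true;
	page->swapcache = true;
	page->dirty = false;

	if (slot->where == MODEL_IN_ZMEM) {
		slot->where = MODEL_NOWHERE; /* compressed copy freed */
		page->dirty = true;          /* disk slot was never written */
	}
}

/* Reclaim may drop a swapcache page without I/O only if it is clean. */
static bool model_reclaim_needs_write(const struct model_page *page)
{
	assert(page != NULL);
	return page->swapcache && page->dirty;
}
```

A page loaded from zmem comes out dirty, so reclaim has to store it again before dropping it; a page read from disk stays clean and can still be dropped for free.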

We cannot rely on freeing the swap itself, because in general there
may be multiple references to the swap, and we only satisfy the one
which has faulted. It may or may not be a good idea to use rmap to
locate the other places to insert pte in place of swap entry, to
resolve them all at once; but we have chosen not to do so in the
past, and there's no need for that, if the zmem gets invalidated
and the swapcache page set dirty.
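
The reference problem can be seen in a simplified model of one swap_map entry. This is a userspace sketch with invented `model_` names; it only assumes that the entry holds a count of pte references plus a "has cache" bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SWAP_HAS_CACHE 0x40u /* mirrors the SWAP_HAS_CACHE bit */

/* Drop one pte reference to the slot, as a single fault does. */
static void model_swap_free(unsigned char *map_entry)
{
	assert(map_entry != NULL);
	assert((*map_entry & ~MODEL_SWAP_HAS_CACHE) > 0);
	(*map_entry)--;
}

/* Free the slot only when the swapcache holds the last reference. */
static bool model_try_to_free_swap(unsigned char *map_entry)
{
	assert(map_entry != NULL);
	if (*map_entry != MODEL_SWAP_HAS_CACHE)
		return false; /* other ptes still point at the slot */
	*map_entry = 0;
	return true;
}
```

With two references (say, after a fork), the first fault leaves the slot pinned and only the second lets it go, so the zmem copy would sit there in between unless it is invalidated separately.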

Hugh


2013-03-27 22:24:38

by Dan Magenheimer

Subject: RE: [RFC] mm: remove swapcache page early

> From: Hugh Dickins [mailto:[email protected]]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
> so we can avoid unnecessary write.
> >
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.

Yes, my compliments also Minchan. This problem has been thought of before
but this patch is the first to identify a possible solution.

> And I guess swap readahead is utterly unhelpful in this case too.

Yes... as is any "swap writeahead". Excuse my ignorance, but I
think this is not done in the swap subsystem but instead the kernel
assumes write-coalescing will be done in the block I/O subsystem,
which means swap writeahead would affect zram but not zcache/zswap
(since frontswap subverts the block I/O subsystem).

However I think a swap-readahead solution would be helpful to
zram as well as zcache/zswap.

> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.

It is (I believe) also a problem with swapping to ram. Two
copies of the same page are kept in memory in different places,
right? Fixed vs variable size is irrelevant I think. Or am
I misunderstanding something about swap-to-ram?

> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.

I think it is at least highly related. The key issue is the
tradeoff of the likelihood that the page will soon be read/written
again while it is in swap cache vs the time/resource-usage necessary
to "reconstitute" the page into swap cache. Reconstituting from disk
requires a LOT of elapsed time. Reconstituting from
an SSD likely takes much less time. Reconstituting from
zcache/zram takes thousands of CPU cycles.

> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free
> up memory by transferring its data to some other, cheaper but slower
> resource.

Frontswap does make the problem more complex because some pages
are in "fairly fast" storage (zcache, needs decompression) and
some are on the actual (usually) rotating media. Fortunately,
differentiating between these two cases is just a table lookup
(see frontswap_test).

> But in the case of frontswap and zmem (I'll say that to avoid thinking
> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).

Exactly. There is some "robbing of Peter to pay Paul" and
other complex resource tradeoffs. Presumably, though, it is
not "the very same memory we are trying to save" but a
fraction of it, saving the same page of data more efficiently
in memory, using less than a page, at some CPU cost.

> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.

I *think* frontswap_test(page) resolves this problem, as long as
we have a specific page available to use as a parameter.

> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).

There are two duplication issues: (1) When can the page be removed
from the swap cache after a call to frontswap_store; and (2) When
can the page be removed from the frontswap storage after it
has been brought back into memory via frontswap_load.

This patch from Minchan addresses (1). The issue you are raising
here is (2). You may not know that (2) has recently been solved
in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
If this is enabled (and it is for zcache but not yet for zswap),
what you suggest (SetPageDirty) is what happens.

> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated
> and the swapcache page set dirty.

I see. Minchan's patch handles the removal "reactively"... it
might be possible to handle it more proactively. Or it may
be possible to take the number of references into account when
deciding whether to frontswap_store the page as, presumably,
the likelihood of needing to "reconstitute" the page sooner increases
with each additional reference.

> Hugh

Very useful thoughts, Hugh. Thanks much and looking forward
to more discussion at LSF/MM!

Dan

P.S. When I refer to zcache, I am referring to the version in
drivers/staging/zcache in 3.9. The code in drivers/staging/zcache
in 3.8 is "old zcache"... "new zcache" is in drivers/staging/ramster
in 3.8. Sorry for any confusion...

2013-03-27 23:17:16

by Hugh Dickins

Subject: RE: [RFC] mm: remove swapcache page early

On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > From: Hugh Dickins [mailto:[email protected]]
> > Subject: Re: [RFC] mm: remove swapcache page early
> >
> > On Wed, 27 Mar 2013, Minchan Kim wrote:
> >
> > > Swap subsystem does lazy swap slot free with expecting the page
> > > would be swapped out again so we can't avoid unnecessary write.
> > so we can avoid unnecessary write.
> > >
> > > But the problem in in-memory swap is that it consumes memory space
> > > until vm_swap_full(ie, used half of all of swap device) condition
> > > meet. It could be bad if we use multiple swap device, small in-memory swap
> > > and big storage swap or in-memory swap alone.
> >
> > That is a very good realization: it's surprising that none of us
> > thought of it before - no disrespect to you, well done, thank you.
>
> Yes, my compliments also Minchan. This problem has been thought of before
> but this patch is the first to identify a possible solution.
>
> > And I guess swap readahead is utterly unhelpful in this case too.
>
> Yes... as is any "swap writeahead". Excuse my ignorance, but I
> think this is not done in the swap subsystem but instead the kernel
> assumes write-coalescing will be done in the block I/O subsystem,
> which means swap writeahead would affect zram but not zcache/zswap
> (since frontswap subverts the block I/O subsystem).

I don't know what swap writeahead is; but write coalescing, yes.
I don't see any problem with it in this context.

>
> However I think a swap-readahead solution would be helpful to
> zram as well as zcache/zswap.

Whereas swap readahead on zmem is uncompressing zmem to pagecache
which may never be needed, and may take a circuit of the inactive
LRU before it gets reclaimed (if it turns out not to be needed,
at least it will remain clean and be easily reclaimed).

>
> > > This patch changes vm_swap_full logic slightly so it could free
> > > swap slot early if the backed device is really fast.
> > > For it, I used SWP_SOLIDSTATE but It might be controversial.
> >
> > But I strongly disagree with almost everything in your patch :)
> > I disagree with addressing it in vm_swap_full(), I disagree that
> > it can be addressed by device, I disagree that it has anything to
> > do with SWP_SOLIDSTATE.
> >
> > This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> > is it? In those cases, a fixed amount of memory has been set aside
> > for swap, and it works out just like with disk block devices. The
> > memory set aside may be wasted, but that is accepted upfront.
>
> It is (I believe) also a problem with swapping to ram. Two
> copies of the same page are kept in memory in different places,
> right? Fixed vs variable size is irrelevant I think. Or am
> I misunderstanding something about swap-to-ram?

I may be misremembering how /dev/ram0 works, or simply assuming that
if you want to use it for swap (interesting for testing, but probably
not for general use), then you make sure to allocate each page of it
in advance.

The pages of /dev/ram0 don't get freed, or not before it's closed
(swapoff'ed) anyway. Yes, swapcache would be duplicating data from
other memory into /dev/ram0 memory; but that /dev/ram0 memory has
been set aside for this purpose, and removing from swapcache won't
free any more memory.

>
> > Similarly, this is not a problem with swapping to SSD. There might
> > or might not be other reasons for adjusting the vm_swap_full() logic
> > for SSD or generally, but those have nothing to do with this issue.
>
> I think it is at least highly related. The key issue is the
> tradeoff of the likelihood that the page will soon be read/written
> again while it is in swap cache vs the time/resource-usage necessary
> to "reconstitute" the page into swap cache. Reconstituting from disk
> requires a LOT of elapsed time. Reconstituting from
> an SSD likely takes much less time. Reconstituting from
> zcache/zram takes thousands of CPU cycles.

I acknowledge my complete ignorance of how to judge the tradeoff
between memory usage and cpu usage, but I think Minchan's main
concern was with the memory usage. Neither hard disk nor SSD
is occupying memory.

>
> > The problem here is peculiar to frontswap, and the variably sized
> > memory behind it, isn't it? We are accustomed to using swap to free
> > up memory by transferring its data to some other, cheaper but slower
> > resource.
>
> Frontswap does make the problem more complex because some pages
> are in "fairly fast" storage (zcache, needs decompression) and
> some are on the actual (usually) rotating media. Fortunately,
> differentiating between these two cases is just a table lookup
> (see frontswap_test).
>
> > But in the case of frontswap and zmem (I'll say that to avoid thinking
> > through which backends are actually involved), it is not a cheaper and
> > slower resource, but the very same memory we are trying to save: swap
> > is stolen from the memory under reclaim, so any duplication becomes
> > counter-productive (if we ignore cpu compression/decompression costs:
> > I have no idea how fair it is to do so, but anyone who chooses zmem
> > is prepared to pay some cpu price for that).
>
> Exactly. There is some "robbing of Peter to pay Paul" and
> other complex resource tradeoffs. Presumably, though, it is
> not "the very same memory we are trying to save" but a
> fraction of it, saving the same page of data more efficiently
> in memory, using less than a page, at some CPU cost.

Yes, I'm not saying that frontswap/zmem is pointless: just agreeing
with Minchan that in this case the duplication inherent in swapcache
can be a waste of memory that we should try to avoid.

>
> > And because it's a frontswap thing, we cannot decide this by device:
> > frontswap may or may not stand in front of each device. There is no
> > problem with swapcache duplicated on disk (until that area approaches
> > being full or fragmented), but at the higher level we cannot see what
> > is in zmem and what is on disk: we only want to free up the zmem dup.
>
> I *think* frontswap_test(page) resolves this problem, as long as
> we have a specific page available to use as a parameter.
>
> > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > copy of the page (to free up the compressed memory when possible) and
> > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > (setting page dirty so nothing will later go to read it from the
> > unfreed location on backing swap disk, which was never written).
>
> There are two duplication issues: (1) When can the page be removed
> from the swap cache after a call to frontswap_store; and (2) When
> can the page be removed from the frontswap storage after it
> has been brought back into memory via frontswap_load.
>
> This patch from Minchan addresses (1).

Ying Han was reminding me of this case a couple of hours ago; we don't
see a problem there: when frontswap_store() succeeds, there's an
end_page_writeback() as there should be, and shrink_page_list()
should reclaim the page immediately. So I think (1) is already
handled and Minchan was not trying to address it.

> The issue you are raising
> here is (2). You may not know that (2) has recently been solved
> in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> If this is enabled (and it is for zcache but not yet for zswap),
> what you suggest (SetPageDirty) is what happens.

Ah, and I have a dim, perhaps mistaken, memory that I gave you
input on that before, suggesting the SetPageDirty. Good, sounds
like the solution is already in place, if not actually activated.

Thanks, must dash,
Hugh

>
> > We cannot rely on freeing the swap itself, because in general there
> > may be multiple references to the swap, and we only satisfy the one
> > which has faulted. It may or may not be a good idea to use rmap to
> > locate the other places to insert pte in place of swap entry, to
> > resolve them all at once; but we have chosen not to do so in the
> > past, and there's no need for that, if the zmem gets invalidated
> > and the swapcache page set dirty.
>
> I see. Minchan's patch handles the removal "reactively"... it
> might be possible to handle it more proactively. Or it may
> be possible to take the number of references into account when
> deciding whether to frontswap_store the page as, presumably,
> the likelihood of needing to "reconstitute" the page sooner increases
> with each additional reference.
>
> > Hugh
>
> Very useful thoughts, Hugh. Thanks much and looking forward
> to more discussion at LSF/MM!
>
> Dan
>
> P.S. When I refer to zcache, I am referring to the version in
> drivers/staging/zcache in 3.9. The code in drivers/staging/zcache
> in 3.8 is "old zcache"... "new zcache" is in drivers/staging/ramster
> in 3.8. Sorry for any confusion...

2013-03-28 00:36:14

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hi Hugh,

On Wed, Mar 27, 2013 at 02:41:07PM -0700, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
> so we can avoid unnecessary write.
> >
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
>
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.
>
> And I guess swap readahead is utterly unhelpful in this case too.
>
> >
> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
>
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The

Brd is okay, but it seems you are misunderstanding zram.
Zram doesn't reserve any memory; it allocates memory dynamically when
swap-out happens, so the same data can end up duplicated in the pseudo
block device and in memory.
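
A rough sketch of the accounting I mean, as a toy user-space model. The backend kinds and all toy_* names are made up for illustration, the preallocated ramdisk case follows Hugh's assumption that every page is allocated in advance, and the zram case assumes the device is told when a slot is freed:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical backend kinds, for illustration only. */
enum toy_backend {
    TOY_DISK,          /* data lives on disk or SSD, not in RAM */
    TOY_PREALLOC_RAM,  /* ramdisk whose pages were all allocated up front */
    TOY_ZRAM           /* compressed copy allocated at swap-out time */
};

/* RAM handed back to the system when one swap slot is freed. */
static size_t toy_ram_freed_by_slot_free(enum toy_backend b,
                                         size_t compressed_len)
{
    assert(compressed_len <= TOY_PAGE_SIZE);
    /* Only the dynamically allocated backend gives memory back. */
    return b == TOY_ZRAM ? compressed_len : 0;
}

/*
 * RAM wasted by n pages that are back in swapcache but still own
 * their swap slots, i.e. what early slot freeing could recover.
 */
static size_t toy_reclaimable_dup_bytes(enum toy_backend b, size_t n,
                                        size_t compressed_len)
{
    return n * toy_ram_freed_by_slot_free(b, compressed_len);
}
```

For example, 1000 such pages compressed to 1024 bytes each would tie up 1024000 bytes under TOY_ZRAM and nothing recoverable under the other two kinds.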

> memory set aside may be wasted, but that is accepted upfront.
>
> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.

Yes.

>
> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free

Zram, too.

> up memory by transferring its data to some other, cheaper but slower
> resource.
>
> But in the case of frontswap and zmem (I'll say that to avoid thinking

Frankly speaking, I don't understand what you mean by "frontswap and zmem".
Frontswap is just a layer that hooks into the swap subsystem.
The real backends of frontswap are zcache and zswap at the moment.
I will read it as zcache and zswap. Okay?

> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).

Agree.

>
> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.

That's exactly my concern, and it's why I asked for ideas.

>
> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).

You mean that zcache and zswap have to do garbage collection by some
policy? That could work, but how about zram? It's just a pseudo block device
and it has no knowledge of what sits on top of it. It could be swap or a
normal block device. I mean zram has no swap information to handle this with.

>
> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated
> and the swapcache page set dirty.

Yes, it could be better, but as I mentioned above, it couldn't handle
the zram case. If there is a solution for zram, I will be happy. :)

And another point: frontswap is already woven into the swap subsystem
very tightly, so I doubt adding one more hook is really a problem.

Thanks for the great comments, Hugh!

>
> Hugh
>
> > So let's add Ccing Shaohua and Hugh.
> > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> > or something for z* family.
> >
> > Other problem is zram is block device so that it can set SWP_INMEMORY
> > or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> > I have no idea to use it for frontswap.
> >
> > Any idea?
> >
> > Other optimize point is we remove it unconditionally when we
> > found it's exclusive when swap in happen.
> > It could help frontswap family, too.
> > What do you think about it?
> >
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Dan Magenheimer <[email protected]>
> > Cc: Seth Jennings <[email protected]>
> > Cc: Nitin Gupta <[email protected]>
> > Cc: Konrad Rzeszutek Wilk <[email protected]>
> > Cc: Shaohua Li <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > include/linux/swap.h | 11 ++++++++---
> > mm/memory.c | 3 ++-
> > mm/swapfile.c | 11 +++++++----
> > mm/vmscan.c | 2 +-
> > 4 files changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2818a12..1f4df66 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
> > extern atomic_long_t nr_swap_pages;
> > extern long total_swap_pages;
> >
> > -/* Swap 50% full? Release swapcache more aggressively.. */
> > -static inline bool vm_swap_full(void)
> > +/*
> > + * Swap 50% full or fast backed device?
> > + * Release swapcache more aggressively.
> > + */
> > +static inline bool vm_swap_full(struct swap_info_struct *si)
> > {
> > + if (si->flags & SWP_SOLIDSTATE)
> > + return true;
> > return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
> > }
> >
> > @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
> > #define get_nr_swap_pages() 0L
> > #define total_swap_pages 0L
> > #define total_swapcache_pages() 0UL
> > -#define vm_swap_full() 0
> > +#define vm_swap_full(si) 0
> >
> > #define si_swapinfo(val) \
> > do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 705473a..1ca21a9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > mem_cgroup_commit_charge_swapin(page, ptr);
> >
> > swap_free(entry);
> > - if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> > + if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page))
> > + || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
> > try_to_free_swap(page);
> > unlock_page(page);
> > if (page != swapcache) {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 1bee6fa..f9cc701 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -293,7 +293,7 @@ checks:
> > scan_base = offset = si->lowest_bit;
> >
> > /* reuse swap entry of cache-only swap if not busy. */
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > int swap_was_freed;
> > spin_unlock(&si->lock);
> > swap_was_freed = __try_to_reclaim_swap(si, offset);
> > @@ -382,7 +382,8 @@ scan:
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) &&
> > + si->swap_map[offset] == SWAP_HAS_CACHE) {
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > @@ -397,7 +398,8 @@ scan:
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> > + if (vm_swap_full(si) &&
> > + si->swap_map[offset] == SWAP_HAS_CACHE) {
> > spin_lock(&si->lock);
> > goto checks;
> > }
> > @@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
> > * Also recheck PageSwapCache now page is locked (above).
> > */
> > if (PageSwapCache(page) && !PageWriteback(page) &&
> > - (!page_mapped(page) || vm_swap_full())) {
> > + (!page_mapped(page) ||
> > + vm_swap_full(page_swap_info(page)))) {
> > delete_from_swap_cache(page);
> > SetPageDirty(page);
> > }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index df78d17..145c59c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -933,7 +933,7 @@ cull_mlocked:
> >
> > activate_locked:
> > /* Not a candidate for swapping, so reclaim swap space. */
> > - if (PageSwapCache(page) && vm_swap_full())
> > + if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
> > try_to_free_swap(page);
> > VM_BUG_ON(PageActive(page));
> > SetPageActive(page);
> > --
> > 1.8.2

--
Kind regards,
Minchan Kim

2013-03-28 01:07:12

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hi Dan,

On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > From: Hugh Dickins [mailto:[email protected]]
> > Subject: Re: [RFC] mm: remove swapcache page early
> >
> > On Wed, 27 Mar 2013, Minchan Kim wrote:
> >
> > > Swap subsystem does lazy swap slot free with expecting the page
> > > would be swapped out again so we can't avoid unnecessary write.
> > so we can avoid unnecessary write.
> > >
> > > But the problem in in-memory swap is that it consumes memory space
> > > until vm_swap_full(ie, used half of all of swap device) condition
> > > meet. It could be bad if we use multiple swap device, small in-memory swap
> > > and big storage swap or in-memory swap alone.
> >
> > That is a very good realization: it's surprising that none of us
> > thought of it before - no disrespect to you, well done, thank you.
>
> Yes, my compliments also Minchan. This problem has been thought of before
> but this patch is the first to identify a possible solution.

Thanks!

>
> > And I guess swap readahead is utterly unhelpful in this case too.
>
> Yes... as is any "swap writeahead". Excuse my ignorance, but I
> think this is not done in the swap subsystem but instead the kernel
> assumes write-coalescing will be done in the block I/O subsystem,
> which means swap writeahead would affect zram but not zcache/zswap
> (since frontswap subverts the block I/O subsystem).

Frankly speaking, I don't know why you mentioned "swap writeahead"
at this point. Anyway, I doubt it would affect zram, either. The only gain
I can think of is that the compression ratio could be higher if multiple
pages were compressed all at once.

>
> However I think a swap-readahead solution would be helpful to
> zram as well as zcache/zswap.

Hmm, why? Swap readahead is just a hint to reduce long stalls on
storage with big seek overhead. But in-memory swap has no seek cost.
So unnecessary swap readahead can raise memory pressure, which could
cause other pages to be swapped out and lead to swap thrashing.
And for a good swap-readahead hit ratio, the swap device shouldn't be
fragmented. But as you know, many factors in the kernel prevent that now,
and Shaohua is tackling it.

>
> > > This patch changes vm_swap_full logic slightly so it could free
> > > swap slot early if the backed device is really fast.
> > > For it, I used SWP_SOLIDSTATE but It might be controversial.
> >
> > But I strongly disagree with almost everything in your patch :)
> > I disagree with addressing it in vm_swap_full(), I disagree that
> > it can be addressed by device, I disagree that it has anything to
> > do with SWP_SOLIDSTATE.
> >
> > This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> > is it? In those cases, a fixed amount of memory has been set aside
> > for swap, and it works out just like with disk block devices. The
> > memory set aside may be wasted, but that is accepted upfront.
>
> It is (I believe) also a problem with swapping to ram. Two
> copies of the same page are kept in memory in different places,
> right? Fixed vs variable size is irrelevant I think. Or am
> I misunderstanding something about swap-to-ram?
>
> > Similarly, this is not a problem with swapping to SSD. There might
> > or might not be other reasons for adjusting the vm_swap_full() logic
> > for SSD or generally, but those have nothing to do with this issue.
>
> I think it is at least highly related. The key issue is the
> tradeoff of the likelihood that the page will soon be read/written
> again while it is in swap cache vs the time/resource-usage necessary
> to "reconstitute" the page into swap cache. Reconstituting from disk
> requires a LOT of elapsed time. Reconstituting from
> an SSD likely takes much less time. Reconstituting from
> zcache/zram takes thousands of CPU cycles.

Yep. That's why I wanted to use SWP_SOLIDSTATE.

>
> > The problem here is peculiar to frontswap, and the variably sized
> > memory behind it, isn't it? We are accustomed to using swap to free
> > up memory by transferring its data to some other, cheaper but slower
> > resource.
>
> Frontswap does make the problem more complex because some pages
> are in "fairly fast" storage (zcache, needs decompression) and
> some are on the actual (usually) rotating media. Fortunately,
> differentiating between these two cases is just a table lookup
> (see frontswap_test).

Yep, I thought of it as a last resort because I'd like to avoid a
lookup on every swap-in if possible.

>
> > But in the case of frontswap and zmem (I'll say that to avoid thinking
> > through which backends are actually involved), it is not a cheaper and
> > slower resource, but the very same memory we are trying to save: swap
> > is stolen from the memory under reclaim, so any duplication becomes
> > counter-productive (if we ignore cpu compression/decompression costs:
> > I have no idea how fair it is to do so, but anyone who chooses zmem
> > is prepared to pay some cpu price for that).
>
> Exactly. There is some "robbing of Peter to pay Paul" and
> other complex resource tradeoffs. Presumably, though, it is
> not "the very same memory we are trying to save" but a
> fraction of it, saving the same page of data more efficiently
> in memory, using less than a page, at some CPU cost.
>
> > And because it's a frontswap thing, we cannot decide this by device:
> > frontswap may or may not stand in front of each device. There is no
> > problem with swapcache duplicated on disk (until that area approaches
> > being full or fragmented), but at the higher level we cannot see what
> > is in zmem and what is on disk: we only want to free up the zmem dup.
>
> I *think* frontswap_test(page) resolves this problem, as long as
> we have a specific page available to use as a parameter.

Agreed. I will use that method if we all agree on it, since there is no
better approach.
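
Roughly, the per-slot decision could look like the sketch below. This is only a user-space toy: the bitmap layout and every toy_* name are my assumptions for illustration, not the real frontswap_test() signature or data structures.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SLOTS 256
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Toy swap area: one bit per slot, set when the slot's data sits in zmem. */
struct toy_swap_info {
    unsigned long frontswap_map[TOY_MAX_SLOTS / TOY_BITS_PER_LONG + 1];
};

/* Record that the backend accepted the page stored at this offset. */
static void toy_frontswap_set(struct toy_swap_info *si, size_t offset)
{
    assert(offset < TOY_MAX_SLOTS);
    si->frontswap_map[offset / TOY_BITS_PER_LONG] |=
        1UL << (offset % TOY_BITS_PER_LONG);
}

/* The "table lookup": is this slot backed by zmem rather than disk? */
static bool toy_frontswap_test(const struct toy_swap_info *si, size_t offset)
{
    assert(offset < TOY_MAX_SLOTS);
    return (si->frontswap_map[offset / TOY_BITS_PER_LONG] >>
            (offset % TOY_BITS_PER_LONG)) & 1UL;
}

/*
 * Free the swap slot early on swap-in if swap is more than half used
 * (the existing rule) or if this particular slot is held in zmem.
 */
static bool toy_free_slot_on_swapin(const struct toy_swap_info *si,
                                    size_t offset, bool swap_half_full)
{
    return swap_half_full || toy_frontswap_test(si, offset);
}
```

The point of deciding per slot rather than per device is that a disk-backed slot behind the same swap device keeps the lazy-free behaviour, while a zmem-backed slot is released at once.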

>
> > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > copy of the page (to free up the compressed memory when possible) and
> > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > (setting page dirty so nothing will later go to read it from the
> > unfreed location on backing swap disk, which was never written).
>
> There are two duplication issues: (1) When can the page be removed
> from the swap cache after a call to frontswap_store; and (2) When
> can the page be removed from the frontswap storage after it
> has been brought back into memory via frontswap_load.
>
> This patch from Minchan addresses (1). The issue you are raising

No. I am addressing (2).

> here is (2). You may not know that (2) has recently been solved
> in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> If this is enabled (and it is for zcache but not yet for zswap),
> what you suggest (SetPageDirty) is what happens.

I'm not familiar with zcache, so I hadn't seen it. Anyway, I'd like to
address it for zram and zswap.

>
> > We cannot rely on freeing the swap itself, because in general there
> > may be multiple references to the swap, and we only satisfy the one
> > which has faulted. It may or may not be a good idea to use rmap to
> > locate the other places to insert pte in place of swap entry, to
> > resolve them all at once; but we have chosen not to do so in the
> > past, and there's no need for that, if the zmem gets invalidated
> > and the swapcache page set dirty.
>
> I see. Minchan's patch handles the removal "reactively"... it
> might be possible to handle it more proactively. Or it may
> be possible to take the number of references into account when
> deciding whether to frontswap_store the page as, presumably,
> the likelihood of needing to "reconstitute" the page sooner increases
> with each additional reference.
>
> > Hugh
>
> Very useful thoughts, Hugh. Thanks much and looking forward
> to more discussion at LSF/MM!

Dan, your thoughts are VERY useful. Thanks much and looking forward
to more discussion at LSF/MM!

--
Kind regards,
Minchan Kim

2013-03-28 01:18:30

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:[email protected]]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> > >
> > > On Wed, 27 Mar 2013, Minchan Kim wrote:
> > >
> > > > Swap subsystem does lazy swap slot free with expecting the page
> > > > would be swapped out again so we can't avoid unnecessary write.
> > > so we can avoid unnecessary write.
> > > >
> > > > But the problem in in-memory swap is that it consumes memory space
> > > > until vm_swap_full(ie, used half of all of swap device) condition
> > > > meet. It could be bad if we use multiple swap device, small in-memory swap
> > > > and big storage swap or in-memory swap alone.
> > >
> > > That is a very good realization: it's surprising that none of us
> > > thought of it before - no disrespect to you, well done, thank you.
> >
> > Yes, my compliments also Minchan. This problem has been thought of before
> > but this patch is the first to identify a possible solution.
> >
> > > And I guess swap readahead is utterly unhelpful in this case too.
> >
> > Yes... as is any "swap writeahead". Excuse my ignorance, but I
> > think this is not done in the swap subsystem but instead the kernel
> > assumes write-coalescing will be done in the block I/O subsystem,
> > which means swap writeahead would affect zram but not zcache/zswap
> > (since frontswap subverts the block I/O subsystem).
>
> I don't know what swap writeahead is; but write coalescing, yes.
> I don't see any problem with it in this context.
>
> >
> > However I think a swap-readahead solution would be helpful to
> > zram as well as zcache/zswap.
>
> Whereas swap readahead on zmem is uncompressing zmem to pagecache
> which may never be needed, and may take a circuit of the inactive
> LRU before it gets reclaimed (if it turns out not to be needed,
> at least it will remain clean and be easily reclaimed).

But it could evict more important pages before it reaches the tail of
the LRU. That's something we really want to avoid if possible.

>
> >
> > > > This patch changes vm_swap_full logic slightly so it could free
> > > > swap slot early if the backed device is really fast.
> > > > For it, I used SWP_SOLIDSTATE but It might be controversial.
> > >
> > > But I strongly disagree with almost everything in your patch :)
> > > I disagree with addressing it in vm_swap_full(), I disagree that
> > > it can be addressed by device, I disagree that it has anything to
> > > do with SWP_SOLIDSTATE.
> > >
> > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> > > is it? In those cases, a fixed amount of memory has been set aside
> > > for swap, and it works out just like with disk block devices. The
> > > memory set aside may be wasted, but that is accepted upfront.
> >
> > It is (I believe) also a problem with swapping to ram. Two
> > copies of the same page are kept in memory in different places,
> > right? Fixed vs variable size is irrelevant I think. Or am
> > I misunderstanding something about swap-to-ram?
>
> I may be misremembering how /dev/ram0 works, or simply assuming that
> if you want to use it for swap (interesting for testing, but probably
> not for general use), then you make sure to allocate each page of it
> in advance.
>
> The pages of /dev/ram0 don't get freed, or not before it's closed
> (swapoff'ed) anyway. Yes, swapcache would be duplicating data from
> other memory into /dev/ram0 memory; but that /dev/ram0 memory has
> been set aside for this purpose, and removing from swapcache won't
> free any more memory.
>
> >
> > > Similarly, this is not a problem with swapping to SSD. There might
> > > or might not be other reasons for adjusting the vm_swap_full() logic
> > > for SSD or generally, but those have nothing to do with this issue.
> >
> > I think it is at least highly related. The key issue is the
> > tradeoff of the likelihood that the page will soon be read/written
> > again while it is in swap cache vs the time/resource-usage necessary
> > to "reconstitute" the page into swap cache. Reconstituting from disk
> > requires a LOT of elapsed time. Reconstituting from
> > an SSD likely takes much less time. Reconstituting from
> > zcache/zram takes thousands of CPU cycles.
>
> I acknowledge my complete ignorance of how to judge the tradeoff
> between memory usage and cpu usage, but I think Minchan's main
> concern was with the memory usage. Neither hard disk nor SSD
> is occupying memory.

Hmm, it seems I misunderstood Dan's opinion in the previous mail.
You're right, Hugh. My main concern is memory usage, but my rationale for
using SWP_SOLIDSTATE is that writing to an SSD could be cheaper than writing
to rotating storage. Yes, it depends on the SSD's internal FTL algorithm and
its fragmentation ratio due to wear-leveling. That's why I said "It might be
controversial".
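
For reference, the difference between the existing check and the RFC's check can be modelled in user space like this. It is a sketch only: the toy_* names are hypothetical, TOY_SWP_SOLIDSTATE is a placeholder bit rather than the kernel's value, and plain longs stand in for the kernel's counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SWP_SOLIDSTATE 0x1u   /* placeholder bit, not the kernel's value */

struct toy_swap_info {
    unsigned int flags;
};

static long toy_nr_swap_pages;     /* free swap slots, as in the kernel */
static long toy_total_swap_pages;  /* all swap slots */

/* Existing rule: "full" once fewer than half of the slots are free. */
static bool toy_vm_swap_full_old(void)
{
    return toy_nr_swap_pages * 2 < toy_total_swap_pages;
}

/* RFC rule: a device flagged as fast always counts as "full". */
static bool toy_vm_swap_full_rfc(const struct toy_swap_info *si)
{
    assert(si != NULL);
    if (si->flags & TOY_SWP_SOLIDSTATE)
        return true;
    return toy_vm_swap_full_old();
}
```

So with 900 of 1000 slots free, the old rule keeps the lazy behaviour everywhere, while the RFC rule already frees slots early on any device carrying the flag, which is exactly the part that might be controversial for real SSDs.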

>
> >
> > > The problem here is peculiar to frontswap, and the variably sized
> > > memory behind it, isn't it? We are accustomed to using swap to free
> > > up memory by transferring its data to some other, cheaper but slower
> > > resource.
> >
> > Frontswap does make the problem more complex because some pages
> > are in "fairly fast" storage (zcache, needs decompression) and
> > some are on the actual (usually) rotating media. Fortunately,
> > differentiating between these two cases is just a table lookup
> > (see frontswap_test).
> >
> > > But in the case of frontswap and zmem (I'll say that to avoid thinking
> > > through which backends are actually involved), it is not a cheaper and
> > > slower resource, but the very same memory we are trying to save: swap
> > > is stolen from the memory under reclaim, so any duplication becomes
> > > counter-productive (if we ignore cpu compression/decompression costs:
> > > I have no idea how fair it is to do so, but anyone who chooses zmem
> > > is prepared to pay some cpu price for that).
> >
> > Exactly. There is some "robbing of Peter to pay Paul" and
> > other complex resource tradeoffs. Presumably, though, it is
> > not "the very same memory we are trying to save" but a
> > fraction of it, saving the same page of data more efficiently
> > in memory, using less than a page, at some CPU cost.
>
> Yes, I'm not saying that frontswap/zmem is pointless: just agreeing
> with Minchan that in this case the duplication inherent in swapcache
> can be waste of memory that we should try to avoid.
>
> >
> > > And because it's a frontswap thing, we cannot decide this by device:
> > > frontswap may or may not stand in front of each device. There is no
> > > problem with swapcache duplicated on disk (until that area approaches
> > > being full or fragmented), but at the higher level we cannot see what
> > > is in zmem and what is on disk: we only want to free up the zmem dup.
> >
> > I *think* frontswap_test(page) resolves this problem, as long as
> > we have a specific page available to use as a parameter.
> >
> > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > copy of the page (to free up the compressed memory when possible) and
> > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > (setting page dirty so nothing will later go to read it from the
> > > unfreed location on backing swap disk, which was never written).
> >
> > There are two duplication issues: (1) When can the page be removed
> > from the swap cache after a call to frontswap_store; and (2) When
> > can the page be removed from the frontswap storage after it
> > has been brought back into memory via frontswap_load.
> >
> > This patch from Minchan addresses (1).
>
> Ying Han was reminding me of this case a couple of hours ago, we don't
> see a problem there: when frontswap_store() succeeds, there's an
> end_page_writeback() as there should be, and shrink_page_list()
> should reclaim the page immediately. So I think (1) is already
> handled and Minchan was not trying to address it.

Absolutely.

>
> > The issue you are raising
> > here is (2). You may not know that (2) has recently been solved
> > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > If this is enabled (and it is for zcache but not yet for zswap),
> > what you suggest (SetPageDirty) is what happens.
>
> Ah, and I have a dim, perhaps mistaken, memory that I gave you
> input on that before, suggesting the SetPageDirty. Good, sounds
> like the solution is already in place, if not actually activated.
>
> Thanks, must dash,
> Hugh
>
> >
> > > We cannot rely on freeing the swap itself, because in general there
> > > may be multiple references to the swap, and we only satisfy the one
> > > which has faulted. It may or may not be a good idea to use rmap to
> > > locate the other places to insert pte in place of swap entry, to
> > > resolve them all at once; but we have chosen not to do so in the
> > > past, and there's no need for that, if the zmem gets invalidated
> > > and the swapcache page set dirty.
> >
> > I see. Minchan's patch handles the removal "reactively"... it
> > might be possible to handle it more proactively. Or it may
> > be possible to take the number of references into account when
> > deciding whether to frontswap_store the page as, presumably,
> > the likelihood of needing to "reconstitute" the page sooner increases
> > with each additional reference.
> >
> > > Hugh
> >
> > Very useful thoughts, Hugh. Thanks much and looking forward
> > to more discussion at LSF/MM!
> >
> > Dan
> >
> > P.S. When I refer to zcache, I am referring to the version in
> > drivers/staging/zcache in 3.9. The code in drivers/staging/zcache
> > in 3.8 is "old zcache"... "new zcache" is in drivers/staging/ramster
> > in 3.8. Sorry for any confusion...
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-03-28 01:36:42

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hi Seth,

On Wed, Mar 27, 2013 at 12:19:11PM -0500, Seth Jennings wrote:
> On 03/26/2013 09:22 PM, Minchan Kim wrote:
> > Swap subsystem does lazy swap slot free with expecting the page
> > would be swapped out again so we can't avoid unnecessary write.
> >
> > But the problem in in-memory swap is that it consumes memory space
> > until vm_swap_full(ie, used half of all of swap device) condition
> > meet. It could be bad if we use multiple swap device, small in-memory swap
> > and big storage swap or in-memory swap alone.
> >
> > This patch changes vm_swap_full logic slightly so it could free
> > swap slot early if the backed device is really fast.
>
> Great idea!

Thanks!

>
> > For it, I used SWP_SOLIDSTATE but It might be controversial.
>
> The comment for SWP_SOLIDSTATE is that "blkdev seeks are cheap". Just
> because seeks are cheap doesn't mean the read itself is also cheap.

The "read" isn't the concern; the "write" is.

> For example, QUEUE_FLAG_NONROT is set for mmc devices, but some of
> them can be pretty slow.

Yep.

>
> > So let's add Ccing Shaohua and Hugh.
> > If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> > or something for z* family.
>
> Afaict, setting SWP_SOLIDSTATE depends on characteristics of the
> underlying block device (i.e. blk_queue_nonrot()). zram is a block
> device but zcache and zswap are not.
>
> Any idea by what criteria SWP_INMEMORY would be set?

Just in-memory swap, zram, zswap and zcache at the moment. :)

>
> Also, frontswap backends (zcache and zswap) are a caching layer on top
> of the real swap device, which might actually be rotating media. So
> you have the issue of two different characteristics, in-memory caching
> on top of rotating media, present in a single swap device.

Please read my patch completely. I already pointed out the problem and
Hugh and Dan are suggesting ideas.

Thanks!

>
> Thanks,
> Seth
>

--
Kind regards,
Minchan Kim

2013-03-28 02:02:26

by Shaohua Li

Subject: Re: [RFC] mm: remove swapcache page early

On Thu, Mar 28, 2013 at 10:18:24AM +0900, Minchan Kim wrote:
> On Wed, Mar 27, 2013 at 04:16:48PM -0700, Hugh Dickins wrote:
> > On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > > > From: Hugh Dickins [mailto:[email protected]]
> > > > Subject: Re: [RFC] mm: remove swapcache page early
> > > >
> > > > On Wed, 27 Mar 2013, Minchan Kim wrote:
> > > >
> > > > > Swap subsystem does lazy swap slot free with expecting the page
> > > > > would be swapped out again so we can't avoid unnecessary write.
> > > > so we can avoid unnecessary write.
> > > > >
> > > > > But the problem in in-memory swap is that it consumes memory space
> > > > > until vm_swap_full(ie, used half of all of swap device) condition
> > > > > meet. It could be bad if we use multiple swap device, small in-memory swap
> > > > > and big storage swap or in-memory swap alone.
> > > >
> > > > That is a very good realization: it's surprising that none of us
> > > > thought of it before - no disrespect to you, well done, thank you.
> > >
> > > Yes, my compliments also Minchan. This problem has been thought of before
> > > but this patch is the first to identify a possible solution.
> > >
> > > > And I guess swap readahead is utterly unhelpful in this case too.
> > >
> > > Yes... as is any "swap writeahead". Excuse my ignorance, but I
> > > think this is not done in the swap subsystem but instead the kernel
> > > assumes write-coalescing will be done in the block I/O subsystem,
> > > which means swap writeahead would affect zram but not zcache/zswap
> > > (since frontswap subverts the block I/O subsystem).
> >
> > I don't know what swap writeahead is; but write coalescing, yes.
> > I don't see any problem with it in this context.
> >
> > >
> > > However I think a swap-readahead solution would be helpful to
> > > zram as well as zcache/zswap.
> >
> > Whereas swap readahead on zmem is uncompressing zmem to pagecache
> > which may never be needed, and may take a circuit of the inactive
> > LRU before it gets reclaimed (if it turns out not to be needed,
> > at least it will remain clean and be easily reclaimed).
>
> But it could evict more important pages before reaching out the tail.
> That's thing we really want to avoid if possible.
>
> >
> > >
> > > > > This patch changes vm_swap_full logic slightly so it could free
> > > > > swap slot early if the backed device is really fast.
> > > > > For it, I used SWP_SOLIDSTATE but It might be controversial.
> > > >
> > > > But I strongly disagree with almost everything in your patch :)
> > > > I disagree with addressing it in vm_swap_full(), I disagree that
> > > > it can be addressed by device, I disagree that it has anything to
> > > > do with SWP_SOLIDSTATE.
> > > >
> > > > This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> > > > is it? In those cases, a fixed amount of memory has been set aside
> > > > for swap, and it works out just like with disk block devices. The
> > > > memory set aside may be wasted, but that is accepted upfront.
> > >
> > > It is (I believe) also a problem with swapping to ram. Two
> > > copies of the same page are kept in memory in different places,
> > > right? Fixed vs variable size is irrelevant I think. Or am
> > > I misunderstanding something about swap-to-ram?
> >
> > I may be misremembering how /dev/ram0 works, or simply assuming that
> > if you want to use it for swap (interesting for testing, but probably
> > not for general use), then you make sure to allocate each page of it
> > in advance.
> >
> > The pages of /dev/ram0 don't get freed, or not before it's closed
> > (swapoff'ed) anyway. Yes, swapcache would be duplicating data from
> > other memory into /dev/ram0 memory; but that /dev/ram0 memory has
> > been set aside for this purpose, and removing from swapcache won't
> > free any more memory.
> >
> > >
> > > > Similarly, this is not a problem with swapping to SSD. There might
> > > > or might not be other reasons for adjusting the vm_swap_full() logic
> > > > for SSD or generally, but those have nothing to do with this issue.
> > >
> > > I think it is at least highly related. The key issue is the
> > > tradeoff of the likelihood that the page will soon be read/written
> > > again while it is in swap cache vs the time/resource-usage necessary
> > > to "reconstitute" the page into swap cache. Reconstituting from disk
> > > requires a LOT of elapsed time. Reconstituting from
> > > an SSD likely takes much less time. Reconstituting from
> > > zcache/zram takes thousands of CPU cycles.
> >
> > I acknowledge my complete ignorance of how to judge the tradeoff
> > between memory usage and cpu usage, but I think Minchan's main
> > concern was with the memory usage. Neither hard disk nor SSD
> > is occupying memory.
>
> Hmm, it seems I misunderstood Dan's opinion in the previous thread.
> You're right, Hugh. My main concern is memory usage, but my rationale
> for using SWP_SOLIDSTATE was that writing to an SSD could be cheap
> compared with rotating storage. Yes, that depends on the SSD's internal
> FTL algorithm and its fragmentation ratio due to wear-leveling. That's
> why I said "It might be controversial".

Even if an SSD is fast, there is a tradeoff. Unnecessary writes to an SSD
should be avoided where possible, because writes make it wear out faster and
can make subsequent writes slower (if garbage collection runs).

Thanks,
Shaohua

2013-03-28 17:37:35

by Dan Magenheimer

Subject: RE: [RFC] mm: remove swapcache page early

> From: Hugh Dickins [mailto:[email protected]]
> Subject: RE: [RFC] mm: remove swapcache page early
>
> On Wed, 27 Mar 2013, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:[email protected]]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> > >
> > The issue you are raising
> > here is (2). You may not know that (2) has recently been solved
> > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > If this is enabled (and it is for zcache but not yet for zswap),
> > what you suggest (SetPageDirty) is what happens.
>
> Ah, and I have a dim, perhaps mistaken, memory that I gave you
> input on that before, suggesting the SetPageDirty. Good, sounds
> like the solution is already in place, if not actually activated.
>
> Thanks, must dash,
> Hugh

Hi Hugh --

Credit where it is due... Yes, I do recall now that the idea
was originally yours. It went on a to-do list where I eventually
tried it and it worked... I'm sorry I had forgotten and neglected
to give you credit!

(BTW, it is activated for zcache in 3.9.)

Thanks,
Dan

2013-03-28 18:19:48

by Dan Magenheimer

Subject: RE: [RFC] mm: remove swapcache page early

> From: Minchan Kim [mailto:[email protected]]
> Subject: Re: [RFC] mm: remove swapcache page early
>
> Hi Dan,
>
> On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > From: Hugh Dickins [mailto:[email protected]]
> > > Subject: Re: [RFC] mm: remove swapcache page early
> > >
> > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > copy of the page (to free up the compressed memory when possible) and
> > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > (setting page dirty so nothing will later go to read it from the
> > > unfreed location on backing swap disk, which was never written).
> >
> > There are two duplication issues: (1) When can the page be removed
> > from the swap cache after a call to frontswap_store; and (2) When
> > can the page be removed from the frontswap storage after it
> > has been brought back into memory via frontswap_load.
> >
> > This patch from Minchan addresses (1). The issue you are raising
>
> No. I am addressing (2).
>
> > here is (2). You may not know that (2) has recently been solved
> > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > If this is enabled (and it is for zcache but not yet for zswap),
> > what you suggest (SetPageDirty) is what happens.
>
> I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> on zram and zswap.

Zswap can enable it trivially by adding a function call in init_zswap.
(Note that it is not enabled by default for all frontswap backends
because it is another complicated tradeoff of cpu time vs memory space
that needs more study on a broad set of workloads.)
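
The exclusive-gets behavior described above (SetPageDirty on the page, plus dropping the backend copy once it has been loaded) can be illustrated with a small user-space model. This is a minimal sketch under stated assumptions: `backend_model`, `page_model`, `model_store()` and `model_load()` are hypothetical names for illustration and are not the frontswap API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_SLOTS 8
#define PAGE_BYTES 16	/* tiny "page" to keep the model small */

/* Hypothetical model of a frontswap backend: one optional copy per slot. */
struct backend_model {
	bool present[NR_SLOTS];
	char data[NR_SLOTS][PAGE_BYTES];
	bool exclusive_gets;	/* models the exclusive-gets setting */
};

/* Hypothetical model of the page brought back into the swap cache. */
struct page_model {
	char data[PAGE_BYTES];
	bool uptodate;
	bool dirty;
};

/* Store a copy of the page in the backend (models a frontswap store). */
static void model_store(struct backend_model *be, unsigned int slot,
			const struct page_model *page)
{
	assert(slot < NR_SLOTS);
	memcpy(be->data[slot], page->data, PAGE_BYTES);
	be->present[slot] = true;
}

/*
 * Load the page back (models a frontswap load). With exclusive gets, the
 * backend copy is dropped and the page is marked dirty, so a later reclaim
 * must write it out again instead of relying on the now-missing copy.
 * Returns 0 on success, -1 if the slot holds no copy.
 */
static int model_load(struct backend_model *be, unsigned int slot,
		      struct page_model *page)
{
	assert(slot < NR_SLOTS);
	if (!be->present[slot])
		return -1;
	memcpy(page->data, be->data[slot], PAGE_BYTES);
	page->uptodate = true;
	if (be->exclusive_gets) {
		page->dirty = true;
		be->present[slot] = false;
	}
	return 0;
}
```

The cost side of the tradeoff is visible in the model too: once the backend copy is gone, a page that is reclaimed again unmodified has to be stored (and compressed) a second time.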

I wonder if something like this would have a similar result for zram?
(Completely untested... snippet stolen from swap_entry_free with
SetPageDirty added... doesn't compile yet, but should give you the idea.)

diff --git a/mm/page_io.c b/mm/page_io.c
index 56276fe..2d10988 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
 				iminor(bio->bi_bdev->bd_inode),
 				(unsigned long long)bio->bi_sector);
 	} else {
+		struct swap_info_struct *sis;
+
 		SetPageUptodate(page);
+		sis = page_swap_info(page);
+		if (sis->flags & SWP_BLKDEV) {
+			struct gendisk *disk = sis->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify) {
+				SetPageDirty(page);
+				disk->fops->swap_slot_free_notify(sis->bdev,
+						offset);
+			}
+		}
 	}
 	unlock_page(page);
 	bio_put(bio);

2013-03-29 01:18:06

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
> > From: Minchan Kim [mailto:[email protected]]
> > Subject: Re: [RFC] mm: remove swapcache page early
> >
> > Hi Dan,
> >
> > On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > > From: Hugh Dickins [mailto:[email protected]]
> > > > Subject: Re: [RFC] mm: remove swapcache page early
> > > >
> > > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > > copy of the page (to free up the compressed memory when possible) and
> > > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > > (setting page dirty so nothing will later go to read it from the
> > > > unfreed location on backing swap disk, which was never written).
> > >
> > > There are two duplication issues: (1) When can the page be removed
> > > from the swap cache after a call to frontswap_store; and (2) When
> > > can the page be removed from the frontswap storage after it
> > > has been brought back into memory via frontswap_load.
> > >
> > > This patch from Minchan addresses (1). The issue you are raising
> >
> > No. I am addressing (2).
> >
> > > here is (2). You may not know that (2) has recently been solved
> > > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > > If this is enabled (and it is for zcache but not yet for zswap),
> > > what you suggest (SetPageDirty) is what happens.
> >
> > I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> > on zram and zswap.
>
> Zswap can enable it trivially by adding a function call in init_zswap.
> (Note that it is not enabled by default for all frontswap backends
> because it is another complicated tradeoff of cpu time vs memory space
> that needs more study on a broad set of workloads.)
>
> I wonder if something like this would have a similar result for zram?
> (Completely untested... snippet stolen from swap_entry_free with
> SetPageDirty added... doesn't compile yet, but should give you the idea.)

Nice idea!

After seeing your patch, I realized it was Hugh's suggestion and
you implemented it in the proper place.

I will resend it after testing, maybe next week.
Thanks!

>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 56276fe..2d10988 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
>  				iminor(bio->bi_bdev->bd_inode),
>  				(unsigned long long)bio->bi_sector);
>  	} else {
> +		struct swap_info_struct *sis;
> +
>  		SetPageUptodate(page);
> +		sis = page_swap_info(page);
> +		if (sis->flags & SWP_BLKDEV) {
> +			struct gendisk *disk = sis->bdev->bd_disk;
> +			if (disk->fops->swap_slot_free_notify) {
> +				SetPageDirty(page);
> +				disk->fops->swap_slot_free_notify(sis->bdev,
> +						offset);
> +			}
> +		}
>  	}
>  	unlock_page(page);
>  	bio_put(bio);
>

--
Kind regards,
Minchan Kim

2013-03-29 20:01:42

by Hugh Dickins

Subject: Re: [RFC] mm: remove swapcache page early

On Fri, 29 Mar 2013, Minchan Kim wrote:
> On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
> > > From: Minchan Kim [mailto:[email protected]]
> > > On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > > > From: Hugh Dickins [mailto:[email protected]]
> > > > > Subject: Re: [RFC] mm: remove swapcache page early
> > > > >
> > > > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > > > copy of the page (to free up the compressed memory when possible) and
> > > > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > > > (setting page dirty so nothing will later go to read it from the
> > > > > unfreed location on backing swap disk, which was never written).
> > > >
> > > > There are two duplication issues: (1) When can the page be removed
> > > > from the swap cache after a call to frontswap_store; and (2) When
> > > > can the page be removed from the frontswap storage after it
> > > > has been brought back into memory via frontswap_load.
> > > >
> > > > This patch from Minchan addresses (1). The issue you are raising
> > >
> > > No. I am addressing (2).
> > >
> > > > here is (2). You may not know that (2) has recently been solved
> > > > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > > > If this is enabled (and it is for zcache but not yet for zswap),
> > > > what you suggest (SetPageDirty) is what happens.
> > >
> > > I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> > > on zram and zswap.
> >
> > Zswap can enable it trivially by adding a function call in init_zswap.
> > (Note that it is not enabled by default for all frontswap backends
> > because it is another complicated tradeoff of cpu time vs memory space
> > that needs more study on a broad set of workloads.)
> >
> > I wonder if something like this would have a similar result for zram?
> > (Completely untested... snippet stolen from swap_entry_free with
> > SetPageDirty added... doesn't compile yet, but should give you the idea.)

Thanks for correcting me on zram (in earlier mail of this thread), yes,
I was forgetting about the swap_slot_free_notify entry point which lets
that memory be freed.

>
> Nice idea!
>
> After seeing your patch, I realized it was Hugh's suggestion and
> you implemented it in the proper place.
>
> I will resend it after testing, maybe next week.
> Thanks!

Be careful, although Dan is right that something like this can be
done for zram, I believe you will find that it needs a little more:
either a separate new entry point (not my preference) or a flags arg
(or boolean) added to swap_slot_free_notify.

Because this is a different operation: end_swap_bio_read() wants
to free up zram's compressed copy of the page, but the swp_entry_t
must remain valid until swap_entry_free() can clear up the rest.
Precisely how much of the work each should do, you will discover.

Hugh

>
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 56276fe..2d10988 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -81,7 +81,17 @@ void end_swap_bio_read(struct bio *bio, int err)
> >  				iminor(bio->bi_bdev->bd_inode),
> >  				(unsigned long long)bio->bi_sector);
> >  	} else {
> > +		struct swap_info_struct *sis;
> > +
> >  		SetPageUptodate(page);
> > +		sis = page_swap_info(page);
> > +		if (sis->flags & SWP_BLKDEV) {
> > +			struct gendisk *disk = sis->bdev->bd_disk;
> > +			if (disk->fops->swap_slot_free_notify) {
> > +				SetPageDirty(page);
> > +				disk->fops->swap_slot_free_notify(sis->bdev,
> > +						offset);
> > +			}
> > +		}
> >  	}
> >  	unlock_page(page);
> >  	bio_put(bio);

2013-04-02 02:04:33

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hi Hugh,

On Fri, Mar 29, 2013 at 01:01:14PM -0700, Hugh Dickins wrote:
> On Fri, 29 Mar 2013, Minchan Kim wrote:
> > On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
> > > > From: Minchan Kim [mailto:[email protected]]
> > > > On Wed, Mar 27, 2013 at 03:24:00PM -0700, Dan Magenheimer wrote:
> > > > > > From: Hugh Dickins [mailto:[email protected]]
> > > > > > Subject: Re: [RFC] mm: remove swapcache page early
> > > > > >
> > > > > > I believe the answer is for frontswap/zmem to invalidate the frontswap
> > > > > > copy of the page (to free up the compressed memory when possible) and
> > > > > > SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> > > > > > (setting page dirty so nothing will later go to read it from the
> > > > > > unfreed location on backing swap disk, which was never written).
> > > > >
> > > > > There are two duplication issues: (1) When can the page be removed
> > > > > from the swap cache after a call to frontswap_store; and (2) When
> > > > > can the page be removed from the frontswap storage after it
> > > > > has been brought back into memory via frontswap_load.
> > > > >
> > > > > This patch from Minchan addresses (1). The issue you are raising
> > > >
> > > > No. I am addressing (2).
> > > >
> > > > > here is (2). You may not know that (2) has recently been solved
> > > > > in frontswap, at least for zcache. See frontswap_exclusive_gets_enabled.
> > > > > If this is enabled (and it is for zcache but not yet for zswap),
> > > > > what you suggest (SetPageDirty) is what happens.
> > > >
> > > > I am blind on zcache so I didn't see it. Anyway, I'd like to address it
> > > > on zram and zswap.
> > >
> > > Zswap can enable it trivially by adding a function call in init_zswap.
> > > (Note that it is not enabled by default for all frontswap backends
> > > because it is another complicated tradeoff of cpu time vs memory space
> > > that needs more study on a broad set of workloads.)
> > >
> > > I wonder if something like this would have a similar result for zram?
> > > (Completely untested... snippet stolen from swap_entry_free with
> > > SetPageDirty added... doesn't compile yet, but should give you the idea.)
>
> Thanks for correcting me on zram (in earlier mail of this thread), yes,
> I was forgetting about the swap_slot_free_notify entry point which lets
> that memory be freed.
>
> >
> > Nice idea!
> >
> > After seeing your patch, I realized it was Hugh's suggestion and
> > you implemented it in the proper place.
> >
> > I will resend it after testing, maybe next week.
> > Thanks!
>
> Be careful, although Dan is right that something like this can be
> done for zram, I believe you will find that it needs a little more:
> either a separate new entry point (not my preference) or a flags arg
> (or boolean) added to swap_slot_free_notify.
>
> Because this is a different operation: end_swap_bio_read() wants
> to free up zram's compressed copy of the page, but the swp_entry_t
> must remain valid until swap_entry_free() can clear up the rest.
> Precisely how much of the work each should do, you will discover.

First of all, thanks for pointing that out!

If I parse your concern correctly, you are concerned about the
different semantics of the two callers
(end_swap_bio_read's swap_slot_free_notify vs. swap_entry_free's one).

But the current implementation of zram_slot_free_notify happens to cover
both cases properly.

The zram_free_page triggered by end_swap_bio_read will free the compressed
copy of the page, and the zram_free_page triggered later by swap_entry_free
won't find a valid entry at that index in zram->table and will just return.
So I think there is no problem.

The remaining problem is zram->stats.notify_free, which could be counted
twice, but I'm not sure exact counting is valuable.
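
The argument above (a second free on the same index is harmless, but the notify counter is bumped twice) can be checked against a small user-space model. This is a sketch under stated assumptions: `zram_model`, `model_free_page()` and `model_slot_free_notify()` are hypothetical stand-ins for zram's table, zram_free_page and zram_slot_free_notify, not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_SLOTS 8

/* Hypothetical model of zram: one compressed-object handle per slot. */
struct zram_model {
	void *handle[NR_SLOTS];		/* NULL when the slot holds nothing */
	unsigned long notify_free;	/* models zram->stats.notify_free */
};

/* Models zram_free_page: a slot with no handle is simply skipped. */
static void model_free_page(struct zram_model *z, size_t index)
{
	assert(index < NR_SLOTS);
	if (!z->handle[index])
		return;
	free(z->handle[index]);
	z->handle[index] = NULL;
}

/* Models zram_slot_free_notify: frees the slot and counts every call. */
static void model_slot_free_notify(struct zram_model *z, size_t index)
{
	model_free_page(z, index);
	z->notify_free++;
}
```

Calling `model_slot_free_notify()` twice on the same index leaves the slot empty after the first call and does nothing harmful on the second, but the counter ends at 2, which is the redundant counting mentioned above.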

If I missed your point, please pinpoint your concern. :)

Thanks!
--
Kind regards,
Minchan Kim

2013-04-02 05:14:24

by Hugh Dickins

Subject: Re: [RFC] mm: remove swapcache page early

On Tue, 2 Apr 2013, Minchan Kim wrote:
> On Fri, Mar 29, 2013 at 01:01:14PM -0700, Hugh Dickins wrote:
> > On Fri, 29 Mar 2013, Minchan Kim wrote:
> > > On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
> > > >
> > > > I wonder if something like this would have a similar result for zram?
> > > > (Completely untested... snippet stolen from swap_entry_free with
> > > > SetPageDirty added... doesn't compile yet, but should give you the idea.)
> >
> > Be careful, although Dan is right that something like this can be
> > done for zram, I believe you will find that it needs a little more:
> > either a separate new entry point (not my preference) or a flags arg
> > (or boolean) added to swap_slot_free_notify.
> >
> > Because this is a different operation: end_swap_bio_read() wants
> > to free up zram's compressed copy of the page, but the swp_entry_t
> > must remain valid until swap_entry_free() can clear up the rest.
> > Precisely how much of the work each should do, you will discover.
>
> First of all, thanks for pointing that out!
>
> If I parse your concern correctly, you are concerned about the
> different semantics of the two callers
> (end_swap_bio_read's swap_slot_free_notify vs. swap_entry_free's one).
>
> But the current implementation of zram_slot_free_notify happens to cover
> both cases properly.
>
> The zram_free_page triggered by end_swap_bio_read will free the compressed
> copy of the page, and the zram_free_page triggered later by swap_entry_free
> won't find a valid entry at that index in zram->table and will just return.
> So I think there is no problem.
>
> The remaining problem is zram->stats.notify_free, which could be counted
> twice, but I'm not sure exact counting is valuable.
>
> If I missed your point, please pinpoint your concern. :)

Looking at it again, I do believe you and Dan are perfectly correct,
and I was again the confused one. Though I'd be happier if I could
see just how I was misreading it: makes me wonder if I had a great
insight that I can no longer grasp hold of! I think I was paranoid
about a swp_entry_t getting recycled prematurely: but swap_entry_free
remains in control of that - freeing a swap entry is no part of what
notify_free gets up to. Sorry for wasting your time.
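
The separation described above (the notify hook only drops the backend copy, while the entry's lifetime is governed by its reference count) can be sketched as a user-space model. The names `swap_model`, `model_notify_free()`, `model_entry_free()` and `model_slot_reusable()` are hypothetical and chosen for illustration; the real swap_map accounting is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Hypothetical model: per-slot reference count plus a backend-copy flag. */
struct swap_model {
	unsigned char count[NR_SLOTS];	/* references to the swap entry */
	bool has_data[NR_SLOTS];	/* backend still holds a copy */
};

/* Models the notify hook: drops the backend copy, never touches count. */
static void model_notify_free(struct swap_model *s, size_t slot)
{
	assert(slot < NR_SLOTS);
	s->has_data[slot] = false;
}

/* Models dropping one reference; returns the remaining count. */
static unsigned char model_entry_free(struct swap_model *s, size_t slot)
{
	assert(slot < NR_SLOTS && s->count[slot] > 0);
	if (--s->count[slot] == 0)
		model_notify_free(s, slot);
	return s->count[slot];
}

/* A slot may be handed out again only when no reference remains. */
static bool model_slot_reusable(const struct swap_model *s, size_t slot)
{
	assert(slot < NR_SLOTS);
	return s->count[slot] == 0;
}
```

In this model an early `model_notify_free()` on a slot with two references leaves the slot non-reusable until both references have gone through `model_entry_free()`, so the entry cannot be recycled prematurely.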

Hugh

2013-04-02 05:56:13

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

On Mon, Apr 01, 2013 at 10:13:58PM -0700, Hugh Dickins wrote:
> On Tue, 2 Apr 2013, Minchan Kim wrote:
> > On Fri, Mar 29, 2013 at 01:01:14PM -0700, Hugh Dickins wrote:
> > > On Fri, 29 Mar 2013, Minchan Kim wrote:
> > > > On Thu, Mar 28, 2013 at 11:19:12AM -0700, Dan Magenheimer wrote:
> > > > >
> > > > > I wonder if something like this would have a similar result for zram?
> > > > > (Completely untested... snippet stolen from swap_entry_free with
> > > > > SetPageDirty added... doesn't compile yet, but should give you the idea.)
> > >
> > > Be careful, although Dan is right that something like this can be
> > > done for zram, I believe you will find that it needs a little more:
> > > either a separate new entry point (not my preference) or a flags arg
> > > (or boolean) added to swap_slot_free_notify.
> > >
> > > Because this is a different operation: end_swap_bio_read() wants
> > > to free up zram's compressed copy of the page, but the swp_entry_t
> > > must remain valid until swap_entry_free() can clear up the rest.
> > > Precisely how much of the work each should do, you will discover.
> >
> > First of all, thanks for pointing that out!
> >
> > If I parse your concern correctly, you are concerned about the
> > different semantics of the two callers
> > (end_swap_bio_read's swap_slot_free_notify vs. swap_entry_free's one).
> >
> > But the current implementation of zram_slot_free_notify happens to cover
> > both cases properly.
> >
> > The zram_free_page triggered by end_swap_bio_read will free the compressed
> > copy of the page, and the zram_free_page triggered later by swap_entry_free
> > won't find a valid entry at that index in zram->table and will just return.
> > So I think there is no problem.
> >
> > The remaining problem is zram->stats.notify_free, which could be counted
> > twice, but I'm not sure exact counting is valuable.
> >
> > If I missed your point, please pinpoint your concern. :)
>
> Looking at it again, I do believe you and Dan are perfectly correct,
> and I was again the confused one. Though I'd be happier if I could
> see just how I was misreading it: makes me wonder if I had a great
> insight that I can no longer grasp hold of! I think I was paranoid
> about a swp_entry_t getting recycled prematurely: but swap_entry_free
> remains in control of that - freeing a swap entry is no part of what
> notify_free gets up to. Sorry for wasting your time.

Hey Hugh, please don't apologize.
It gave me a chance to look into that part in detail.
It never wasted my time.

And your deep insight and kind advice always make everybody happier.

Looking forward to seeing you soon at LSF/MM.
Thanks!

--
Kind regards,
Minchan Kim
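
The point made in the quoted exchange above, that calling the slot-free
notification twice for the same index is harmless apart from the
notify_free counter, can be sketched outside the kernel. This is only a
minimal user-space model under stated assumptions: toy_zram, toy_slot,
toy_free_page and toy_slot_free_notify are hypothetical names invented for
illustration and are not the zram driver's real structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_SLOTS 4

/* Hypothetical stand-in for one entry of zram->table. */
struct toy_slot {
	void *handle;	/* compressed copy, NULL when the slot is empty */
};

/* Hypothetical stand-in for the zram device state. */
struct toy_zram {
	struct toy_slot table[TOY_NR_SLOTS];
	unsigned long notify_free;	/* counts notifications, not real frees */
};

/* Free the compressed copy at index, if any. Returns 1 if memory was freed. */
static int toy_free_page(struct toy_zram *z, size_t index)
{
	void *handle = z->table[index].handle;

	if (!handle)
		return 0;	/* nothing stored: a repeated call just returns */
	free(handle);
	z->table[index].handle = NULL;
	return 1;
}

/* Model of the notify hook: the counter is bumped on every call. */
static int toy_slot_free_notify(struct toy_zram *z, size_t index)
{
	z->notify_free++;
	return toy_free_page(z, index);
}
```

In this model the first notification (as if from end_swap_bio_read) releases
the compressed copy, and the second (as if from swap_entry_free) finds an
empty slot and returns, while notify_free ends up counting both calls.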

2013-04-02 13:40:44

by Simon Jeons

Subject: Re: [RFC] mm: remove swapcache page early

Hi Hugh,
On 03/28/2013 05:41 AM, Hugh Dickins wrote:
> On Wed, 27 Mar 2013, Minchan Kim wrote:
>
>> Swap subsystem does lazy swap slot free with expecting the page
>> would be swapped out again so we can't avoid unnecessary write.
> so we can avoid unnecessary write.

If the page can be swapped out again, which code avoids the unnecessary
write? Could you point it out to me? Thanks in advance. ;-)

>> But the problem in in-memory swap is that it consumes memory space
>> until vm_swap_full(ie, used half of all of swap device) condition
>> meet. It could be bad if we use multiple swap device, small in-memory swap
>> and big storage swap or in-memory swap alone.
> That is a very good realization: it's surprising that none of us
> thought of it before - no disrespect to you, well done, thank you.
>
> And I guess swap readahead is utterly unhelpful in this case too.
>
>> This patch changes vm_swap_full logic slightly so it could free
>> swap slot early if the backed device is really fast.
>> For it, I used SWP_SOLIDSTATE but It might be controversial.
> But I strongly disagree with almost everything in your patch :)
> I disagree with addressing it in vm_swap_full(), I disagree that
> it can be addressed by device, I disagree that it has anything to
> do with SWP_SOLIDSTATE.
>
> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
> is it? In those cases, a fixed amount of memory has been set aside
> for swap, and it works out just like with disk block devices. The
> memory set aside may be wasted, but that is accepted upfront.
>
> Similarly, this is not a problem with swapping to SSD. There might
> or might not be other reasons for adjusting the vm_swap_full() logic
> for SSD or generally, but those have nothing to do with this issue.
>
> The problem here is peculiar to frontswap, and the variably sized
> memory behind it, isn't it? We are accustomed to using swap to free
> up memory by transferring its data to some other, cheaper but slower
> resource.
>
> But in the case of frontswap and zmem (I'll say that to avoid thinking
> through which backends are actually involved), it is not a cheaper and
> slower resource, but the very same memory we are trying to save: swap
> is stolen from the memory under reclaim, so any duplication becomes
> counter-productive (if we ignore cpu compression/decompression costs:
> I have no idea how fair it is to do so, but anyone who chooses zmem
> is prepared to pay some cpu price for that).
>
> And because it's a frontswap thing, we cannot decide this by device:
> frontswap may or may not stand in front of each device. There is no
> problem with swapcache duplicated on disk (until that area approaches
> being full or fragmented), but at the higher level we cannot see what
> is in zmem and what is on disk: we only want to free up the zmem dup.
>
> I believe the answer is for frontswap/zmem to invalidate the frontswap
> copy of the page (to free up the compressed memory when possible) and
> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
> (setting page dirty so nothing will later go to read it from the
> unfreed location on backing swap disk, which was never written).
>
> We cannot rely on freeing the swap itself, because in general there
> may be multiple references to the swap, and we only satisfy the one
> which has faulted. It may or may not be a good idea to use rmap to
> locate the other places to insert pte in place of swap entry, to
> resolve them all at once; but we have chosen not to do so in the
> past, and there's no need for that, if the zmem gets invalidated
> and the swapcache page set dirty.
>
> Hugh
>
>> So let's add Ccing Shaohua and Hugh.
>> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
>> or something for z* family.
>>
>> Other problem is zram is block device so that it can set SWP_INMEMORY
>> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
>> I have no idea to use it for frontswap.
>>
>> Any idea?
>>
>> Other optimize point is we remove it unconditionally when we
>> found it's exclusive when swap in happen.
>> It could help frontswap family, too.
>> What do you think about it?
>>
>> Cc: Hugh Dickins <[email protected]>
>> Cc: Dan Magenheimer <[email protected]>
>> Cc: Seth Jennings <[email protected]>
>> Cc: Nitin Gupta <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Shaohua Li <[email protected]>
>> Signed-off-by: Minchan Kim <[email protected]>
>> ---
>> include/linux/swap.h | 11 ++++++++---
>> mm/memory.c | 3 ++-
>> mm/swapfile.c | 11 +++++++----
>> mm/vmscan.c | 2 +-
>> 4 files changed, 18 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 2818a12..1f4df66 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
>> extern atomic_long_t nr_swap_pages;
>> extern long total_swap_pages;
>>
>> -/* Swap 50% full? Release swapcache more aggressively.. */
>> -static inline bool vm_swap_full(void)
>> +/*
>> + * Swap 50% full or fast backed device?
>> + * Release swapcache more aggressively.
>> + */
>> +static inline bool vm_swap_full(struct swap_info_struct *si)
>> {
>> + if (si->flags & SWP_SOLIDSTATE)
>> + return true;
>> return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
>> }
>>
>> @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>> #define get_nr_swap_pages() 0L
>> #define total_swap_pages 0L
>> #define total_swapcache_pages() 0UL
>> -#define vm_swap_full() 0
>> +#define vm_swap_full(si) 0
>>
>> #define si_swapinfo(val) \
>> do { (val)->freeswap = (val)->totalswap = 0; } while (0)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 705473a..1ca21a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>> mem_cgroup_commit_charge_swapin(page, ptr);
>>
>> swap_free(entry);
>> - if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
>> + if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page))
>> + || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
>> try_to_free_swap(page);
>> unlock_page(page);
>> if (page != swapcache) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 1bee6fa..f9cc701 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -293,7 +293,7 @@ checks:
>> scan_base = offset = si->lowest_bit;
>>
>> /* reuse swap entry of cache-only swap if not busy. */
>> - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> int swap_was_freed;
>> spin_unlock(&si->lock);
>> swap_was_freed = __try_to_reclaim_swap(si, offset);
>> @@ -382,7 +382,8 @@ scan:
>> spin_lock(&si->lock);
>> goto checks;
>> }
>> - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + if (vm_swap_full(si) &&
>> + si->swap_map[offset] == SWAP_HAS_CACHE) {
>> spin_lock(&si->lock);
>> goto checks;
>> }
>> @@ -397,7 +398,8 @@ scan:
>> spin_lock(&si->lock);
>> goto checks;
>> }
>> - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>> + if (vm_swap_full(si) &&
>> + si->swap_map[offset] == SWAP_HAS_CACHE) {
>> spin_lock(&si->lock);
>> goto checks;
>> }
>> @@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
>> * Also recheck PageSwapCache now page is locked (above).
>> */
>> if (PageSwapCache(page) && !PageWriteback(page) &&
>> - (!page_mapped(page) || vm_swap_full())) {
>> + (!page_mapped(page) ||
>> + vm_swap_full(page_swap_info(page)))) {
>> delete_from_swap_cache(page);
>> SetPageDirty(page);
>> }
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index df78d17..145c59c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -933,7 +933,7 @@ cull_mlocked:
>>
>> activate_locked:
>> /* Not a candidate for swapping, so reclaim swap space. */
>> - if (PageSwapCache(page) && vm_swap_full())
>> + if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
>> try_to_free_swap(page);
>> VM_BUG_ON(PageActive(page));
>> SetPageActive(page);
>> --
>> 1.8.2
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
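
For readers following the quoted patch, the change to vm_swap_full() can be
modeled as a pair of plain predicates. This is a sketch, not kernel code:
the toy_ names and the TOY_SWP_SOLIDSTATE value are assumptions made for
illustration, and the free and total page counts are passed as arguments
instead of being read from nr_swap_pages and total_swap_pages.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SWP_SOLIDSTATE 0x10UL	/* hypothetical flag value for this sketch */

/* Hypothetical stand-in for struct swap_info_struct. */
struct toy_swap_info {
	unsigned long flags;
};

/* Original rule: true once fewer than half of all swap pages remain free. */
static bool toy_vm_swap_full_old(long nr_free, long total)
{
	return nr_free * 2 < total;
}

/* RFC rule: a device flagged as fast always counts as "full". */
static bool toy_vm_swap_full_rfc(const struct toy_swap_info *si,
				 long nr_free, long total)
{
	if (si->flags & TOY_SWP_SOLIDSTATE)
		return true;
	return toy_vm_swap_full_old(nr_free, total);
}
```

With 900 of 1000 pages free, the old rule reports "not full" for every
device, while the RFC rule reports "full" for the flagged device only. That
per-device decision is the part Hugh objects to in the quoted reply.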

2013-04-07 07:26:25

by Simon Jeons

Subject: Re: [RFC] mm: remove swapcache page early

Ping Minchan.
On 04/02/2013 09:40 PM, Simon Jeons wrote:
> Hi Hugh,
> On 03/28/2013 05:41 AM, Hugh Dickins wrote:
>> On Wed, 27 Mar 2013, Minchan Kim wrote:
>>
>>> Swap subsystem does lazy swap slot free with expecting the page
>>> would be swapped out again so we can't avoid unnecessary write.
>> so we can avoid unnecessary write.
>
> If the page can be swapped out again, which code avoids the unnecessary
> write? Could you point it out to me? Thanks in advance. ;-)
>
>>> But the problem in in-memory swap is that it consumes memory space
>>> until vm_swap_full(ie, used half of all of swap device) condition
>>> meet. It could be bad if we use multiple swap device, small
>>> in-memory swap
>>> and big storage swap or in-memory swap alone.
>> That is a very good realization: it's surprising that none of us
>> thought of it before - no disrespect to you, well done, thank you.
>>
>> And I guess swap readahead is utterly unhelpful in this case too.
>>
>>> This patch changes vm_swap_full logic slightly so it could free
>>> swap slot early if the backed device is really fast.
>>> For it, I used SWP_SOLIDSTATE but It might be controversial.
>> But I strongly disagree with almost everything in your patch :)
>> I disagree with addressing it in vm_swap_full(), I disagree that
>> it can be addressed by device, I disagree that it has anything to
>> do with SWP_SOLIDSTATE.
>>
>> This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
>> is it? In those cases, a fixed amount of memory has been set aside
>> for swap, and it works out just like with disk block devices. The
>> memory set aside may be wasted, but that is accepted upfront.
>>
>> Similarly, this is not a problem with swapping to SSD. There might
>> or might not be other reasons for adjusting the vm_swap_full() logic
>> for SSD or generally, but those have nothing to do with this issue.
>>
>> The problem here is peculiar to frontswap, and the variably sized
>> memory behind it, isn't it? We are accustomed to using swap to free
>> up memory by transferring its data to some other, cheaper but slower
>> resource.
>>
>> But in the case of frontswap and zmem (I'll say that to avoid thinking
>> through which backends are actually involved), it is not a cheaper and
>> slower resource, but the very same memory we are trying to save: swap
>> is stolen from the memory under reclaim, so any duplication becomes
>> counter-productive (if we ignore cpu compression/decompression costs:
>> I have no idea how fair it is to do so, but anyone who chooses zmem
>> is prepared to pay some cpu price for that).
>>
>> And because it's a frontswap thing, we cannot decide this by device:
>> frontswap may or may not stand in front of each device. There is no
>> problem with swapcache duplicated on disk (until that area approaches
>> being full or fragmented), but at the higher level we cannot see what
>> is in zmem and what is on disk: we only want to free up the zmem dup.
>>
>> I believe the answer is for frontswap/zmem to invalidate the frontswap
>> copy of the page (to free up the compressed memory when possible) and
>> SetPageDirty on the PageUptodate PageSwapCache page when swapping in
>> (setting page dirty so nothing will later go to read it from the
>> unfreed location on backing swap disk, which was never written).
>>
>> We cannot rely on freeing the swap itself, because in general there
>> may be multiple references to the swap, and we only satisfy the one
>> which has faulted. It may or may not be a good idea to use rmap to
>> locate the other places to insert pte in place of swap entry, to
>> resolve them all at once; but we have chosen not to do so in the
>> past, and there's no need for that, if the zmem gets invalidated
>> and the swapcache page set dirty.
>>
>> Hugh
>>
>>> So let's add Ccing Shaohua and Hugh.
>>> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
>>> or something for z* family.
>>>
>>> Other problem is zram is block device so that it can set SWP_INMEMORY
>>> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
>>> I have no idea to use it for frontswap.
>>>
>>> Any idea?
>>>
>>> Other optimize point is we remove it unconditionally when we
>>> found it's exclusive when swap in happen.
>>> It could help frontswap family, too.
>>> What do you think about it?
>>>
>>> Cc: Hugh Dickins <[email protected]>
>>> Cc: Dan Magenheimer <[email protected]>
>>> Cc: Seth Jennings <[email protected]>
>>> Cc: Nitin Gupta <[email protected]>
>>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>>> Cc: Shaohua Li <[email protected]>
>>> Signed-off-by: Minchan Kim <[email protected]>
>>> ---
>>> include/linux/swap.h | 11 ++++++++---
>>> mm/memory.c | 3 ++-
>>> mm/swapfile.c | 11 +++++++----
>>> mm/vmscan.c | 2 +-
>>> 4 files changed, 18 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 2818a12..1f4df66 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -359,9 +359,14 @@ extern struct page
>>> *swapin_readahead(swp_entry_t, gfp_t,
>>> extern atomic_long_t nr_swap_pages;
>>> extern long total_swap_pages;
>>> -/* Swap 50% full? Release swapcache more aggressively.. */
>>> -static inline bool vm_swap_full(void)
>>> +/*
>>> + * Swap 50% full or fast backed device?
>>> + * Release swapcache more aggressively.
>>> + */
>>> +static inline bool vm_swap_full(struct swap_info_struct *si)
>>> {
>>> + if (si->flags & SWP_SOLIDSTATE)
>>> + return true;
>>> return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
>>> }
>>> @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page
>>> *page, swp_entry_t ent, bool swapout)
>>> #define get_nr_swap_pages() 0L
>>> #define total_swap_pages 0L
>>> #define total_swapcache_pages() 0UL
>>> -#define vm_swap_full() 0
>>> +#define vm_swap_full(si) 0
>>> #define si_swapinfo(val) \
>>> do { (val)->freeswap = (val)->totalswap = 0; } while (0)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 705473a..1ca21a9 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm,
>>> struct vm_area_struct *vma,
>>> mem_cgroup_commit_charge_swapin(page, ptr);
>>> swap_free(entry);
>>> - if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) ||
>>> PageMlocked(page))
>>> + if (likely(PageSwapCache(page)) &&
>>> (vm_swap_full(page_swap_info(page))
>>> + || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
>>> try_to_free_swap(page);
>>> unlock_page(page);
>>> if (page != swapcache) {
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 1bee6fa..f9cc701 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -293,7 +293,7 @@ checks:
>>> scan_base = offset = si->lowest_bit;
>>> /* reuse swap entry of cache-only swap if not busy. */
>>> - if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>>> + if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
>>> int swap_was_freed;
>>> spin_unlock(&si->lock);
>>> swap_was_freed = __try_to_reclaim_swap(si, offset);
>>> @@ -382,7 +382,8 @@ scan:
>>> spin_lock(&si->lock);
>>> goto checks;
>>> }
>>> - if (vm_swap_full() && si->swap_map[offset] ==
>>> SWAP_HAS_CACHE) {
>>> + if (vm_swap_full(si) &&
>>> + si->swap_map[offset] == SWAP_HAS_CACHE) {
>>> spin_lock(&si->lock);
>>> goto checks;
>>> }
>>> @@ -397,7 +398,8 @@ scan:
>>> spin_lock(&si->lock);
>>> goto checks;
>>> }
>>> - if (vm_swap_full() && si->swap_map[offset] ==
>>> SWAP_HAS_CACHE) {
>>> + if (vm_swap_full(si) &&
>>> + si->swap_map[offset] == SWAP_HAS_CACHE) {
>>> spin_lock(&si->lock);
>>> goto checks;
>>> }
>>> @@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
>>> * Also recheck PageSwapCache now page is locked (above).
>>> */
>>> if (PageSwapCache(page) && !PageWriteback(page) &&
>>> - (!page_mapped(page) || vm_swap_full())) {
>>> + (!page_mapped(page) ||
>>> + vm_swap_full(page_swap_info(page)))) {
>>> delete_from_swap_cache(page);
>>> SetPageDirty(page);
>>> }
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index df78d17..145c59c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -933,7 +933,7 @@ cull_mlocked:
>>> activate_locked:
>>> /* Not a candidate for swapping, so reclaim swap space. */
>>> - if (PageSwapCache(page) && vm_swap_full())
>>> + if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
>>> try_to_free_swap(page);
>>> VM_BUG_ON(PageActive(page));
>>> SetPageActive(page);
>>> --
>>> 1.8.2
>
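
Hugh's suggestion in the quoted text above (on swap-in, invalidate the
frontswap copy and mark the swapcache page dirty, while leaving the swap
entry itself alone) can be sketched as a small user-space model. Everything
here is an assumption for illustration: toy_swap_entry, toy_cache_page and
toy_swapin are hypothetical names, and the model ignores locking and the
real frontswap interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one swap entry backed by compressed memory. */
struct toy_swap_entry {
	int refs;		/* ptes still pointing at this swap entry */
	size_t zmem_bytes;	/* compressed copy held by the backend */
};

/* Hypothetical model of the swapcache page filled by the fault. */
struct toy_cache_page {
	bool uptodate;
	bool dirty;
};

/* Swap in for one faulting reference. Returns the bytes of zmem released. */
static size_t toy_swapin(struct toy_swap_entry *e, struct toy_cache_page *p)
{
	size_t freed = 0;

	p->uptodate = true;		/* data has been decompressed into the page */
	if (e->zmem_bytes > 0) {
		freed = e->zmem_bytes;
		e->zmem_bytes = 0;	/* invalidate the compressed duplicate */
		p->dirty = true;	/* the disk slot was never written, so the
					 * page must be written on a later swap-out */
	}
	e->refs--;			/* only the faulting reference is resolved */
	return freed;
}
```

The entry keeps its remaining references, so no rmap walk is needed; the
compressed duplicate is gone and the dirty bit keeps anything from reading
the never-written slot on the backing disk.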

2013-04-08 01:48:49

by Minchan Kim

Subject: Re: [RFC] mm: remove swapcache page early

Hello Simon,

On Sun, Apr 07, 2013 at 03:26:12PM +0800, Simon Jeons wrote:
> Ping Minchan.
> On 04/02/2013 09:40 PM, Simon Jeons wrote:
> >Hi Hugh,
> >On 03/28/2013 05:41 AM, Hugh Dickins wrote:
> >>On Wed, 27 Mar 2013, Minchan Kim wrote:
> >>
> >>>Swap subsystem does lazy swap slot free with expecting the page
> >>>would be swapped out again so we can't avoid unnecessary write.
> >> so we can avoid unnecessary write.
> >
> >If the page can be swapped out again, which code avoids the unnecessary
> >write? Could you point it out to me? Thanks in advance. ;-)

Look at shrink_page_list.

1) PageAnon(page) && !PageSwapCache()
2) add_to_swap's SetPageDirty
3) __remove_mapping
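
As a rough illustration of those three steps, here is a toy user-space
model of why keeping the swap slot lets a clean page skip the write on its
next swap-out. It is a sketch under stated assumptions, not the logic of
shrink_page_list itself: toy_page, toy_reclaim and toy_swapin are
hypothetical names, and real reclaim has many more cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an anonymous page. */
struct toy_page {
	bool in_swapcache;	/* still owns a swap slot with valid contents */
	bool dirty;		/* memory copy differs from the swap copy */
};

/* Reclaim the page. Returns the number of writes issued (0 or 1). */
static int toy_reclaim(struct toy_page *p)
{
	int writes = 0;

	if (!p->in_swapcache) {		/* step 1: anon page without a slot */
		p->in_swapcache = true;
		p->dirty = true;	/* step 2: the new slot holds no data yet */
	}
	if (p->dirty) {
		writes = 1;		/* pageout to the swap device */
		p->dirty = false;
	}
	/* step 3: the page is clean and can be dropped from memory */
	return writes;
}

/* Swap the page back in, optionally freeing its slot right away. */
static void toy_swapin(struct toy_page *p, bool free_slot_early)
{
	p->dirty = false;		/* memory and swap copies match */
	if (free_slot_early)
		p->in_swapcache = false;
}
```

In this model a page that keeps its slot and is not modified is reclaimed
the second time with zero writes, whereas a page whose slot was freed at
swap-in has to be written out again.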

P.S.
It seems there is a misunderstanding. This list isn't the proper place
to ask questions just to understand the code. As far as I know, there are
projects (e.g., kernelnewbies) and books for studying and sharing
knowledge of the Linux kernel.

I recommend Mel's "Understanding the Linux Virtual Memory Manager".
It's rather outdated but will be very helpful for understanding the VM of
the Linux kernel. You can get it for free, but I hope you pay for it.
If the author becomes a billionaire because it is picked as a best book on
Amazon, he might publish a second edition that covers all the new VM
features and satisfies all of your curiosity.

That would be another way to contribute to an open source project. :)

I believe talented developers like you can catch up by reading the
code thoroughly and will find bonus knowledge along the way. I think that's
why our senior developers yell RTFM, and I follow them.

Cheers!


--
Kind regards,
Minchan Kim

2013-04-08 01:51:43

by Simon Jeons

Subject: Re: [RFC] mm: remove swapcache page early

On 04/08/2013 09:48 AM, Minchan Kim wrote:
> Hello Simon,
>
> On Sun, Apr 07, 2013 at 03:26:12PM +0800, Simon Jeons wrote:
>> Ping Minchan.
>> On 04/02/2013 09:40 PM, Simon Jeons wrote:
>>> Hi Hugh,
>>> On 03/28/2013 05:41 AM, Hugh Dickins wrote:
>>>> On Wed, 27 Mar 2013, Minchan Kim wrote:
>>>>
>>>>> Swap subsystem does lazy swap slot free with expecting the page
>>>>> would be swapped out again so we can't avoid unnecessary write.
>>>> so we can avoid unnecessary write.
>>> If the page can be swapped out again, which code avoids the unnecessary
>>> write? Could you point it out to me? Thanks in advance. ;-)
> Look at shrink_page_list.
>
> 1) PageAnon(page) && !PageSwapCache()
> 2) add_to_swap's SetPageDirty
> 3) __remove_mapping
>
> P.S.
> It seems there is a misunderstanding. This list isn't the proper place
> to ask questions just to understand the code. As far as I know, there are
> projects (e.g., kernelnewbies) and books for studying and sharing
> knowledge of the Linux kernel.
>
> I recommend Mel's "Understanding the Linux Virtual Memory Manager".
> It's rather outdated but will be very helpful for understanding the VM of
> the Linux kernel. You can get it for free, but I hope you pay for it.
> If the author becomes a billionaire because it is picked as a best book on
> Amazon, he might publish a second edition that covers all the new VM
> features and satisfies all of your curiosity.
>
> That would be another way to contribute to an open source project. :)
>
> I believe talented developers like you can catch up by reading the
> code thoroughly and will find bonus knowledge along the way. I think that's
> why our senior developers yell RTFM, and I follow them.

What's the meaning of RTFM?

>
> Cheers!
>
>