2009-04-21 07:23:17

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH] fix swap entries not being reclaimed properly by the mem+swap controller

Maybe this patch covers almost all cases of the memcg swap leak that
Nishimura-san reported. There are many callers of lock_page(), but not many
of them can race with free_swap_and_cache(), I think.

Nishimura-san, how about this?

==
From: KAMEZAWA Hiroyuki <[email protected]>

free_swap_and_cache(), which is called under various spin_locks,
is designed as a best-effort function: it does nothing if the page is locked
or under I/O. That is fine because the global LRU will find the page eventually.

But when it comes to the memory+swap cgroup, the global LRU may not help,
and a swap entry can stay in a "not used but not freed" state for a very long
time, because the memory cgroup's LRU routine scans only its own LRU.
(Such a stale swap cache is outside the memcg's LRU.)

Nishimura reported that this kind of swap cache makes memcg unstable:
- we can never free the mem_cgroup object (....it's big) even if it has no users.
- the OOM killer can trigger because the amount charged to swap leaks.

This patch tries to fix the problem by adding a PageCgroupStale() flag.
If a page being zapped is a busy swap cache, record it as Stale.
At the end of swap I/O, the Stale flag is checked, and the swap entry and
swap cache are freed if necessary. The page migration case is also checked.


Reported-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

---
include/linux/page_cgroup.h | 4 +++
include/linux/swap.h | 8 ++++++
mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++----
mm/page_io.c | 2 +
mm/swapfile.c | 1
5 files changed, 62 insertions(+), 4 deletions(-)

Index: linux-2.6.30-rc2/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h
+++ linux-2.6.30-rc2/include/linux/page_cgroup.h
@@ -26,6 +26,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_STALE, /* may be a stale swap-cache */
};

#define TESTPCGFLAG(uname, lname) \
@@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)

+TESTPCGFLAG(Stale, STALE)
+SETPCGFLAG(Stale, STALE)
+
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
Index: linux-2.6.30-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.30-rc2.orig/include/linux/swap.h
+++ linux-2.6.30-rc2/include/linux/swap.h
@@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_mark_swapcache_stale(struct page *page);
+extern void mem_cgroup_fixup_swapcache(struct page *page);
#else
static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
{
}
+static inline void mem_cgroup_mark_swapcache_stale(struct page *page)
+{
+}
+static inline void mem_cgroup_fixup_swapcache(struct page *page)
+{
+}
#endif

#else /* CONFIG_SWAP */
Index: linux-2.6.30-rc2/mm/memcontrol.c
===================================================================
--- linux-2.6.30-rc2.orig/mm/memcontrol.c
+++ linux-2.6.30-rc2/mm/memcontrol.c
@@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_
}
rcu_read_unlock();
}
+
+/*
+ * free_swap_and_cache() is a best-effort function and it doesn't free
+ * the swap entry if the swap cache seems to be busy (e.g. the page is locked).
+ * That behavior is by design, but the mem+swap cgroup has to handle it;
+ * otherwise, the swp_entry is leaked (for a very long time).
+ */
+void mem_cgroup_mark_swapcache_stale(struct page *page)
+{
+ struct page_cgroup *pc;
+
+ if (!PageSwapCache(page) || page_mapped(page))
+ return;
+
+ pc = lookup_page_cgroup(page);
+ lock_page_cgroup(pc);
+
+ /*
+ * This "Stale" flag will be cleared when the page is reused
+ * somewhere.
+ */
+ if (!PageCgroupUsed(pc))
+ SetPageCgroupStale(pc);
+ unlock_page_cgroup(pc);
+}
+
+void mem_cgroup_fixup_swapcache(struct page *page)
+{
+ struct page_cgroup *pc = lookup_page_cgroup(page);
+
+ /* Stale flag will be cleared automatically */
+ if (unlikely(PageCgroupStale(pc))) {
+ if (get_page_unless_zero(page)) {
+ lock_page(page);
+ try_to_free_swap(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ }
+}
+
#endif

/*
@@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem
__mem_cgroup_commit_charge(mem, pc, ctype);

/*
- * Both of oldpage and newpage are still under lock_page().
- * Then, we don't have to care about race in radix-tree.
- * But we have to be careful that this page is unmapped or not.
+ * oldpage is under lock_page() at migration. Then, we don't have to
+ * care about race in radix-tree. But we have to be careful
+ * that this page is unmapped or not.
*
* There is a case for !page_mapped(). At the start of
* migration, oldpage was mapped. But now, it's zapped.
* But we know *target* page is not freed/reused under us.
* mem_cgroup_uncharge_page() does all necessary checks.
*/
- if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
+ if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) {
mem_cgroup_uncharge_page(target);
+ mem_cgroup_fixup_swapcache(target);
+ }
}

/*
Index: linux-2.6.30-rc2/mm/page_io.c
===================================================================
--- linux-2.6.30-rc2.orig/mm/page_io.c
+++ linux-2.6.30-rc2/mm/page_io.c
@@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi
}
end_page_writeback(page);
bio_put(bio);
+ mem_cgroup_fixup_swapcache(page);
}

void end_swap_bio_read(struct bio *bio, int err)
@@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio,
}
unlock_page(page);
bio_put(bio);
+ mem_cgroup_fixup_swapcache(page);
}

/*
Index: linux-2.6.30-rc2/mm/swapfile.c
===================================================================
--- linux-2.6.30-rc2.orig/mm/swapfile.c
+++ linux-2.6.30-rc2/mm/swapfile.c
@@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr
if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
+ mem_cgroup_mark_swapcache_stale(page);
page_cache_release(page);
page = NULL;
}


2009-04-22 05:41:05

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] fix swap entries not being reclaimed properly by the mem+swap controller

On Tue, 21 Apr 2009 16:21:21 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Maybe this patch covers almost all cases of the memcg swap leak that
> Nishimura-san reported. There are many callers of lock_page(), but not many
> of them can race with free_swap_and_cache(), I think.
>
> Nishimura-san, how about this?
>
Thank you for your patch.

I ran this patch last night, but unfortunately I got a BUG.

BUG: sleeping function called from invalid context at include/linux/pagemap.h:327
in_atomic(): 1, irqs_disabled(): 0, pid: 9230, name: page01

Hmm, calling lock_page() in end_swap_bio_*() does not seem safe.

And, some comments are inlined.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> free_swap_and_cache(), which is called under various spin_locks,
> is designed as a best-effort function: it does nothing if the page is locked
> or under I/O. That is fine because the global LRU will find the page eventually.
>
> But when it comes to the memory+swap cgroup, the global LRU may not help,
> and a swap entry can stay in a "not used but not freed" state for a very long
> time, because the memory cgroup's LRU routine scans only its own LRU.
> (Such a stale swap cache is outside the memcg's LRU.)
>
> Nishimura reported that this kind of swap cache makes memcg unstable:
> - we can never free the mem_cgroup object (....it's big) even if it has no users.
> - the OOM killer can trigger because the amount charged to swap leaks.
>
And

- All of the swap space (swap entries) could be exhausted by this swap cache.

> This patch tries to fix the problem by adding a PageCgroupStale() flag.
> If a page being zapped is a busy swap cache, record it as Stale.
> At the end of swap I/O, the Stale flag is checked, and the swap entry and
> swap cache are freed if necessary. The page migration case is also checked.
>
>
> Reported-by: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> ---
> include/linux/page_cgroup.h | 4 +++
> include/linux/swap.h | 8 ++++++
> mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++----
> mm/page_io.c | 2 +
> mm/swapfile.c | 1
> 5 files changed, 62 insertions(+), 4 deletions(-)
>
> Index: linux-2.6.30-rc2/include/linux/page_cgroup.h
> ===================================================================
> --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h
> +++ linux-2.6.30-rc2/include/linux/page_cgroup.h
> @@ -26,6 +26,7 @@ enum {
> PCG_LOCK, /* page cgroup is locked */
> PCG_CACHE, /* charged as cache */
> PCG_USED, /* this object is in use. */
> + PCG_STALE, /* may be a stale swap-cache */
> };
>
> #define TESTPCGFLAG(uname, lname) \
> @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE)
> TESTPCGFLAG(Used, USED)
> CLEARPCGFLAG(Used, USED)
>
> +TESTPCGFLAG(Stale, STALE)
> +SETPCGFLAG(Stale, STALE)
> +
> static inline int page_cgroup_nid(struct page_cgroup *pc)
> {
> return page_to_nid(pc->page);
> Index: linux-2.6.30-rc2/include/linux/swap.h
> ===================================================================
> --- linux-2.6.30-rc2.orig/include/linux/swap.h
> +++ linux-2.6.30-rc2/include/linux/swap.h
> @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag
> #endif
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> +extern void mem_cgroup_mark_swapcache_stale(struct page *page);
> +extern void mem_cgroup_fixup_swapcache(struct page *page);
> #else
> static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> {
> }
> +static inline void mem_cgroup_mark_swapcache_stale(struct page *page)
> +{
> +}
> +static inline void mem_cgroup_fixup_swapcache(struct page *page)
> +{
> +}
> #endif
>
I think they should be defined in the MEM_RES_CTLR case.
The problem of exhausting swap entries does not depend on MEM_RES_CTLR_SWAP.

> #else /* CONFIG_SWAP */
> Index: linux-2.6.30-rc2/mm/memcontrol.c
> ===================================================================
> --- linux-2.6.30-rc2.orig/mm/memcontrol.c
> +++ linux-2.6.30-rc2/mm/memcontrol.c
> @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_
> }
> rcu_read_unlock();
> }
> +
> +/*
> + * free_swap_and_cache() is a best-effort function and it doesn't free
> + * the swap entry if the swap cache seems to be busy (e.g. the page is locked).
> + * That behavior is by design, but the mem+swap cgroup has to handle it;
> + * otherwise, the swp_entry is leaked (for a very long time).
> + */
> +void mem_cgroup_mark_swapcache_stale(struct page *page)
> +{
> + struct page_cgroup *pc;
> +
> + if (!PageSwapCache(page) || page_mapped(page))
> + return;
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> +
> + /*
> + * This "Stale" flag will be cleared when the page is reused
> + * somewhere.
> + */
> + if (!PageCgroupUsed(pc))
> + SetPageCgroupStale(pc);
> + unlock_page_cgroup(pc);
> +}
> +
> +void mem_cgroup_fixup_swapcache(struct page *page)
> +{
> + struct page_cgroup *pc = lookup_page_cgroup(page);
> +
> + /* Stale flag will be cleared automatically */
> + if (unlikely(PageCgroupStale(pc))) {
> + if (get_page_unless_zero(page)) {
> + lock_page(page);
> + try_to_free_swap(page);
> + unlock_page(page);
> + page_cache_release(page);
> + }
> + }
> +}
> +
> #endif
>
> /*
> @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem
> __mem_cgroup_commit_charge(mem, pc, ctype);
>
> /*
> - * Both of oldpage and newpage are still under lock_page().
> - * Then, we don't have to care about race in radix-tree.
> - * But we have to be careful that this page is unmapped or not.
> + * oldpage is under lock_page() at migration. Then, we don't have to
> + * care about race in radix-tree. But we have to be careful
> + * that this page is unmapped or not.
> *
> * There is a case for !page_mapped(). At the start of
> * migration, oldpage was mapped. But now, it's zapped.
> * But we know *target* page is not freed/reused under us.
> * mem_cgroup_uncharge_page() does all necessary checks.
> */
> - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) {
> mem_cgroup_uncharge_page(target);
> + mem_cgroup_fixup_swapcache(target);
> + }
> }
>
> /*
Hmm, the intention of the patch I posted(*) is not to free stale SwapCache.

* http://marc.info/?l=linux-mm&m=124029196025611&w=2

If the oldpage was page_mapped() and !PageSwapCache, the newpage would be
uncharged by mem_cgroup_uncharge_page() if the owner process has exited
(because the newpage is !page_mapped()).
But if the oldpage was page_mapped() and PageSwapCache, the newpage cannot be
uncharged by mem_cgroup_uncharge_page() even if the owner process has exited.

Such SwapCache is put back onto the memcg's LRU, so it can be reclaimed
when the memcg's (anon) LRU scanning runs, but my intention was to fix this
behavior for consistency.

> Index: linux-2.6.30-rc2/mm/page_io.c
> ===================================================================
> --- linux-2.6.30-rc2.orig/mm/page_io.c
> +++ linux-2.6.30-rc2/mm/page_io.c
> @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi
> }
> end_page_writeback(page);
> bio_put(bio);
> + mem_cgroup_fixup_swapcache(page);
> }
>
> void end_swap_bio_read(struct bio *bio, int err)
> @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio,
> }
> unlock_page(page);
> bio_put(bio);
> + mem_cgroup_fixup_swapcache(page);
> }
>
> /*
> Index: linux-2.6.30-rc2/mm/swapfile.c
> ===================================================================
> --- linux-2.6.30-rc2.orig/mm/swapfile.c
> +++ linux-2.6.30-rc2/mm/swapfile.c
> @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr
> if (swap_entry_free(p, entry) == 1) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && !trylock_page(page)) {
> + mem_cgroup_mark_swapcache_stale(page);
> page_cache_release(page);
> page = NULL;
> }
>

IIUC, this patch cannot handle (even if it worked as intended) cases like:

processA | processB
-------------------------------------+-------------------------------------
(free_swap_and_cache()) | (read_swap_cache_async())
| swap_duplicate()
swap_entry_free() == 1 |
find_get_page() |
-> cannot find. so |
PCG_STALE isn't set. |
| add_to_swap_cache()
|
| swap_readpage()
| end_swap_bio_read()
| mem_cgroup_fixup_swapcache()
| does nothing because !PageCgroupStale

(the page is mapped but not in the swap cache)
processA | processB
-------------------------------------+-------------------------------------
(zap_pte_range()) | (shrink_page_list())
| (referenced flag is set for some reason)
page_remove_rmap() |
-> uncharged(!PageSwapCache) |
| add_to_swap()
| -> succeed
|
| (not freed because referenced flag is set)

So, we should add a check to shrink_page_list() anyway, IMHO.


Hmm... I tested a patch like the one below, but it seems to have a deadlock
on swap_lock. (This patch still has unhandled corner cases, even if it worked.)

I'll dig and try more, including another approach.

---
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62d8143..991dd53 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,11 +336,16 @@ static inline void disable_swap_token(void)

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void mem_cgroup_fixup_swapcache(struct page *page);
#else
static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
}
+static inline void
+mem_cgroup_fixup_swapcache(struct page *page)
+{
+}
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba07cab..b2a4d52 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1568,6 +1568,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
css_put(&memcg->css);
}

+void mem_cgroup_fixup_swapcache(struct page *page)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ VM_BUG_ON(!PageLocked(page));
+
+ if (get_page_unless_zero(page)) {
+ try_to_free_swap(page);
+ page_cache_release(page);
+ }
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
/*
* called from swap_entry_free(). remove record in swap_cgroup and
diff --git a/mm/page_io.c b/mm/page_io.c
index 3023c47..2bafcd3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -85,6 +85,7 @@ void end_swap_bio_read(struct bio *bio, int err)
} else {
SetPageUptodate(page);
}
+ mem_cgroup_fixup_swapcache(page);
unlock_page(page);
bio_put(bio);
}

2009-04-22 06:12:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] fix swap entries not being reclaimed properly by the mem+swap controller

On Wed, 22 Apr 2009 14:38:33 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 21 Apr 2009 16:21:21 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Maybe this patch covers almost all cases of the memcg swap leak that
> > Nishimura-san reported. There are many callers of lock_page(), but not many
> > of them can race with free_swap_and_cache(), I think.
> >
> > Nishimura-san, how about this?
> >
> Thank you for your patch.
>
> I ran this patch last night, but unfortunately I got a BUG.
>
> BUG: sleeping function called from invalid context at include/linux/pagemap.h:327
> in_atomic(): 1, irqs_disabled(): 0, pid: 9230, name: page01
>
> Hmm, calling lock_page() in end_swap_bio_*() does not seem safe.
>
Ugh. OK, calling lock_page() is bad.


> And, some comments are inlined.
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > free_swap_and_cache(), which is called under various spin_locks,
> > is designed as a best-effort function: it does nothing if the page is locked
> > or under I/O. That is fine because the global LRU will find the page eventually.
> >
> > But when it comes to the memory+swap cgroup, the global LRU may not help,
> > and a swap entry can stay in a "not used but not freed" state for a very long
> > time, because the memory cgroup's LRU routine scans only its own LRU.
> > (Such a stale swap cache is outside the memcg's LRU.)
> >
> > Nishimura reported that this kind of swap cache makes memcg unstable:
> > - we can never free the mem_cgroup object (....it's big) even if it has no users.
> > - the OOM killer can trigger because the amount charged to swap leaks.
> >
> And
>
> - All of the swap space (swap entries) could be exhausted by this swap cache.
>
Hmm, maybe.


> > This patch tries to fix the problem by adding a PageCgroupStale() flag.
> > If a page being zapped is a busy swap cache, record it as Stale.
> > At the end of swap I/O, the Stale flag is checked, and the swap entry and
> > swap cache are freed if necessary. The page migration case is also checked.
> >
> >
> > Reported-by: Daisuke Nishimura <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> >
> > ---
> > include/linux/page_cgroup.h | 4 +++
> > include/linux/swap.h | 8 ++++++
> > mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++----
> > mm/page_io.c | 2 +
> > mm/swapfile.c | 1
> > 5 files changed, 62 insertions(+), 4 deletions(-)
> >
> > Index: linux-2.6.30-rc2/include/linux/page_cgroup.h
> > ===================================================================
> > --- linux-2.6.30-rc2.orig/include/linux/page_cgroup.h
> > +++ linux-2.6.30-rc2/include/linux/page_cgroup.h
> > @@ -26,6 +26,7 @@ enum {
> > PCG_LOCK, /* page cgroup is locked */
> > PCG_CACHE, /* charged as cache */
> > PCG_USED, /* this object is in use. */
> > + PCG_STALE, /* may be a stale swap-cache */
> > };
> >
> > #define TESTPCGFLAG(uname, lname) \
> > @@ -46,6 +47,9 @@ TESTPCGFLAG(Cache, CACHE)
> > TESTPCGFLAG(Used, USED)
> > CLEARPCGFLAG(Used, USED)
> >
> > +TESTPCGFLAG(Stale, STALE)
> > +SETPCGFLAG(Stale, STALE)
> > +
> > static inline int page_cgroup_nid(struct page_cgroup *pc)
> > {
> > return page_to_nid(pc->page);
> > Index: linux-2.6.30-rc2/include/linux/swap.h
> > ===================================================================
> > --- linux-2.6.30-rc2.orig/include/linux/swap.h
> > +++ linux-2.6.30-rc2/include/linux/swap.h
> > @@ -344,10 +344,18 @@ mem_cgroup_uncharge_swapcache(struct pag
> > #endif
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > +extern void mem_cgroup_mark_swapcache_stale(struct page *page);
> > +extern void mem_cgroup_fixup_swapcache(struct page *page);
> > #else
> > static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> > {
> > }
> > +static inline void mem_cgroup_mark_swapcache_stale(struct page *page)
> > +{
> > +}
> > +static inline void mem_cgroup_fixup_swapcache(struct page *page)
> > +{
> > +}
> > #endif
> >
> I think they should be defined in the MEM_RES_CTLR case.
> The problem of exhausting swap entries does not depend on MEM_RES_CTLR_SWAP.
>
Hmm, ok.


> > #else /* CONFIG_SWAP */
> > Index: linux-2.6.30-rc2/mm/memcontrol.c
> > ===================================================================
> > --- linux-2.6.30-rc2.orig/mm/memcontrol.c
> > +++ linux-2.6.30-rc2/mm/memcontrol.c
> > @@ -1534,6 +1534,47 @@ void mem_cgroup_uncharge_swap(swp_entry_
> > }
> > rcu_read_unlock();
> > }
> > +
> > +/*
> > + * free_swap_and_cache() is a best-effort function and it doesn't free
> > + * the swap entry if the swap cache seems to be busy (e.g. the page is locked).
> > + * That behavior is by design, but the mem+swap cgroup has to handle it;
> > + * otherwise, the swp_entry is leaked (for a very long time).
> > + */
> > +void mem_cgroup_mark_swapcache_stale(struct page *page)
> > +{
> > + struct page_cgroup *pc;
> > +
> > + if (!PageSwapCache(page) || page_mapped(page))
> > + return;
> > +
> > + pc = lookup_page_cgroup(page);
> > + lock_page_cgroup(pc);
> > +
> > + /*
> > + * This "Stale" flag will be cleared when the page is reused
> > + * somewhere.
> > + */
> > + if (!PageCgroupUsed(pc))
> > + SetPageCgroupStale(pc);
> > + unlock_page_cgroup(pc);
> > +}
> > +
> > +void mem_cgroup_fixup_swapcache(struct page *page)
> > +{
> > + struct page_cgroup *pc = lookup_page_cgroup(page);
> > +
> > + /* Stale flag will be cleared automatically */
> > + if (unlikely(PageCgroupStale(pc))) {
> > + if (get_page_unless_zero(page)) {
> > + lock_page(page);
> > + try_to_free_swap(page);
> > + unlock_page(page);
> > + page_cache_release(page);
> > + }
> > + }
> > +}
> > +
> > #endif
> >
> > /*
> > @@ -1604,17 +1645,19 @@ void mem_cgroup_end_migration(struct mem
> > __mem_cgroup_commit_charge(mem, pc, ctype);
> >
> > /*
> > - * Both of oldpage and newpage are still under lock_page().
> > - * Then, we don't have to care about race in radix-tree.
> > - * But we have to be careful that this page is unmapped or not.
> > + * oldpage is under lock_page() at migration. Then, we don't have to
> > + * care about race in radix-tree. But we have to be careful
> > + * that this page is unmapped or not.
> > *
> > * There is a case for !page_mapped(). At the start of
> > * migration, oldpage was mapped. But now, it's zapped.
> > * But we know *target* page is not freed/reused under us.
> > * mem_cgroup_uncharge_page() does all necessary checks.
> > */
> > - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
> > + if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) {
> > mem_cgroup_uncharge_page(target);
> > + mem_cgroup_fixup_swapcache(target);
> > + }
> > }
> >
> > /*
> Hmm, the intention of the patch I posted(*) is not to free stale SwapCache.
>
> * http://marc.info/?l=linux-mm&m=124029196025611&w=2
>
> If the oldpage was page_mapped() and !PageSwapCache, the newpage would be
> uncharged by mem_cgroup_uncharge_page() if the owner process has exited
> (because the newpage is !page_mapped()).
> But if the oldpage was page_mapped() and PageSwapCache, the newpage cannot be
> uncharged by mem_cgroup_uncharge_page() even if the owner process has exited.
>
> Such SwapCache is put back onto the memcg's LRU, so it can be reclaimed
> when the memcg's (anon) LRU scanning runs, but my intention was to fix this
> behavior for consistency.
>
Hmm, but please remove the reference to lock_page() on newpage (i.e. please fix the comment in memcontrol.c).

> > Index: linux-2.6.30-rc2/mm/page_io.c
> > ===================================================================
> > --- linux-2.6.30-rc2.orig/mm/page_io.c
> > +++ linux-2.6.30-rc2/mm/page_io.c
> > @@ -68,6 +68,7 @@ static void end_swap_bio_write(struct bi
> > }
> > end_page_writeback(page);
> > bio_put(bio);
> > + mem_cgroup_fixup_swapcache(page);
> > }
> >
> > void end_swap_bio_read(struct bio *bio, int err)
> > @@ -87,6 +88,7 @@ void end_swap_bio_read(struct bio *bio,
> > }
> > unlock_page(page);
> > bio_put(bio);
> > + mem_cgroup_fixup_swapcache(page);
> > }
> >
> > /*
> > Index: linux-2.6.30-rc2/mm/swapfile.c
> > ===================================================================
> > --- linux-2.6.30-rc2.orig/mm/swapfile.c
> > +++ linux-2.6.30-rc2/mm/swapfile.c
> > @@ -587,6 +587,7 @@ int free_swap_and_cache(swp_entry_t entr
> > if (swap_entry_free(p, entry) == 1) {
> > page = find_get_page(&swapper_space, entry.val);
> > if (page && !trylock_page(page)) {
> > + mem_cgroup_mark_swapcache_stale(page);
> > page_cache_release(page);
> > page = NULL;
> > }
> >
>
> IIUC, this patch cannot handle (even if it worked as intended) cases like:
>
> processA | processB
> -------------------------------------+-------------------------------------
> (free_swap_and_cache()) | (read_swap_cache_async())
> | swap_duplicate()
> swap_entry_free() == 1 |
> find_get_page() |
> -> cannot find. so |
> PCG_STALE isn't set. |
> | add_to_swap_cache()
> |
> | swap_readpage()
> | end_swap_bio_read()
> | mem_cgroup_fixup_swapcache()
> | does nothing because !PageCgroupStale
>
> (the page is mapped but not in the swap cache)
> processA | processB
> -------------------------------------+-------------------------------------
> (zap_pte_range()) | (shrink_page_list())
> | (referenced flag is set for some reason)
> page_remove_rmap() |
> -> uncharged(!PageSwapCache) |
> | add_to_swap()
> | -> succeed
> |
> | (not freed because referenced flag is set)
>
> So, we should add a check to shrink_page_list() anyway, IMHO.
>
Maybe I should forget this patch ;)

>
> Hmm... I tested a patch like the one below, but it seems to have a deadlock
> on swap_lock. (This patch still has unhandled corner cases, even if it worked.)
>
> I'll dig and try more, including another approach.
>
> ---
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 62d8143..991dd53 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -336,11 +336,16 @@ static inline void disable_swap_token(void)
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +extern void mem_cgroup_fixup_swapcache(struct page *page);
> #else
> static inline void
> mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> {
> }
> +static inline void
> +mem_cgroup_fixup_swapcache(struct page *page)
> +{
> +}
> #endif
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba07cab..b2a4d52 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1568,6 +1568,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> css_put(&memcg->css);
> }
>
> +void mem_cgroup_fixup_swapcache(struct page *page)
> +{
> + if (mem_cgroup_disabled())
> + return;
> +
> + VM_BUG_ON(!PageLocked(page));
> +
> + if (get_page_unless_zero(page)) {
> + try_to_free_swap(page);
> + page_cache_release(page);
> + }
> +}
Because the page is locked, page_cache_release() is unnecessary.

> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> /*
> * called from swap_entry_free(). remove record in swap_cgroup and
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 3023c47..2bafcd3 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -85,6 +85,7 @@ void end_swap_bio_read(struct bio *bio, int err)
> } else {
> SetPageUptodate(page);
> }
> + mem_cgroup_fixup_swapcache(page);

try_to_free_swap() at every end_swap_bio_read()? We'll get tons of NACKs.

Thanks,
-Kame

> unlock_page(page);
> bio_put(bio);
> }
>
>

2009-04-23 04:19:24

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] fix swap entries not being reclaimed properly by the mem+swap controller

> I'll dig and try more, including another approach.
>
How about this patch?

It has been working fine for several hours.
I still need to add more comments and clean it up, of course :)
(I think it would be better to unify the definitions of the new functions into
swapfile.c, and checking page_mapped() might be enough for
mem_cgroup_free_unused_swapcache().)

Signed-off-by: Daisuke Nishimura <[email protected]>
---
include/linux/memcontrol.h | 5 +++
include/linux/swap.h | 11 ++++++++
mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 8 +++++
mm/swapfile.c | 32 ++++++++++++++++++++++-
mm/vmscan.c | 8 +++++
6 files changed, 125 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 25b9ca9..8b674c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -101,6 +101,7 @@ struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone);
struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
+extern void mem_cgroup_free_unused_swapcache(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

@@ -259,6 +260,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
return NULL;
}

+static inline void mem_cgroup_free_unused_swapcache(struct page *page)
+{
+}
+
static inline void
mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62d8143..cdfa982 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,11 +336,22 @@ static inline void disable_swap_token(void)

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern int mem_cgroup_fixup_swapin(struct page *page);
+extern void mem_cgroup_fixup_swapfree(struct page *page);
#else
static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
}
+static inline int
+mem_cgroup_fixup_swapin(struct page *page)
+{
+ return 0;
+}
+static inline void
+mem_cgroup_fixup_swapfree(struct page *page)
+{
+}
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79c32b8..f90967b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1536,6 +1536,68 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
}
#endif

+struct mem_cgroup_swap_fixup_work {
+ struct work_struct work;
+ struct page *page;
+};
+
+static void mem_cgroup_fixup_swapfree_cb(struct work_struct *work)
+{
+ struct mem_cgroup_swap_fixup_work *my_work;
+ struct page *page;
+
+ my_work = container_of(work, struct mem_cgroup_swap_fixup_work, work);
+ page = my_work->page;
+
+ lock_page(page);
+ if (PageSwapCache(page))
+ mem_cgroup_free_unused_swapcache(page);
+ unlock_page(page);
+
+ kfree(my_work);
+ put_page(page);
+}
+
+void mem_cgroup_fixup_swapfree(struct page *page)
+{
+ struct mem_cgroup_swap_fixup_work *my_work;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!PageSwapCache(page) || page_mapped(page))
+ return;
+
+ my_work = kmalloc(sizeof(*my_work), GFP_ATOMIC); /* cannot sleep */
+ if (my_work) {
+ get_page(page); /* put_page will be called in callback */
+ my_work->page = page;
+ INIT_WORK(&my_work->work, mem_cgroup_fixup_swapfree_cb);
+ schedule_work(&my_work->work);
+ }
+
+ return;
+}
+
+/*
+ * called from shrink_page_list() and mem_cgroup_fixup_swapfree_cb() to free
+ * !PageCgroupUsed SwapCache, because memcg cannot handle such SwapCache pages well.
+ */
+void mem_cgroup_free_unused_swapcache(struct page *page)
+{
+ struct page_cgroup *pc;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(!PageSwapCache(page));
+
+ pc = lookup_page_cgroup(page);
+ /*
+ * Used bit of swapcache is solid under page lock.
+ */
+ if (!PageCgroupUsed(pc))
+ try_to_free_swap(page);
+}
+
/*
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..57d9678 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -310,6 +310,14 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
SetPageSwapBacked(new_page);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
+ if (unlikely(mem_cgroup_fixup_swapin(new_page)))
+ /*
+ * new_page is not used by anyone.
+ * And it has already been removed from
+ * SwapCache and freed.
+ */
+ return NULL;
+
/*
* Initiate read into locked page and return.
*/
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 312fafe..1f6934c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -578,6 +578,7 @@ int free_swap_and_cache(swp_entry_t entry)
{
struct swap_info_struct *p;
struct page *page = NULL;
+ struct page *stale = NULL;

if (is_migration_entry(entry))
return 1;
@@ -587,7 +588,7 @@ int free_swap_and_cache(swp_entry_t entry)
if (swap_entry_free(p, entry) == 1) {
page = find_get_page(&swapper_space, entry.val);
if (page && !trylock_page(page)) {
- page_cache_release(page);
+ stale = page;
page = NULL;
}
}
@@ -606,9 +607,38 @@ int free_swap_and_cache(swp_entry_t entry)
unlock_page(page);
page_cache_release(page);
}
+ if (stale) {
+ mem_cgroup_fixup_swapfree(stale);
+ page_cache_release(stale);
+ }
return p != NULL;
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+int mem_cgroup_fixup_swapin(struct page *page)
+{
+ int ret = 0;
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(!PageSwapCache(page));
+
+ if (mem_cgroup_disabled())
+ return 0;
+
+ /* Used only by SwapCache ? */
+ if (unlikely(!page_swapcount(page))) {
+ get_page(page);
+ ret = remove_mapping(&swapper_space, page);
+ if (ret)
+ /* should be unlocked before being freed */
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+ return ret;
+}
+#endif
+
#ifdef CONFIG_HIBERNATION
/*
* Find the swap type that corresponds to given device (if any).
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..640bfb6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -785,6 +785,14 @@ activate_locked:
SetPageActive(page);
pgactivate++;
keep_locked:
+ if (!scanning_global_lru(sc) && PageSwapCache(page))
+ /*
+ * Free !PageCgroupUsed SwapCache here, because memcg
+ * cannot handle such SwapCache pages well.
+ * This can happen if the page is freed by the owner
+ * process before it is added to SwapCache.
+ */
+ mem_cgroup_free_unused_swapcache(page);
unlock_page(page);
keep:
list_add(&page->lru, &ret_pages);

2009-04-23 08:47:05

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller

On Thu, 23 Apr 2009 13:14:37 +0900
Daisuke Nishimura <[email protected]> wrote:

> > I'll dig and try more including another aproach..
> >
> How about this patch ?
>
> It seems to have been working fine for several hours.
> I should add more and more comments and clean it up, of course :)
> (I think it would be better to unify definitions of new functions to swapfile.c,
> and checking page_mapped() might be enough for mem_cgroup_free_unused_swapcache().)
>
> Signed-off-by: Daisuke Nishimura <[email protected]>

Hmm, I still think this patch is overkill.



(snip)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3ecea98..57d9678 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -310,6 +310,14 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> SetPageSwapBacked(new_page);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {
> + if (unlikely(mem_cgroup_fixup_swapin(new_page)))
> + /*
> + * new_page is not used by anyone.
> + * And it has been already removed from
> + * SwapCache and freed.
> + */
> + return NULL;
> +

Can't we check the refcnt of the swp_entry here, again?
If (refcnt == 1), we can mark this as STALE
(and can free the swap cache here).


Thanks,
-Kame

2009-04-24 04:34:54

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller

On Wed, 22 Apr 2009 14:38:33 +0900
Daisuke Nishimura <[email protected]> wrote:

> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > +extern void mem_cgroup_mark_swapcache_stale(struct page *page);
> > +extern void mem_cgroup_fixup_swapcache(struct page *page);
> > #else
> > static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> > {
> > }
> > +static void mem_cgroup_check_mark_swapcache_stale(struct page *page)
> > +{
> > +}
> > +static void mem_cgroup_fixup_swapcache(struct page *page)
> > +{
> > +}
> > #endif
> >
> I think they should be defined in MEM_RES_CTLR case.
> Exhausting swap entries problem is not depend on MEM_RES_CTLR_SWAP.
>
Could you explain this more ? I can't understand.


Thanks,
-Kame

2009-04-24 06:25:28

by Daisuke Nishimura

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for mem+swap controller

On Fri, 24 Apr 2009 13:33:06 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 22 Apr 2009 14:38:33 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > > +extern void mem_cgroup_mark_swapcache_stale(struct page *page);
> > > +extern void mem_cgroup_fixup_swapcache(struct page *page);
> > > #else
> > > static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
> > > {
> > > }
> > > +static void mem_cgroup_check_mark_swapcache_stale(struct page *page)
> > > +{
> > > +}
> > > +static void mem_cgroup_fixup_swapcache(struct page *page)
> > > +{
> > > +}
> > > #endif
> > >
> > I think they should be defined in MEM_RES_CTLR case.
> > Exhausting swap entries problem is not depend on MEM_RES_CTLR_SWAP.
> >
> Could you explain this more ? I can't understand.
>
A STALE (!PageCgroupUsed) SwapCache *without an owner process* can be created by
the race between exit()..free_swap_and_cache() and read_swap_cache_async() (type-1)
or between exit()..page_remove_rmap() and shrink_page_list() (type-2).
(I don't think a STALE SwapCache itself is problematic as long as there is
an actual user of the SwapCache.)

Those NOUSER STALE SwapCaches do NOT depend on MEM_RES_CTLR_SWAP.

If total_swap_size is small enough not to trigger a global LRU scan,
all swap space can be used up.
I confirmed before that all swap space was used up (and caused OOM)
with mem.limit=32M/total_swap_size=50M.


Thanks,
Daisuke Nishimura.

2009-04-24 07:30:29

by Kamezawa Hiroyuki

Subject: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

This is new one. (using new logic.) Maybe enough light-weight and caches all cases.

Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

Because the free_swap_and_cache() function is called under spinlocks,
it can't sleep and uses trylock_page() instead of lock_page().
Because of this, a swp_entry which is no longer used after zap_xx can persist as
SwapCache that will never be used.
This kind of SwapCache is reclaimed by the global LRU when it is found
during LRU rotation.

When memory cgroup is used, the global LRU will not be kicked and
stale Swap Caches will not be reclaimed. This is problematic because
memcg's swap entry accounting leaks and memcg can't know it.
To catch this stale SwapCache, we have to chase it and check again
whether the swap entry is alive.

This patch adds a function to chase stale swap caches and reclaim them
in a moderate way. When zap_xxx fails to remove a swap entry, it will be
recorded into a buffer and memcg's "work" will reclaim it later.
No sleep, no memory allocation under free_swap_and_cache().

This patch also adds stale-swap-cache-congestion logic and tries to avoid
having too many stale swap caches at the same time.

The implementation is naive but the cost may meet the trade-off.

How to test:
1. set the memory limit very small (1-2M?).
2. run some number of programs and trigger page reclaim/swap-in.
3. kill the programs by SIGKILL etc. ... then, Stale Swap Caches will
increase. After this patch, stale swap caches are reclaimed
and the mem+swap controller will not go to OOM.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/swap.h | 16 ++++++
mm/memcontrol.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 11 +++-
mm/swapfile.c | 11 ++++
mm/vmscan.c | 3 +
5 files changed, 168 insertions(+), 1 deletion(-)

Index: mmotm-2.6.30-Apr21/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-Apr21.orig/include/linux/swap.h
+++ mmotm-2.6.30-Apr21/include/linux/swap.h
@@ -336,11 +336,27 @@ static inline void disable_swap_token(vo

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void memcg_mark_swapent_stale(swp_entry_t ent);
+extern void memcg_sanity_check_swapin(struct page *page, swp_entry_t ent);
+extern int memcg_stale_swap_congestion(void);
#else
static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
}
+
+static inline void memcg_mark_swapent_stale(swp_entry_t ent)
+{
+}
+
+static inline void memcg_sanity_check_swapin(struct page *page, swp_entry_t ent)
+{
+}
+
+static inline int memcg_stale_swap_congestion(void)
+{
+ return 0;
+}
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
Index: mmotm-2.6.30-Apr21/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-Apr21.orig/mm/memcontrol.c
+++ mmotm-2.6.30-Apr21/mm/memcontrol.c
@@ -1702,6 +1702,131 @@ int mem_cgroup_shmem_charge_fallback(str
return ret;
}

+#ifdef CONFIG_SWAP
+/*
+ * Stale Swap Cache Handler.
+ * Stale Swap Cache is a Swap Cache which will never be used. In general,
+ * Swap Cache is zapped by free_swap_and_cache() (via zap_pte_range() etc.).
+ * But in a racy case, free_swap_and_cache() doesn't free swap entries and
+ * it's expected that the Swap Cache will be freed by global LRU rotation.
+ *
+ * But if memory cgroup is used, global LRU rotation may not happen and
+ * a Stale Swap Cache (and its unused swap entry) will never be reclaimed. In
+ * the worst case, this can cause OOM (under memcg) and other problems.
+ *
+ * Following is GC code for stale swap caches.
+ */
+
+#define STALE_ENTS (512)
+#define STALE_ENTS_MAP (STALE_ENTS/BITS_PER_LONG)
+
+static struct stale_swap_control {
+ spinlock_t lock;
+ int num;
+ int congestion;
+ unsigned long usemap[STALE_ENTS_MAP];
+ swp_entry_t ents[STALE_ENTS];
+ struct delayed_work gc_work;
+} ssc;
+
+static void memcg_fixup_stale_swapcache(struct work_struct *work)
+{
+ int pos = 0;
+ swp_entry_t entry;
+ struct page *page;
+ int forget, ret;
+
+ while (ssc.num) {
+ spin_lock(&ssc.lock);
+ pos = find_next_bit(ssc.usemap, STALE_ENTS, pos);
+ spin_unlock(&ssc.lock);
+
+ if (pos >= STALE_ENTS)
+ break;
+
+ entry = ssc.ents[pos];
+
+ forget = 1;
+ page = lookup_swap_cache(entry);
+ if (page) {
+ lock_page(page);
+ ret = try_to_free_swap(page);
+ /* If it's still under I/O, don't forget it */
+ if (!ret && PageWriteback(page))
+ forget = 0;
+ unlock_page(page);
+ }
+ if (forget) {
+ spin_lock(&ssc.lock);
+ clear_bit(pos, ssc.usemap);
+ ssc.num--;
+ if (ssc.num < STALE_ENTS/2)
+ ssc.congestion = 0;
+ spin_unlock(&ssc.lock);
+ }
+ pos++;
+ }
+ if (ssc.num) /* schedule me again */
+ schedule_delayed_work(&ssc.gc_work, HZ/10);
+ return;
+}
+
+static void schedule_ssc_gc(void)
+{
+ /* 100ms margin to wait for the page to be unlocked */
+ schedule_delayed_work(&ssc.gc_work, HZ/10);
+}
+
+/* We found lock_page() contention at zap_page. then revisit this later */
+void memcg_mark_swapent_stale(swp_entry_t ent)
+{
+ int pos;
+
+ spin_lock(&ssc.lock);
+ WARN_ON(ssc.num >= STALE_ENTS);
+ if (ssc.num < STALE_ENTS) {
+ pos = find_first_zero_bit(ssc.usemap, STALE_ENTS);
+ ssc.ents[pos] = ent;
+ set_bit(pos, ssc.usemap);
+ ssc.num++;
+ if (ssc.num > STALE_ENTS/2)
+ ssc.congestion = 1;
+ }
+ spin_unlock(&ssc.lock);
+ schedule_ssc_gc();
+}
+
+/* If too many stale swap caches, avoid too much swap I/O */
+int memcg_stale_swap_congestion(void)
+{
+ smp_mb();
+ if (ssc.congestion) {
+ schedule_ssc_gc();
+ return 1;
+ }
+ return 0;
+}
+
+static void setup_stale_swapcache_control(void)
+{
+ memset(&ssc, 0, sizeof(ssc));
+ spin_lock_init(&ssc.lock);
+ INIT_DELAYED_WORK(&ssc.gc_work, memcg_fixup_stale_swapcache);
+}
+
+#else
+
+int memcg_stale_swap_congestion(void)
+{
+ return 0;
+}
+
+static void setup_stale_swapcache_control(void)
+{
+}
+
+#endif /* CONFIG_SWAP */
+
static DEFINE_MUTEX(set_limit_mutex);

static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
@@ -2464,6 +2589,7 @@ static struct mem_cgroup *parent_mem_cgr
return mem_cgroup_from_res_counter(mem->res.parent, res);
}

+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static void __init enable_swap_cgroup(void)
{
@@ -2493,6 +2619,7 @@ mem_cgroup_create(struct cgroup_subsys *
/* root ? */
if (cont->parent == NULL) {
enable_swap_cgroup();
+ setup_stale_swapcache_control();
parent = NULL;
} else {
parent = mem_cgroup_from_cont(cont->parent);
@@ -2588,3 +2715,4 @@ static int __init disable_swap_account(c
}
__setup("noswapaccount", disable_swap_account);
#endif
+
Index: mmotm-2.6.30-Apr21/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-Apr21.orig/mm/swap_state.c
+++ mmotm-2.6.30-Apr21/mm/swap_state.c
@@ -313,6 +313,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
+ memcg_sanity_check_swapin(new_page, entry);
lru_cache_add_anon(new_page);
swap_readpage(NULL, new_page);
return new_page;
@@ -360,8 +361,16 @@ struct page *swapin_readahead(swp_entry_
* No, it's very unlikely that swap layout would follow vma layout,
* more likely that neighbouring swap pages came from the same node:
* so use the same "addr" to choose the same node for each swap read.
+ *
+ * If memory cgroup is used, a Stale Swap Cache congestion check is
+ * done and no readahead is performed if there are too many stale
+ * swap caches.
*/
- nr_pages = valid_swaphandles(entry, &offset);
+ if (memcg_stale_swap_congestion()) {
+ offset = swp_offset(entry);
+ nr_pages = 1;
+ } else
+ nr_pages = valid_swaphandles(entry, &offset);
+
for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
Index: mmotm-2.6.30-Apr21/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-Apr21.orig/mm/swapfile.c
+++ mmotm-2.6.30-Apr21/mm/swapfile.c
@@ -570,6 +570,16 @@ int try_to_free_swap(struct page *page)
return 1;
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+void memcg_sanity_check_swapin(struct page *page, swp_entry_t entry)
+{
+ VM_BUG_ON(!PageSwapCache(page));
+ VM_BUG_ON(!PageLocked(page));
+ /* This page is Locked */
+ if (!page_swapcount(page))
+ memcg_mark_swapent_stale(entry);
+}
+#endif
/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
@@ -589,6 +599,7 @@ int free_swap_and_cache(swp_entry_t entr
if (page && !trylock_page(page)) {
page_cache_release(page);
page = NULL;
+ memcg_mark_swapent_stale(entry);
}
}
spin_unlock(&swap_lock);
Index: mmotm-2.6.30-Apr21/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-Apr21.orig/mm/vmscan.c
+++ mmotm-2.6.30-Apr21/mm/vmscan.c
@@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
+ /* avoid making more stale swap caches */
+ if (memcg_stale_swap_congestion())
+ goto keep_locked;
if (!add_to_swap(page))
goto activate_locked;
may_enter_fs = 1;

2009-04-24 08:09:41

by Daisuke Nishimura

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

On Fri, 24 Apr 2009 16:28:40 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
(snip)
>
Thank you for your patch!

It seems good at first glance.
I'll test it this weekend.

Thanks,
Daisuke Nishimura.

2009-04-25 13:30:04

by Daisuke Nishimura

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

> +static void memcg_fixup_stale_swapcache(struct work_struct *work)
> +{
> + int pos = 0;
> + swp_entry_t entry;
> + struct page *page;
> + int forget, ret;
> +
> + while (ssc.num) {
> + spin_lock(&ssc.lock);
> + pos = find_next_bit(ssc.usemap, STALE_ENTS, pos);
> + spin_unlock(&ssc.lock);
> +
> + if (pos >= STALE_ENTS)
> + break;
> +
> + entry = ssc.ents[pos];
> +
> + forget = 1;
> + page = lookup_swap_cache(entry);
> + if (page) {
> + lock_page(page);
> + ret = try_to_free_swap(page);
> + /* If it's still under I/O, don't forget it */
> + if (!ret && PageWriteback(page))
> + forget = 0;
> + unlock_page(page);
I think we need page_cache_release().
lookup_swap_cache() gets the page.

> + }
> + if (forget) {
> + spin_lock(&ssc.lock);
> + clear_bit(pos, ssc.usemap);
> + ssc.num--;
> + if (ssc.num < STALE_ENTS/2)
> + ssc.congestion = 0;
> + spin_unlock(&ssc.lock);
> + }
> + pos++;
> + }
> + if (ssc.num) /* schedule me again */
> + schedule_delayed_work(&ssc.gc_work, HZ/10);
"if (ssc.congestion)" would be better ?

> + return;
> +}
> +

(snip)

> Index: mmotm-2.6.30-Apr21/mm/vmscan.c
> ===================================================================
> --- mmotm-2.6.30-Apr21.orig/mm/vmscan.c
> +++ mmotm-2.6.30-Apr21/mm/vmscan.c
> @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
> if (PageAnon(page) && !PageSwapCache(page)) {
> if (!(sc->gfp_mask & __GFP_IO))
> goto keep_locked;
> + /* avoid making more stale swap caches */
> + if (memcg_stale_swap_congestion())
> + goto keep_locked;
> if (!add_to_swap(page))
> goto activate_locked;
> may_enter_fs = 1;
>
Hmm, I don't think this can avoid type-2 stale swap caches.
IIUC, this can only avoid add_to_swap() if the number of stale swap caches
exceeds some threshold, but type-2 swap caches (set !PageCgroupUsed by the
owner process via page_remove_rmap()->mem_cgroup_uncharge_page() before
being added to swap cache) are not handled as 'stale'.

In fact, I can see the usage of SwapCache increasing gradually.

Can you add a patch like below?

Thanks,
Daisuke Nishimura.
===
From: Daisuke Nishimura <[email protected]>

Instead of checking memcg_stale_swap_congestion() before add_to_swap(),
free "unused" swap cache after add_to_swap().

IMHO, it would be better to let shrink_page_list() free as many pages
as possible, so free type-2 stale swap caches directly, instead of
handling them in a lazy manner.
shrink_page_list() calls try_to_free_swap() already in some paths.
(e.g. pageout()->swap_writepage()->try_to_free_swap())

Signed-off-by: Daisuke Nishimura <[email protected]>
---
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e6519c..51c6985 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -339,6 +339,7 @@ extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
extern void memcg_mark_swapent_stale(swp_entry_t ent);
extern void memcg_sanity_check_swapin(struct page *page, swp_entry_t ent);
extern int memcg_stale_swap_congestion(void);
+extern int memcg_free_unused_swapcache(struct page *page);
#else
static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
@@ -357,6 +358,11 @@ static inline int memcg_stale_swap_congestion(void)
{
return 0;
}
+
+static inline int memcg_free_unused_swapcache(struct page *page)
+{
+ return 0;
+}
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ccc69b4..822a914 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1754,6 +1754,17 @@ static void setup_stale_swapcache_control(void)
INIT_DELAYED_WORK(&ssc.gc_work, memcg_fixup_stale_swapcache);
}

+int memcg_free_unused_swapcache(struct page *page)
+{
+ VM_BUG_ON(!PageSwapCache(page));
+ VM_BUG_ON(!PageLocked(page));
+
+ if (mem_cgroup_disabled())
+ return 0;
+ if (!PageAnon(page) || page_mapped(page))
+ return 0;
+ return try_to_free_swap(page); /* checks page_swapcount */
+}
#else

int memcg_stale_swap_congestion(void)
@@ -1765,6 +1776,10 @@ static void setup_stale_swapcache_control(void)
{
}

+int memcg_free_unused_swapcache(struct page *page)
+{
+ return 0;
+}
#endif /* CONFIG_SWAP */

static DEFINE_MUTEX(set_limit_mutex);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 054ed38..5b9aa8e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -654,11 +654,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
- /* avoid making more stale swap caches */
- if (memcg_stale_swap_congestion())
- goto keep_locked;
if (!add_to_swap(page))
goto activate_locked;
+ /*
+ * The owner process might have uncharged the page
+ * (by page_remove_rmap()) before it has been added
+ * to swap cache.
+ * Check it here to avoid making it stale.
+ */
+ if (memcg_free_unused_swapcache(page))
+ goto keep_locked;
may_enter_fs = 1;
}

===

2009-04-25 16:07:26

by Daisuke Nishimura

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

A few minor nitpicks :)

> > +static void memcg_fixup_stale_swapcache(struct work_struct *work)
> > +{
> > + int pos = 0;
> > + swp_entry_t entry;
> > + struct page *page;
> > + int forget, ret;
> > +
> > + while (ssc.num) {
> > + spin_lock(&ssc.lock);
> > + pos = find_next_bit(ssc.usemap, STALE_ENTS, pos);
> > + spin_unlock(&ssc.lock);
> > +
> > + if (pos >= STALE_ENTS)
> > + break;
> > +
> > + entry = ssc.ents[pos];
> > +
> > + forget = 1;
> > + page = lookup_swap_cache(entry);
I think using find_get_page() would be better.
lookup_swap_cache() updates swapcache_info.

> > + if (page) {
> > + lock_page(page);
> > + ret = try_to_free_swap(page);
> > + /* If it's still under I/O, don't forget it */
> > + if (!ret && PageWriteback(page))
> > + forget = 0;
> > + unlock_page(page);
> I think we need page_cache_release().
> lookup_swap_cache() gets the page.
>
> > + }
> > + if (forget) {
> > + spin_lock(&ssc.lock);
> > + clear_bit(pos, ssc.usemap);
> > + ssc.num--;
> > + if (ssc.num < STALE_ENTS/2)
> > + ssc.congestion = 0;
> > + spin_unlock(&ssc.lock);
> > + }
> > + pos++;
> > + }
> > + if (ssc.num) /* schedule me again */
> > + schedule_delayed_work(&ssc.gc_work, HZ/10);
We can use schedule_ssc_gc() here.
(It should be defined before this, of course. And can be inlined.)

> "if (ssc.congestion)" would be better ?
>
> > + return;
> > +}
> > +
>

Thanks,
Daisuke Nishimura.

2009-04-27 07:41:23

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

On Sun, 26 Apr 2009 01:06:58 +0900
Daisuke Nishimura <[email protected]> wrote:

> A few minor nitpicks :)
>
> > > +static void memcg_fixup_stale_swapcache(struct work_struct *work)
> > > +{
> > > + int pos = 0;
> > > + swp_entry_t entry;
> > > + struct page *page;
> > > + int forget, ret;
> > > +
> > > + while (ssc.num) {
> > > + spin_lock(&ssc.lock);
> > > + pos = find_next_bit(ssc.usemap, STALE_ENTS, pos);
> > > + spin_unlock(&ssc.lock);
> > > +
> > > + if (pos >= STALE_ENTS)
> > > + break;
> > > +
> > > + entry = ssc.ents[pos];
> > > +
> > > + forget = 1;
> > > + page = lookup_swap_cache(entry);
> I think using find_get_page() would be better.
> lookup_swap_cache() update swapcache_info.
>

ok, and I have to add put_page() somewhere.

> > > + if (page) {
> > > + lock_page(page);
> > > + ret = try_to_free_swap(page);
> > > + /* If it's still under I/O, don't forget it */
> > > + if (!ret && PageWriteback(page))
> > > + forget = 0;
> > > + unlock_page(page);
> > I think we need page_cache_release().
> > lookup_swap_cache() gets the page.
> >
> > > + }
> > > + if (forget) {
> > > + spin_lock(&ssc.lock);
> > > + clear_bit(pos, ssc.usemap);
> > > + ssc.num--;
> > > + if (ssc.num < STALE_ENTS/2)
> > > + ssc.congestion = 0;
> > > + spin_unlock(&ssc.lock);
> > > + }
> > > + pos++;
> > > + }
> > > + if (ssc.num) /* schedule me again */
> > > + schedule_delayed_work(&ssc.gc_work, HZ/10);
> We can use schedule_ssc_gc() here.
> (It should be defined before this, of course. And can be inlined.)
>
Sure.

will soon post v2. (removing RFC)

Thanks,
-Kame
> > "if (ssc.congestion)" would be better ?
> >
> > > + return;
> > > +}
> > > +
> >
>
> Thanks,
> Daisuke Nishimura.
>

2009-04-27 08:13:20

by Balbir Singh

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-24 16:28:40]:

> This is new one. (using new logic.) Maybe enough light-weight and caches all cases.

You sure mean catches above :)


>
> Thanks,
> -Kame
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
(snip)
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Quick comment on the design

1. I like the marking of swap cache entries as stale
2. Can't we reclaim stale entries during memcg LRU reclaim? Why write
a GC for it?

--
Balbir

2009-04-27 08:23:17

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

On Mon, 27 Apr 2009 13:42:06 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-24 16:28:40]:
>
> > This is new one. (using new logic.) Maybe enough light-weight and caches all cases.
>
> You sure mean catches above :)
>
>
> > [snip: patch description quoted in full above]
>
> Quick comment on the design
>
> 1. I like the marking of swap cache entries as stale

I'd like to, but there is no space to record it as stale. And races make
that difficult even if we had enough space. If you read the whole thread,
you'll see there are many race patterns.

> 2. Can't we reclaim stale entries during memcg LRU reclaim? Why write
> a GC for it?
>
Because they are not on the memcg LRU, we can't reclaim them via memcg LRU scanning.
(See the first mail from Nishimura in this thread. It explains this well.)

One easy case is here.

- CPU0 calls zap_pte() -> free_swap_and_cache()
- CPU1 tries to swap the entry in.
In this case, free_swap_and_cache() doesn't free the swp_entry, and the page
is read into memory. But it will never be added to memcg's LRU until
it's mapped.
(What we have to consider here is swapin-readahead. It can swap pages in
even when they are not accessed, so this race window is larger than expected.)

We can't use memcg's LRU then... what we can do is:

- scan the whole global LRU,
or
- use some trick to reclaim them lazily.


Thanks,
-Kame

2009-04-27 08:44:56

by Balbir Singh

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-27 17:21:19]:

> On Mon, 27 Apr 2009 13:42:06 +0530
> Balbir Singh <[email protected]> wrote:
>
> > * KAMEZAWA Hiroyuki <[email protected]> [2009-04-24 16:28:40]:
> >
> > > [snip: patch description and reply quoted in full above]
> >
> > Quick comment on the design
> >
> > 1. I like the marking of swap cache entries as stale
>
> I'd like to, but there is no space to record it as stale. And races make
> that difficult even if we had enough space. If you read the whole thread,
> you'll see there are many race patterns.

There have been several iterations of this discussion, summarizing it
would be nice, let me find the thread.

>
> > 2. Can't we reclaim stale entries during memcg LRU reclaim? Why write
> > a GC for it?
> >
> Because they are not on the memcg LRU, we can't reclaim them via memcg LRU scanning.
> (See the first mail from Nishimura in this thread. It explains this well.)
>

Hmm.. I can't find it, let me do a more exhaustive search on the web.
If the entry is stale and not on the memcg LRU, is it still accounted to
the memcg?

> One easy case is here.
>
> > - CPU0 calls zap_pte() -> free_swap_and_cache()
> > - CPU1 tries to swap the entry in.
> > In this case, free_swap_and_cache() doesn't free the swp_entry, and the page
> > is read into memory. But it will never be added to memcg's LRU until
> > it's mapped.

That is strange.. not even added to the LRU as a cached page?

> > (What we have to consider here is swapin-readahead. It can swap pages in
> > even when they are not accessed, so this race window is larger than expected.)
> >
> > We can't use memcg's LRU then... what we can do is:
> >
> > - scan the whole global LRU,
> > or
> > - use some trick to reclaim them lazily.
>

Thanks for being patient, some of these questions have been discussed
before I suppose. Let me dig out the thread.

--
Balbir

2009-04-27 08:51:30

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH] fix swap entries is not reclaimed in proper way for memg v3.

On Mon, 27 Apr 2009 14:13:47 +0530
Balbir Singh <[email protected]> wrote:
> > I'd like to, but there is no space to record it as stale. And races make
> > that difficult even if we had enough space. If you read the whole thread,
> > you'll see there are many race patterns.
>
> There have been several iterations of this discussion, summarizing it
> would be nice, let me find the thread.
>
At first, it's obvious that there is no free space in the swap entry array or
the swap_cgroup array. (And this can be trouble even if MEM_RES_CONTROLLER_SWAP_EXT
is not used.)

I tried to record the "stale" information in page_cgroup with a flag, but the
following sequence makes that impossible:

==
CPU0 (zap_pte)                 CPU1 (read swap)
                               swap_duplicate()
free_swapentry()
                               add_to_swap_cache()
==
In this case, we can't know at zap_pte() whether the swap_entry is stale or not.



> >
> > > 2. Can't we reclaim stale entries during memcg LRU reclaim? Why write
> > > a GC for it?
> > >
> > Because they are not on memcg LRU. we can't reclaim it by memcg LRU.
> > (See the first mail from Nishimura of this thread. It explains well.)
> >
>
> Hmm.. I don't find it, let me do a more exhaustive search on the web.
> If the entry is stale and not on memcg LRU, it is still accounted to
> the memcg?
Yes, it is accounted to memcg's memsw.usage_in_bytes.


>
> > One easy case is here.
> >
> > - CPU0 calls zap_pte() -> free_swap_and_cache()
> > - CPU1 tries to swap the entry in.
> > In this case, free_swap_and_cache() doesn't free the swp_entry, and the page
> > is read into memory. But it will never be added to memcg's LRU until
> > it's mapped.
>
> That is strange.. not even added to the LRU as a cached page?
>
It is added to the "global" LRU but not to memcg's LRU, because the "USED" bit is not set.


> > (What we have to consider here is swapin-readahead. It can swap pages in
> > even when they are not accessed, so this race window is larger than expected.)
> >
> > We can't use memcg's LRU then... what we can do is:
> >
> > - scan the whole global LRU,
> > or
> > - use some trick to reclaim them lazily.
> >
>
> Thanks for being patient, some of these questions have been discussed
> before I suppose. Let me dig out the thread.
>

Sorry for the lack of explanation. I'll add more text to the v4 patch.

Thanks,
-kame