LinuxLists.cc - [PATCH] fix leak of swap accounting as stale swap cache under memcg

2009-04-27 09:14:45

Subject: [PATCH] fix leak of swap accounting as stale swap cache under memcg

Works very well under my test as following.
prepare a program which does malloc, touch pages repeatedly.

# echo 2M > /cgroup/A/memory.limit_in_bytes # set limit to 2M.
# echo 0 > /cgroup/A/tasks. # add shell to the group.

while true; do
malloc_and_touch 1M & # run malloc and touch program.
malloc_and_touch 1M &
malloc_and_touch 1M &
sleep 3
pkill malloc_and_touch # kill them
done

Then, you can see memory.memsw.usage_in_bytes increase gradually and exceeds 3M bytes.
This means account for swp_entry is not reclaimed at kill -> exit-> zap_pte()
because of race with swap-ops and zap_pte() under memcg.

==
From: KAMEZAWA Hiroyuki <[email protected]>

Because free_swap_and_cache() function is called under spinlocks,
it can't sleep and use trylock_page() instead of lock_page().
By this, swp_entry which is not used after zap_xx can exists as
SwapCache, which will be never used.
This kind of SwapCache is reclaimed by global LRU when it's found
at LRU rotation. Typical case is following.

(CPU0 zap_pte) (CPU1 swapin-readahead)
zap_pte() swap_duplicate()
swap_entry_free()
-> nothing to do
swap will be read in.

(This race window is wider than expected because of readahead)

When memory cgroup is used, the global LRU will not be kicked and
stale Swap Caches will not be reclaimed. Newly read-in swap cache is
not accounted and not added to memcg's LRU until it's mapped.
So, memcg itself cant reclaim it but swp_entry is freed until
global LRU finds it.

This is problematic because memcg's swap entry accounting is leaked
memcg can't know it. To catch this stale SwapCache, we have to chase it
and check the swap is alive or not again.

For chasing all swap entry, we need amount of memory but we don't
have enough space and it seems overkill. But, because stale-swap-cache
can be short-lived if we free it in proper way, we can check them
and sweep them out in lazy way with (small) static size buffer.

This patch adds a function to chase stale swap cache and reclaim it.
When zap_xxx fails to remove swap ent, it will be recoreded into buffer
and memcg's sweep routine will reclaim it later.
No sleep, no memory allocation under free_swap_and_cache().

This patch also adds stale-swap-cache-congestion logic and try to avoid to
have too much stale swap caches at once.

Implementation is naive but maybe the cost meets trade-off.

How to test:
1. set limit of memory to very small (1-2M?).
2. run some amount of program and run page reclaim/swap-in.
3. kill programs by SIGKILL etc....then, Stale Swap Cache will
be increased. After this patch, stale swap caches are reclaimed
and mem+swap controller will not go to OOM.

Changelog:v3->v4
- replace lookup_swap_cache() with find_get_page().
- clean up.
- added put_page().
- fixed compilation under various CONFIG.
V3 was completely new.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/swap.h | 20 +++++++
mm/memcontrol.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 11 +++-
mm/swapfile.c | 11 ++++
mm/vmscan.c | 3 +
5 files changed, 173 insertions(+), 1 deletion(-)

Index: mmotm-2.6.30-Apr24/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-Apr24.orig/include/linux/swap.h
+++ mmotm-2.6.30-Apr24/include/linux/swap.h
@@ -336,11 +336,27 @@ static inline void disable_swap_token(vo

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern void memcg_mark_swapent_stale(swp_entry_t ent);
+extern void memcg_sanity_check_swapin(struct page *page, swp_entry_t ent);
+extern int memcg_stale_swap_congestion(void);
#else
static inline void
mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
{
}
+
+static inline void memcg_mark_swapent_stale(swp_entry_t ent)
+{
+}
+
+static inline void memcg_sanity_check_swapin(struct page *page, swp_entry_t ent)
+{
+}
+
+static inline int memcg_stale_swap_congestion(void)
+{
+ return 0;
+}
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
@@ -436,6 +452,10 @@ static inline int mem_cgroup_cache_charg
{
return 0;
}
+static inline int memcg_stale_swap_congestion(void)
+{
+ return 0;
+}

#endif /* CONFIG_SWAP */
#endif /* __KERNEL__*/
Index: mmotm-2.6.30-Apr24/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/memcontrol.c
+++ mmotm-2.6.30-Apr24/mm/memcontrol.c
@@ -1702,6 +1702,132 @@ int mem_cgroup_shmem_charge_fallback(str
return ret;
}

+#ifdef CONFIG_SWAP
+/*
+ * Stale Swap Cache Handler.
+ * Stale Swap Cache is a Swap Cache which will never be used. In general,
+ * Swap Cache is zapped by free_swap_and_cache() (via zap_pte_range() etc.).
+ * But in racy case, free_swap_and_cache() doesn't free swap entries and
+ * it's expected that Swap Cache will be freed by global LRU rotation.
+ *
+ * But if memory cgroup is used, global lru rotation may not happen and
+ * Stale Swap Cache (and unused swap entry) will never be reclaimed. In bad
+ * case, this can cause OOM (under memcg) and other problems.
+ *
+ * Following is GC code for stale swap caches.
+ */
+
+#define STALE_ENTS (512)
+#define STALE_ENTS_MAP (STALE_ENTS/BITS_PER_LONG)
+
+static struct stale_swap_control {
+ spinlock_t lock;
+ int num;
+ int congestion;
+ unsigned long usemap[STALE_ENTS_MAP];
+ swp_entry_t ents[STALE_ENTS];
+ struct delayed_work gc_work;
+} ssc;
+
+static void schedule_ssc_gc(void)
+{
+ /* 10ms margin to wait for a page unlocked */
+ schedule_delayed_work(&ssc.gc_work, HZ/10);
+}
+
+static void memcg_fixup_stale_swapcache(struct work_struct *work)
+{
+ int pos = 0;
+ swp_entry_t entry;
+ struct page *page;
+ int forget, ret;
+
+ while (ssc.num) {
+ spin_lock(&ssc.lock);
+ pos = find_next_bit(ssc.usemap, STALE_ENTS, pos);
+ spin_unlock(&ssc.lock);
+
+ if (pos >= STALE_ENTS)
+ break;
+
+ entry = ssc.ents[pos];
+
+ forget = 1;
+ /*
+ * Because lookup_swap_cache() increases statistics,
+ * call find_get_page() directly.
+ */
+ page = find_get_page(&swapper_space, entry.val);
+ if (page) {
+ lock_page(page);
+ ret = try_to_free_swap(page);
+ /* If it's still under I/O, don't forget it */
+ if (!ret && PageWriteback(page))
+ forget = 0;
+ unlock_page(page);
+ put_page(page);
+ }
+ if (forget) {
+ spin_lock(&ssc.lock);
+ clear_bit(pos, ssc.usemap);
+ ssc.num--;
+ if (ssc.num < STALE_ENTS/2)
+ ssc.congestion = 0;
+ spin_unlock(&ssc.lock);
+ }
+ pos++;
+ }
+ if (ssc.num) /* schedule me again */
+ schedule_ssc_gc();
+ return;
+}
+
+
+/* We found lock_page() contention at zap_page. then revisit this later */
+void memcg_mark_swapent_stale(swp_entry_t ent)
+{
+ int pos;
+
+ spin_lock(&ssc.lock);
+ WARN_ON(ssc.num >= STALE_ENTS);
+ if (ssc.num < STALE_ENTS) {
+ pos = find_first_zero_bit(ssc.usemap, STALE_ENTS);
+ ssc.ents[pos] = ent;
+ set_bit(pos, ssc.usemap);
+ ssc.num++;
+ if (ssc.num > STALE_ENTS/2)
+ ssc.congestion = 1;
+ }
+ spin_unlock(&ssc.lock);
+ schedule_ssc_gc();
+}
+
+/* If too many stale swap caches, avoid too much swap I/O */
+int memcg_stale_swap_congestion(void)
+{
+ smp_mb();
+ if (ssc.congestion) {
+ schedule_ssc_gc();
+ return 1;
+ }
+ return 0;
+}
+
+static void setup_stale_swapcache_control(void)
+{
+ memset(&ssc, 0, sizeof(ssc));
+ spin_lock_init(&ssc.lock);
+ INIT_DELAYED_WORK(&ssc.gc_work, memcg_fixup_stale_swapcache);
+}
+
+#else
+
+static void setup_stale_swapcache_control(void)
+{
+}
+
+#endif /* CONFIG_SWAP */
+
static DEFINE_MUTEX(set_limit_mutex);

static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
@@ -2464,6 +2590,7 @@ static struct mem_cgroup *parent_mem_cgr
return mem_cgroup_from_res_counter(mem->res.parent, res);
}

+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
static void __init enable_swap_cgroup(void)
{
@@ -2493,6 +2620,7 @@ mem_cgroup_create(struct cgroup_subsys *
/* root ? */
if (cont->parent == NULL) {
enable_swap_cgroup();
+ setup_stale_swapcache_control();
parent = NULL;
} else {
parent = mem_cgroup_from_cont(cont->parent);
@@ -2588,3 +2716,4 @@ static int __init disable_swap_account(c
}
__setup("noswapaccount", disable_swap_account);
#endif
+
Index: mmotm-2.6.30-Apr24/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/swap_state.c
+++ mmotm-2.6.30-Apr24/mm/swap_state.c
@@ -313,6 +313,7 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
+ memcg_sanity_check_swapin(new_page, entry);
lru_cache_add_anon(new_page);
swap_readpage(NULL, new_page);
return new_page;
@@ -360,8 +361,16 @@ struct page *swapin_readahead(swp_entry_
* No, it's very unlikely that swap layout would follow vma layout,
* more likely that neighbouring swap pages came from the same node:
* so use the same "addr" to choose the same node for each swap read.
+ *
+ * If memory cgroup is used, Stale Swap Cache congestion check is
+ * done and no readahed if there are too much stale swap caches.
*/
- nr_pages = valid_swaphandles(entry, &offset);
+ if (memcg_stale_swap_congestion()) {
+ offset = swp_offset(entry);
+ nr_pages = 1;
+ } else
+ nr_pages = valid_swaphandles(entry, &offset);
+
for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
/* Ok, do the async read-ahead now */
page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
Index: mmotm-2.6.30-Apr24/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/swapfile.c
+++ mmotm-2.6.30-Apr24/mm/swapfile.c
@@ -570,6 +570,16 @@ int try_to_free_swap(struct page *page)
return 1;
}

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+void memcg_sanity_check_swapin(struct page *page, swp_entry_t entry)
+{
+ VM_BUG_ON(!PageSwapCache(page));
+ VM_BUG_ON(!PageLocked(page));
+ /* This page is Locked */
+ if (!page_swapcount(page))
+ memcg_mark_swapent_stale(entry);
+}
+#endif
/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
@@ -589,6 +599,7 @@ int free_swap_and_cache(swp_entry_t entr
if (page && !trylock_page(page)) {
page_cache_release(page);
page = NULL;
+ memcg_mark_swapent_stale(entry);
}
}
spin_unlock(&swap_lock);
Index: mmotm-2.6.30-Apr24/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
+++ mmotm-2.6.30-Apr24/mm/vmscan.c
@@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
+ /* avoid making more stale swap caches */
+ if (memcg_stale_swap_congestion())
+ goto keep_locked;
if (!add_to_swap(page))
goto activate_locked;
may_enter_fs = 1;

2009-04-27 10:14:32

by Balbir Singh

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-27 18:12:59]:

> Works very well under my test as following.
> prepare a program which does malloc, touch pages repeatedly.
>
> # echo 2M > /cgroup/A/memory.limit_in_bytes # set limit to 2M.
> # echo 0 > /cgroup/A/tasks. # add shell to the group.
>
> while true; do
> malloc_and_touch 1M & # run malloc and touch program.
> malloc_and_touch 1M &
> malloc_and_touch 1M &
> sleep 3
> pkill malloc_and_touch # kill them
> done
>
> Then, you can see memory.memsw.usage_in_bytes increase gradually and exceeds 3M bytes.
> This means account for swp_entry is not reclaimed at kill -> exit-> zap_pte()
> because of race with swap-ops and zap_pte() under memcg.
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Because free_swap_and_cache() function is called under spinlocks,
> it can't sleep and use trylock_page() instead of lock_page().
> By this, swp_entry which is not used after zap_xx can exists as
> SwapCache, which will be never used.
> This kind of SwapCache is reclaimed by global LRU when it's found
> at LRU rotation. Typical case is following.
>

The changelog is not clear, this is the typical case for?

> (CPU0 zap_pte) (CPU1 swapin-readahead)
> zap_pte() swap_duplicate()
> swap_entry_free()
> -> nothing to do
> swap will be read in.
>
> (This race window is wider than expected because of readahead)
>

This should happen when the page is undergoing IO and this page_lock
is not available. BTW, do we need page_lock to uncharge the page from
the memory resource controller?

> When memory cgroup is used, the global LRU will not be kicked and
> stale Swap Caches will not be reclaimed. Newly read-in swap cache is
> not accounted and not added to memcg's LRU until it's mapped.

^^^^^^^ I thought it was accounted for but not on LRU

> So, memcg itself cant reclaim it but swp_entry is freed untila
^ not?
> global LRU finds it.
>
> This is problematic because memcg's swap entry accounting is leaked
> memcg can't know it. To catch this stale SwapCache, we have to chase it
> and check the swap is alive or not again.
>
> For chasing all swap entry, we need amount of memory but we don't
> have enough space and it seems overkill. But, because stale-swap-cache
> can be short-lived if we free it in proper way, we can check them
> and sweep them out in lazy way with (small) static size buffer.
>
> This patch adds a function to chase stale swap cache and reclaim it.
> When zap_xxx fails to remove swap ent, it will be recoreded into buffer
> and memcg's sweep routine will reclaim it later.
> No sleep, no memory allocation under free_swap_and_cache().
>
> This patch also adds stale-swap-cache-congestion logic and try to avoid to
> have too much stale swap caches at once.
>
> Implementation is naive but maybe the cost meets trade-off.
>

To be honest, I don't like the code complexity added, that is why I
want to explore more before agreeing to add an entire GC. We could
consider using pagevecs, but we might not need some of the members
like cold. I know you and Daisuke have worked hard on this problem, if
we can't really find a better way, I'll let this pass.

--
Balbir

2009-04-27 11:37:11

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Mon, 27 Apr 2009 15:43:23 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-27 18:12:59]:
>
> > Works very well under my test as following.
> > prepare a program which does malloc, touch pages repeatedly.
> >
> > # echo 2M > /cgroup/A/memory.limit_in_bytes # set limit to 2M.
> > # echo 0 > /cgroup/A/tasks. # add shell to the group.
> >
> > while true; do
> > malloc_and_touch 1M & # run malloc and touch program.
> > malloc_and_touch 1M &
> > malloc_and_touch 1M &
> > sleep 3
> > pkill malloc_and_touch # kill them
> > done
> >
> > Then, you can see memory.memsw.usage_in_bytes increase gradually and exceeds 3M bytes.
> > This means account for swp_entry is not reclaimed at kill -> exit-> zap_pte()
> > because of race with swap-ops and zap_pte() under memcg.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Because free_swap_and_cache() function is called under spinlocks,
> > it can't sleep and use trylock_page() instead of lock_page().
> > By this, swp_entry which is not used after zap_xx can exists as
> > SwapCache, which will be never used.
> > This kind of SwapCache is reclaimed by global LRU when it's found
> > at LRU rotation. Typical case is following.
> >
>
> The changelog is not clear, this is the typical case for?
>
Okey, let me summarise the problem.

First of all, what I think is problematic is "!PageCgroupUsed
swap cache without the owner process".
Those swap caches cannot be reclaimed by memcg's reclaim
because they are not on memcg's LRU(!PageCgroupUsed pages are not
linked to memcg's LRU).
Moreover, the owner prcess has already gone, only global LRU scanning
can free those swap caches.

Those swap caches causes some problems like:
(1) pressure the memsw.usage(only when MEM_RES_CTLR_SWAP).
(2) make struct mem_cgroup unfreeable even after rmdir, because
we call mem_cgroup_get() when a page is swaped out(only when MEM_RES_CTLR_SWAP).
(3) pressure the usage of swap entry.

Those swap caches can be created in paths like:

Type-1) race between exit and swap-in path
Assume processA is exiting and pte has swap entry of swaped out page.
And processB is trying to swap in the entry by readahead.
This entry holds memsw.usage and refcnt to struct mem_cgroup.

Type-1.1)
processA | processB
-------------------------------------+-------------------------------------
(free_swap_and_cache()) | (read_swap_cache_async())
| swap_duplicate()
| __set_page_locked()
| add_to_swap_cache()
swap_entry_free() == 1 |
find_get_page() -> found |
try_lock_page() -> fail & return |
| lru_cache_add_anon()
| doesn't link this page to memcg's
| LRU, because of !PageCgroupUsed.

Type-1.2)
processA | processB
-------------------------------------+-------------------------------------
(free_swap_and_cache()) | (read_swap_cache_async())
| swap_duplicate()
swap_entry_free() == 1 |
find_get_page() -> not found |
& return | __set_page_locked()
| add_to_swap_cache()
| lru_cache_add_anon()
| doesn't link this page to memcg's
| LRU, because of !PageCgroupUsed.

Type-2) race between exit and swap-out path
Assume processA is exiting and pte points to a page(!PageSwapCache).
And processB is trying reclaim the page.

processA | processB
-------------------------------------+-------------------------------------
(page_remove_rmap()) | (shrink_page_list())
mem_cgroup_uncharge_page() |
->uncharged because it's not |
PageSwapCache yet. |
So, both mem/memsw.usage |
are decremented. |
| add_to_swap() -> added to swap cache.

If this page goes thorough without being freed for some reason, this page
doesn't goes back to memcg's LRU because of !PageCgroupUsed.

Type-1 has problem (1)-(3), and type-2 has (3) only.

> > (CPU0 zap_pte) (CPU1 swapin-readahead)
> > zap_pte() swap_duplicate()
> > swap_entry_free()
> > -> nothing to do
> > swap will be read in.
> >
> > (This race window is wider than expected because of readahead)
> >
>
> This should happen when the page is undergoing IO and this page_lock
> is not available. BTW, do we need page_lock to uncharge the page from
> the memory resource controller?
>
This lock is needed for delete_from_swap_cache().

If free_swap_and_cache can hold the lock in this path:

delete_from_swap_cache()
mem_cgroup_uncharge_swapcache()
-> does nothing because of !PageCgroupUsed
swap_free()
mem_cgroup_uncharge_swap()
-> memsw.usage--, mem_cgroup_put()

> > When memory cgroup is used, the global LRU will not be kicked and
> > stale Swap Caches will not be reclaimed. Newly read-in swap cache is
> > not accounted and not added to memcg's LRU until it's mapped.
>
> ^^^^^^^ I thought it was accounted for but not on LRU
>
Newly allocated pages are accounted before added to LRU,
but that's not true in swap-in path.
We remove the page from LRU once and put it back again to
add it to the proper memcg's LRU at commit_charge_swapin().

> > So, memcg itself cant reclaim it but swp_entry is freed untila
> ^ not?
> > global LRU finds it.
> >
> > This is problematic because memcg's swap entry accounting is leaked
> > memcg can't know it. To catch this stale SwapCache, we have to chase it
> > and check the swap is alive or not again.
> >
> > For chasing all swap entry, we need amount of memory but we don't
> > have enough space and it seems overkill. But, because stale-swap-cache
> > can be short-lived if we free it in proper way, we can check them
> > and sweep them out in lazy way with (small) static size buffer.
> >
> > This patch adds a function to chase stale swap cache and reclaim it.
> > When zap_xxx fails to remove swap ent, it will be recoreded into buffer
> > and memcg's sweep routine will reclaim it later.
> > No sleep, no memory allocation under free_swap_and_cache().
> >
> > This patch also adds stale-swap-cache-congestion logic and try to avoid to
> > have too much stale swap caches at once.
> >
> > Implementation is naive but maybe the cost meets trade-off.
> >
>
> To be honest, I don't like the code complexity added, that is why I
> want to explore more before agreeing to add an entire GC. We could
> consider using pagevecs, but we might not need some of the members
> like cold. I know you and Daisuke have worked hard on this problem, if
> we can't really find a better way, I'll let this pass.
>
I don't care the method as long as this problem can be solved.
But I think this is the most simple way among what have
been proposed so far :)

Thanks,
Daisuke Nishimura.

2009-04-27 12:10:18

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

> Index: mmotm-2.6.30-Apr24/mm/vmscan.c
> ===================================================================
> --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
> +++ mmotm-2.6.30-Apr24/mm/vmscan.c
> @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
> if (PageAnon(page) && !PageSwapCache(page)) {
> if (!(sc->gfp_mask & __GFP_IO))
> goto keep_locked;
> + /* avoid making more stale swap caches */
> + if (memcg_stale_swap_congestion())
> + goto keep_locked;
> if (!add_to_swap(page))
> goto activate_locked;
> may_enter_fs = 1;
>
Well, as I mentioned before(http://marc.info/?l=linux-kernel&m=124066623510867&w=2),
this cannot avoid type-2(set !PageCgroupUsed by the owner process via
page_remove_rmap()->mem_cgroup_uncharge_page() before being added to swap cache).
If these swap caches go through shrink_page_list() without beeing freed
for some reason, these swap caches doesn't go back to memcg's LRU.

Type-2 doesn't pressure memsw.usage, but you can see it by plotting
"grep SwapCached /proc/meminfo".

And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
This means less possibility to reclaim pages.

Do you dislike the patch I attached in the above mail ?

If not, please merge it(I tested your prvious version with some fixes and
my patch, and it worked well). Or shall I send is as a separate patch
to fix type-2 after your patch(yours looks good to me for type-1)?
(to tell the truth, I want reuse memcg_free_unused_swapcache() in another patch)

Thanks,
Daisuke Nishimura.

2009-04-27 19:17:45

by Balbir Singh

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Mon, Apr 27, 2009 at 5:05 PM, Daisuke Nishimura
<[email protected]> wrote:
> On Mon, 27 Apr 2009 15:43:23 +0530
> Balbir Singh <[email protected]> wrote:
>
>> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-27 18:12:59]:
>>
>> > Works very well under my test as following.
>> > ? prepare a program which does malloc, touch pages repeatedly.
>> >
>> > ? # echo 2M > /cgroup/A/memory.limit_in_bytes ?# set limit to 2M.
>> > ? # echo 0 > /cgroup/A/tasks. ? ? ? ? ? ? ? ? ?# add shell to the group.
>> >
>> > ? while true; do
>> > ? ? malloc_and_touch 1M & ? ? ? ? ? ? ? ? ? ? ? # run malloc and touch program.
>> > ? ? malloc_and_touch 1M &
>> > ? ? malloc_and_touch 1M &
>> > ? ? sleep 3
>> > ? ? pkill malloc_and_touch ? ? ? ? ? ? ? ? ? ? ?# kill them
>> > ? done
>> >
>> > Then, you can see memory.memsw.usage_in_bytes increase gradually and exceeds 3M bytes.
>> > This means account for swp_entry is not reclaimed at kill -> exit-> zap_pte()
>> > because of race with swap-ops and zap_pte() under memcg.
>> >
>> > ==
>> > From: KAMEZAWA Hiroyuki <[email protected]>
>> >
>> > Because free_swap_and_cache() function is called under spinlocks,
>> > it can't sleep and use trylock_page() instead of lock_page().
>> > By this, swp_entry which is not used after zap_xx can exists as
>> > SwapCache, which will be never used.
>> > This kind of SwapCache is reclaimed by global LRU when it's found
>> > at LRU rotation. Typical case is following.
>> >
>>
>> The changelog is not clear, this is the typical case for?
>>
> Okey, let me summarise the problem.
>
> First of all, what I think is problematic is "!PageCgroupUsed
> swap cache without the owner process".
> Those swap caches cannot be reclaimed by memcg's reclaim
> because they are not on memcg's LRU(!PageCgroupUsed pages are not
> linked to memcg's LRU).
> Moreover, the owner prcess has already gone, only global LRU scanning
> can free those swap caches.
>
> Those swap caches causes some problems like:
> (1) pressure the memsw.usage(only when MEM_RES_CTLR_SWAP).
> (2) make struct mem_cgroup unfreeable even after rmdir, because
> ? ?we call mem_cgroup_get() when a page is swaped out(only when MEM_RES_CTLR_SWAP).
> (3) pressure the usage of swap entry.
>
> Those swap caches can be created in paths like:
>
> Type-1) race between exit and swap-in path
> ?Assume processA is exiting and pte has swap entry of swaped out page.
> ?And processB is trying to swap in the entry by readahead.
> ?This entry holds memsw.usage and refcnt to struct mem_cgroup.
>
> Type-1.1)
> ? ? ? ? ? ?processA ? ? ? ? ? ? ? ? ? | ? ? ? ? ? processB
> ?-------------------------------------+-------------------------------------
> ? ?(free_swap_and_cache()) ? ? ? ? ? ?| ?(read_swap_cache_async())
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?swap_duplicate()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?__set_page_locked()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?add_to_swap_cache()
> ? ? ?swap_entry_free() == 1 ? ? ? ? ? |
> ? ? ?find_get_page() -> found ? ? ? ? |
> ? ? ?try_lock_page() -> fail & return |
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?lru_cache_add_anon()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ? ?doesn't link this page to memcg's
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ? ?LRU, because of !PageCgroupUsed.
>
> Type-1.2)
> ? ? ? ? ? ?processA ? ? ? ? ? ? ? ? ? | ? ? ? ? ? processB
> ?-------------------------------------+-------------------------------------
> ? ?(free_swap_and_cache()) ? ? ? ? ? ?| ?(read_swap_cache_async())
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?swap_duplicate()
> ? ? ?swap_entry_free() == 1 ? ? ? ? ? |
> ? ? ?find_get_page() -> not found ? ? |
> ? ? ? ? ? ? ? ? ? ? ? ? & return ? ? ?| ? ?__set_page_locked()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?add_to_swap_cache()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?lru_cache_add_anon()
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ? ?doesn't link this page to memcg's
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ? ?LRU, because of !PageCgroupUsed.
>
> Type-2) race between exit and swap-out path
> ?Assume processA is exiting and pte points to a page(!PageSwapCache).
> ?And processB is trying reclaim the page.
>
> ? ? ? ? ? ?processA ? ? ? ? ? ? ? ? ? | ? ? ? ? ? processB
> ?-------------------------------------+-------------------------------------
> ? ?(page_remove_rmap()) ? ? ? ? ? ? ? | ?(shrink_page_list())
> ? ? ? mem_cgroup_uncharge_page() ? ? ?|
> ? ? ? ? ?->uncharged because it's not |
> ? ? ? ? ? ?PageSwapCache yet. ? ? ? ? |
> ? ? ? ? ? ?So, both mem/memsw.usage ? |
> ? ? ? ? ? ?are decremented. ? ? ? ? ? |
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? | ? ?add_to_swap() -> added to swap cache.
>
> ?If this page goes thorough without being freed for some reason, this page
> ?doesn't goes back to memcg's LRU because of !PageCgroupUsed.

Thanks for the detailed explanation of the possible race conditions. I
am beginning to wonder why we don't have any hooks in add_to_swap.*.
for charging a page. If the page is already charged and if it is a
context issue (charging it to the right cgroup) that is already
handled from what I see. Won't that help us solve the !PageCgroupUsed
issue?

Balbir

2009-04-27 23:59:37

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Tue, 28 Apr 2009 00:47:31 +0530
Balbir Singh <[email protected]> wrote:

> Thanks for the detailed explanation of the possible race conditions. I
> am beginning to wonder why we don't have any hooks in add_to_swap.*.
> for charging a page. If the page is already charged and if it is a
> context issue (charging it to the right cgroup) that is already
> handled from what I see. Won't that help us solve the !PageCgroupUsed
> issue?
>

For adding hook to add_to_swap_cache, we need to know which cgroup the swap cache
should be charged. Then, we have to remove CONFIG_CGROUP_MEM_RES_CTRL_SWAP_EXT
and enable memsw control always.

When using swap_cgroup, we'll know which cgroup the new swap cache should be charged.
Then, the new page readed in will be charged to recorded cgroup in swap_cgroup.
One bad thing of this method is a cgroup which swap_cgroup point to is different from
a cgroup which the task calls do_swap_fault(). This means that a page-fault by a
task can cause memory-reclaim under another cgroup and moreover, OOM.
I don't think it's sane behavior. So, current design of swap accounting waits until the
page is mapped.

Thanks,
-Kame

2009-04-28 00:20:46

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Mon, 27 Apr 2009 21:08:56 +0900
Daisuke Nishimura <[email protected]> wrote:

> > Index: mmotm-2.6.30-Apr24/mm/vmscan.c
> > ===================================================================
> > --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
> > +++ mmotm-2.6.30-Apr24/mm/vmscan.c
> > @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
> > if (PageAnon(page) && !PageSwapCache(page)) {
> > if (!(sc->gfp_mask & __GFP_IO))
> > goto keep_locked;
> > + /* avoid making more stale swap caches */
> > + if (memcg_stale_swap_congestion())
> > + goto keep_locked;
> > if (!add_to_swap(page))
> > goto activate_locked;
> > may_enter_fs = 1;
> >
> Well, as I mentioned before(http://marc.info/?l=linux-kernel&m=124066623510867&w=2),
> this cannot avoid type-2(set !PageCgroupUsed by the owner process via
> page_remove_rmap()->mem_cgroup_uncharge_page() before being added to swap cache).
> If these swap caches go through shrink_page_list() without beeing freed
> for some reason, these swap caches doesn't go back to memcg's LRU.
>
> Type-2 doesn't pressure memsw.usage, but you can see it by plotting
> "grep SwapCached /proc/meminfo".
>
> And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
> This means less possibility to reclaim pages.
>
Hmm. maybe adding congestion_wait() ?

> Do you dislike the patch I attached in the above mail ?
>
I doubt whether the patch covers all type-2 case.

> If not, please merge it(I tested your prvious version with some fixes and
> my patch, and it worked well). Or shall I send is as a separate patch
> to fix type-2 after your patch(yours looks good to me for type-1)?
> (to tell the truth, I want reuse memcg_free_unused_swapcache() in another patch)
>
>
I'll consider again and post v3.
But I'll go into a series of holidays, so, may not come back until May/6.

Thanks,
-Kame

> Thanks,
> Daisuke Nishimura.
>

2009-04-28 00:43:20

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

2009-04-28 01:14:37

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

> On Mon, 27 Apr 2009 21:08:56 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
>> > Index: mmotm-2.6.30-Apr24/mm/vmscan.c
>> > ===================================================================
>> > --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
>> > +++ mmotm-2.6.30-Apr24/mm/vmscan.c
>> > @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
>> > if (PageAnon(page) && !PageSwapCache(page)) {
>> > if (!(sc->gfp_mask & __GFP_IO))
>> > goto keep_locked;
>> > + /* avoid making more stale swap caches */
>> > + if (memcg_stale_swap_congestion())
>> > + goto keep_locked;
>> > if (!add_to_swap(page))
>> > goto activate_locked;
>> > may_enter_fs = 1;
>> >
>> Well, as I mentioned before(http://marc.info/?l=linux-kernel&m=124066623510867&w=2),
>> this cannot avoid type-2(set !PageCgroupUsed by the owner process via
>> page_remove_rmap()->mem_cgroup_uncharge_page() before being added to swap cache).
>> If these swap caches go through shrink_page_list() without beeing freed
>> for some reason, these swap caches doesn't go back to memcg's LRU.
>>
>> Type-2 doesn't pressure memsw.usage, but you can see it by plotting
>> "grep SwapCached /proc/meminfo".
>>
>> And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
>> This means less possibility to reclaim pages.
>>
> Hmm. maybe adding congestion_wait() ?
>
I don't think no hook before add_to_swap() is needed.

>> Do you dislike the patch I attached in the above mail ?
>>
> I doubt whether the patch covers all type-2 case.
>
hmm, I didn't see any leak anymore when I tested the patch.

But because of machine time limit, I could only test for about 3 hours.
(I had seen some leak at that point before applying my patch)
I'll test for longer time if possible.

>> If not, please merge it(I tested your prvious version with some fixes and
>> my patch, and it worked well). Or shall I send is as a separate patch
>> to fix type-2 after your patch(yours looks good to me for type-1)?
>> (to tell the truth, I want reuse memcg_free_unused_swapcache() in another patch)
>>
>>
> I'll consider again and post v3.
> But I'll go into a series of holidays, so, may not come back until May/6.
>
It's the same for me :)

Thanks,
Daisuke Nishimura.

2009-04-28 01:21:17

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Tue, 28 Apr 2009 10:09:30 +0900
[email protected] wrote:

> > On Mon, 27 Apr 2009 21:08:56 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> >> > Index: mmotm-2.6.30-Apr24/mm/vmscan.c
> >> > ===================================================================
> >> > --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
> >> > +++ mmotm-2.6.30-Apr24/mm/vmscan.c
> >> > @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
> >> > if (PageAnon(page) && !PageSwapCache(page)) {
> >> > if (!(sc->gfp_mask & __GFP_IO))
> >> > goto keep_locked;
> >> > + /* avoid making more stale swap caches */
> >> > + if (memcg_stale_swap_congestion())
> >> > + goto keep_locked;
> >> > if (!add_to_swap(page))
> >> > goto activate_locked;
> >> > may_enter_fs = 1;
> >> >
> >> Well, as I mentioned before(http://marc.info/?l=linux-kernel&m=124066623510867&w=2),
> >> this cannot avoid type-2(set !PageCgroupUsed by the owner process via
> >> page_remove_rmap()->mem_cgroup_uncharge_page() before being added to swap cache).
> >> If these swap caches go through shrink_page_list() without beeing freed
> >> for some reason, these swap caches doesn't go back to memcg's LRU.
> >>
> >> Type-2 doesn't pressure memsw.usage, but you can see it by plotting
> >> "grep SwapCached /proc/meminfo".
> >>
> >> And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
> >> This means less possibility to reclaim pages.
> >>
> > Hmm. maybe adding congestion_wait() ?
> >
> I don't think no hook before add_to_swap() is needed.
>
> >> Do you dislike the patch I attached in the above mail ?
> >>
> > I doubt whether the patch covers all type-2 case.
> >
> hmm, I didn't see any leak anymore when I tested the patch.
>

At first, your patch
==
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
- /* avoid making more stale swap caches */
- if (memcg_stale_swap_congestion())
- goto keep_locked;
if (!add_to_swap(page))
goto activate_locked;
+ /*
+ * The owner process might have uncharged the page
+ * (by page_remove_rmap()) before it has been added
+ * to swap cache.
+ * Check it here to avoid making it stale.
+ */
+ if (memcg_free_unused_swapcache(page))
+ goto keep_locked;
may_enter_fs = 1;
}
==
Should be
==

if (PageAnon(page) && !PageSwapCache(page)) {
... // don't modify here
}
if (PageAnon(page) && PageSwapCache(page) && !page_mapped(page)) {
if (try_to_free_page(page)) // or memcg_free_unused_swapcache()
goto free_it;
}
==
I think.

And we need hook to free_swap_and_cache() for handling PageWriteback() case.

> But because of machine time limit, I could only test for about 3 hours.
> (I had seen some leak at that point before applying my patch)
> I'll test for longer time if possible.
>
Sigh, my work time is also limited for these months ;(

> > I'll consider again and post v3.
> > But I'll go into a series of holidays, so, may not come back until May/6.
> >
> It's the same for me :)
>
Enjoy good holidays :)

Thanks,
-Kame

2009-04-28 02:41:14

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

> On Tue, 28 Apr 2009 10:09:30 +0900
> [email protected] wrote:
>
>> > On Mon, 27 Apr 2009 21:08:56 +0900
>> > Daisuke Nishimura <[email protected]> wrote:
>> >
>> >> > Index: mmotm-2.6.30-Apr24/mm/vmscan.c
>> >> > ===================================================================
>> >> > --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
>> >> > +++ mmotm-2.6.30-Apr24/mm/vmscan.c
>> >> > @@ -661,6 +661,9 @@ static unsigned long shrink_page_list(st
>> >> > if (PageAnon(page) && !PageSwapCache(page)) {
>> >> > if (!(sc->gfp_mask & __GFP_IO))
>> >> > goto keep_locked;
>> >> > + /* avoid making more stale swap caches */
>> >> > + if (memcg_stale_swap_congestion())
>> >> > + goto keep_locked;
>> >> > if (!add_to_swap(page))
>> >> > goto activate_locked;
>> >> > may_enter_fs = 1;
>> >> >
>> >> Well, as I mentioned before(http://marc.info/?l=linux-kernel&m=124066623510867&w=2),
>> >> this cannot avoid type-2(set !PageCgroupUsed by the owner process via
>> >> page_remove_rmap()->mem_cgroup_uncharge_page() before being added to swap cache).
>> >> If these swap caches go through shrink_page_list() without beeing freed
>> >> for some reason, these swap caches doesn't go back to memcg's LRU.
>> >>
>> >> Type-2 doesn't pressure memsw.usage, but you can see it by plotting
>> >> "grep SwapCached /proc/meminfo".
>> >>
>> >> And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
>> >> This means less possibility to reclaim pages.
>> >>
>> > Hmm. maybe adding congestion_wait() ?
>> >
>> I don't think no hook before add_to_swap() is needed.
>>
>> >> Do you dislike the patch I attached in the above mail ?
>> >>
>> > I doubt whether the patch covers all type-2 case.
>> >
>> hmm, I didn't see any leak anymore when I tested the patch.
>>
>
> At first, your patch
> ==
> if (PageAnon(page) && !PageSwapCache(page)) {
> if (!(sc->gfp_mask & __GFP_IO))
> goto keep_locked;
> - /* avoid making more stale swap caches */
> - if (memcg_stale_swap_congestion())
> - goto keep_locked;
> if (!add_to_swap(page))
> goto activate_locked;
> + /*
> + * The owner process might have uncharged the page
> + * (by page_remove_rmap()) before it has been added
> + * to swap cache.
> + * Check it here to avoid making it stale.
> + */
> + if (memcg_free_unused_swapcache(page))
> + goto keep_locked;
> may_enter_fs = 1;
> }
> ==
> Should be
> ==
>
> if (PageAnon(page) && !PageSwapCache(page)) {
> ... // don't modify here
> }
> if (PageAnon(page) && PageSwapCache(page) && !page_mapped(page)) {
> if (try_to_free_page(page)) // or memcg_free_unused_swapcache()
> goto free_it;
> }
> ==
> I think.
>
It may work too.

But if the page is on swap cache already at the point of page_remove_rmap()
-> mem_cgroup_uncharge_page, the page is not uncharged.
So, it can be freed in memcg's LRU scanning in the long run by
shrink_page_list()->pageout()->swap_writepage()->try_to_free_swap().

I added the hook there just because I wanted to clarify what the
problematic case is.

And I don't think "goto free_it" is good.
It calls free_hot_cold_page(), but some process (like swapoff) might
have got the swap cache already and be waiting for the lock of the page.

> And we need hook to free_swap_and_cache() for handling PageWriteback() case.
>
Ah, You're right.

Thanks,
Daisuke Nishimura.

2009-04-28 03:52:24

by Daisuke Nishimura

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

> >> And I don't think it's a good idea to add memcg_stale_swap_congestion() here.
> >> This means less possibility to reclaim pages.
> >>
> > Hmm. maybe adding congestion_wait() ?
> >
> I don't think no hook before add_to_swap() is needed.
>
s/don't//

Sorry for my poor English.

Daisuke Nishimura.

2009-04-28 22:21:29

by Balbir Singh

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

* KAMEZAWA Hiroyuki <[email protected]> [2009-04-28 08:57:53]:

> On Tue, 28 Apr 2009 00:47:31 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Thanks for the detailed explanation of the possible race conditions. I
> > am beginning to wonder why we don't have any hooks in add_to_swap.*.
> > for charging a page. If the page is already charged and if it is a
> > context issue (charging it to the right cgroup) that is already
> > handled from what I see. Won't that help us solve the !PageCgroupUsed
> > issue?
> >
>
> For adding hook to add_to_swap_cache, we need to know which cgroup the swap cache
> should be charged. Then, we have to remove CONFIG_CGROUP_MEM_RES_CTRL_SWAP_EXT
> and enable memsw control always.
>
> When using swap_cgroup, we'll know which cgroup the new swap cache should be charged.
> Then, the new page readed in will be charged to recorded cgroup in swap_cgroup.
> One bad thing of this method is a cgroup which swap_cgroup point to is different from
> a cgroup which the task calls do_swap_fault(). This means that a page-fault by a
> task can cause memory-reclaim under another cgroup and moreover, OOM.
> I don't think it's sane behavior. So, current design of swap accounting waits until the
> page is mapped.
>

I know (that is why we removed the hooks from the original memcg at
some point). Why can't we mark the page here as swap pending to be
mapped, so that we don't lose them. As far as OOM is concerned, I
think they'll get relocated again when they are mapped (as per the
current implementation), the ones that don't are stale and can be
easily reclaimed.

--
Balbir

2009-04-30 00:04:57

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: [PATCH] fix leak of swap accounting as stale swap cache under memcg

On Wed, 29 Apr 2009 03:16:06 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2009-04-28 08:57:53]:
>
> > On Tue, 28 Apr 2009 00:47:31 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > Thanks for the detailed explanation of the possible race conditions. I
> > > am beginning to wonder why we don't have any hooks in add_to_swap.*.
> > > for charging a page. If the page is already charged and if it is a
> > > context issue (charging it to the right cgroup) that is already
> > > handled from what I see. Won't that help us solve the !PageCgroupUsed
> > > issue?
> > >
> >
> > For adding hook to add_to_swap_cache, we need to know which cgroup the swap cache
> > should be charged. Then, we have to remove CONFIG_CGROUP_MEM_RES_CTRL_SWAP_EXT
> > and enable memsw control always.
> >
> > When using swap_cgroup, we'll know which cgroup the new swap cache should be charged.
> > Then, the new page readed in will be charged to recorded cgroup in swap_cgroup.
> > One bad thing of this method is a cgroup which swap_cgroup point to is different from
> > a cgroup which the task calls do_swap_fault(). This means that a page-fault by a
> > task can cause memory-reclaim under another cgroup and moreover, OOM.
> > I don't think it's sane behavior. So, current design of swap accounting waits until the
> > page is mapped.
> >
>
> I know (that is why we removed the hooks from the original memcg at
> some point). Why can't we mark the page here as swap pending to be
> mapped, so that we don't lose them. As far as OOM is concerned, I
> think they'll get relocated again when they are mapped (as per the
> current implementation), the ones that don't are stale and can be
> easily reclaimed.

My point is "we need a help of global LRU".
To implement softlimit, we *have to* fix this without global LRU's help.
I have much more simple patch. pls see it.

Thanks,
-Kame