2013-09-23 08:24:26

by Weijie Yang

Subject: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Consider the following scenario:
thread 0: reclaim entry x (get refcount, but do not yet call zswap_get_swap_cache_page)
thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
when it finishes, entry x and its zbud are not freed, as the refcount != 0;
now swap_map[x] = 0
thread 0: now call zswap_get_swap_cache_page;
swapcache_prepare returns -ENOENT because entry x is not in use any more,
zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM, and
zswap_writeback_entry does nothing except put the refcount
Now the memory of zswap_entry x and its zpage is leaked.

Modify:
- check the refcount in the fail path and free the memory if it is no longer referenced.
- use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail path
can be caused not only by nomem but also by invalidate.

Signed-off-by: Weijie Yang <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected]
Acked-by: Seth Jennings <[email protected]>
---
mm/zswap.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index cbd9578..1be7b90 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
enum zswap_get_swap_ret {
ZSWAP_SWAPCACHE_NEW,
ZSWAP_SWAPCACHE_EXIST,
- ZSWAP_SWAPCACHE_NOMEM
+ ZSWAP_SWAPCACHE_FAIL,
};

/*
@@ -401,9 +401,9 @@ enum zswap_get_swap_ret {
* added to the swap cache, and returned in retpage.
*
* If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
*/
static int zswap_get_swap_cache_page(swp_entry_t entry,
struct page **retpage)
@@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
if (new_page)
page_cache_release(new_page);
if (!found_page)
- return ZSWAP_SWAPCACHE_NOMEM;
+ return ZSWAP_SWAPCACHE_FAIL;
*retpage = found_page;
return ZSWAP_SWAPCACHE_EXIST;
}
@@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)

/* try to allocate swap cache page */
switch (zswap_get_swap_cache_page(swpentry, &page)) {
- case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+ case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
ret = -ENOMEM;
goto fail;

- case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+ case ZSWAP_SWAPCACHE_EXIST:
/* page is already in the swap cache, ignore for now */
page_cache_release(page);
ret = -EEXIST;
@@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)

fail:
spin_lock(&tree->lock);
- zswap_entry_put(entry);
+ refcount = zswap_entry_put(entry);
+ if (refcount <= 0) {
+ /* invalidate happened, consider writeback as success */
+ zswap_free_entry(tree, entry);
+ ret = 0;
+ }
spin_unlock(&tree->lock);
return ret;
}
--
1.7.10.4


2013-09-24 01:02:45

by Minchan Kim

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> Consider the following scenario:
> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
> finished, entry x and its zbud is not freed as its refcount != 0
> now, the swap_map[x] = 0
> thread 0: now call zswap_get_swap_cache_page
> swapcache_prepare return -ENOENT because entry x is not used any more
> zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
> zswap_writeback_entry do nothing except put refcount
> Now, the memory of zswap_entry x and its zpage leak.
>
> Modify:
> - check the refcount in fail path, free memory if it is not referenced.

Hmm, I don't like this because the zswap refcount routine is already a mess to me.
I'm not sure why it was designed this way from the beginning. I think we should fix that first.

1. zswap_rb_search could include the zswap_entry_get semantic if it finds an entry in
the tree. Of course, we should rename it to find_get_zswap_entry, like find_get_page.
2. zswap_entry_put could hide the resource-freeing function, zswap_free_entry, so that
every caller can easily use the following pattern:

find_get_zswap_entry
...
...
zswap_entry_put

Of course, zswap_entry_put has to check whether the entry is still in the tree,
so if someone has already removed it from the tree, it avoids a double remove.

One concern I can think of is that this approach extends the critical section,
but I think that would be no problem because the real bottleneck is the [de]compress
functions. If it were really a problem, we could mitigate it by moving
unnecessary work out of zswap_free_entry, though that seems rather
over-engineered.
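
For concreteness, a caller such as the load path could then look roughly like
this (just an illustrative sketch of the pattern; the final names and
signatures are whatever the rename ends up being):

	spin_lock(&tree->lock);
	entry = find_get_zswap_entry(&tree->rbroot, offset);	/* search + get */
	if (!entry) {
		/* entry was written back or invalidated */
		spin_unlock(&tree->lock);
		return -1;
	}
	spin_unlock(&tree->lock);

	/* ... use the entry, e.g. decompress ... */

	spin_lock(&tree->lock);
	zswap_entry_put(tree, entry);	/* drops the ref; frees the entry if it was the last one */
	spin_unlock(&tree->lock);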

> - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
> can be not only caused by nomem but also by invalidate.
>
> Signed-off-by: Weijie Yang <[email protected]>
> Reviewed-by: Bob Liu <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: [email protected]
> Acked-by: Seth Jennings <[email protected]>
> ---
> mm/zswap.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbd9578..1be7b90 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> enum zswap_get_swap_ret {
> ZSWAP_SWAPCACHE_NEW,
> ZSWAP_SWAPCACHE_EXIST,
> - ZSWAP_SWAPCACHE_NOMEM
> + ZSWAP_SWAPCACHE_FAIL,
> };
>
> /*
> @@ -401,9 +401,9 @@ enum zswap_get_swap_ret {
> * added to the swap cache, and returned in retpage.
> *
> * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
> */
> static int zswap_get_swap_cache_page(swp_entry_t entry,
> struct page **retpage)
> @@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
> if (new_page)
> page_cache_release(new_page);
> if (!found_page)
> - return ZSWAP_SWAPCACHE_NOMEM;
> + return ZSWAP_SWAPCACHE_FAIL;
> *retpage = found_page;
> return ZSWAP_SWAPCACHE_EXIST;
> }
> @@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>
> /* try to allocate swap cache page */
> switch (zswap_get_swap_cache_page(swpentry, &page)) {
> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
> ret = -ENOMEM;
> goto fail;
>
> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> + case ZSWAP_SWAPCACHE_EXIST:
> /* page is already in the swap cache, ignore for now */
> page_cache_release(page);
> ret = -EEXIST;
> @@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>
> fail:
> spin_lock(&tree->lock);
> - zswap_entry_put(entry);
> + refcount = zswap_entry_put(entry);
> + if (refcount <= 0) {
> + /* invalidate happened, consider writeback as success */
> + zswap_free_entry(tree, entry);
> + ret = 0;
> + }
> spin_unlock(&tree->lock);
> return ret;
> }
> --
> 1.7.10.4
>
>

--
Kind regards,
Minchan Kim

2013-09-26 03:43:52

by Weijie Yang

Subject: RE: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> >
> > Modify:
> > - check the refcount in fail path, free memory if it is not referenced.
>
> Hmm, I don't like this because zswap refcount routine is already mess for me.
> I'm not sure why it was designed from the beginning. I hope we should fix it first.
>
> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> all of caller can use it easily following pattern.
>
> find_get_zswap_entry
> ...
> ...
> zswap_entry_put
>
> Of course, zswap_entry_put have to check the entry is in the tree or not
> so if someone already removes it from the tree, it should avoid double remove.
>
> One of the concern I can think is that approach extends critical section
> but I think it would be no problem because more bottleneck would be [de]compress
> functions. If it were really problem, we can mitigate a problem with moving
> unnecessary functions out of zswap_free_entry because it seem to be rather
> over-enginnering.

I refactored the zswap refcount routine according to Minchan's idea.
Here is the new patch; any suggestions are welcome.

Seth and Bob, would you please review it again?

mm/zswap.c | 116
++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
1 file changed, 52 insertions(+), 64 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
old mode 100644
new mode 100755
index deda2b6..bd04910
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
if (!entry)
return NULL;
entry->refcount = 1;
+ RB_CLEAR_NODE(&entry->rbnode);
return entry;
}

@@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
}

/* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
+static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
{
- entry->refcount--;
- return entry->refcount;
+ int refcount = --entry->refcount;
+
+ if (refcount <= 0) {
+ if (!RB_EMPTY_NODE(&entry->rbnode)) {
+ rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);
+ }
+
+ zswap_free_entry(tree, entry);
+ }
+
+ return refcount;
}

/*********************************
@@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
return NULL;
}

+static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
+{
+ struct zswap_entry *entry = NULL;
+
+ entry = zswap_rb_search(root, offset);
+ if (entry)
+ zswap_entry_get(entry);
+
+ return entry;
+}
+
/*
* In the case that a entry with the same offset is found, a pointer to
* the existing entry is stored in dupentry and the function returns -EEXIST
@@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
enum zswap_get_swap_ret {
ZSWAP_SWAPCACHE_NEW,
ZSWAP_SWAPCACHE_EXIST,
- ZSWAP_SWAPCACHE_NOMEM
+ ZSWAP_SWAPCACHE_FAIL,
};

/*
@@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
* added to the swap cache, and returned in retpage.
*
* If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
*/
static int zswap_get_swap_cache_page(swp_entry_t entry,
struct page **retpage)
@@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
if (new_page)
page_cache_release(new_page);
if (!found_page)
- return ZSWAP_SWAPCACHE_NOMEM;
+ return ZSWAP_SWAPCACHE_FAIL;
*retpage = found_page;
return ZSWAP_SWAPCACHE_EXIST;
}
@@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)

/* find and ref zswap entry */
spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_entry_find_get(&tree->rbroot, offset);
if (!entry) {
/* entry was invalidated */
spin_unlock(&tree->lock);
return 0;
}
- zswap_entry_get(entry);
spin_unlock(&tree->lock);
BUG_ON(offset != entry->offset);

/* try to allocate swap cache page */
switch (zswap_get_swap_cache_page(swpentry, &page)) {
- case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+ case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
ret = -ENOMEM;
goto fail;

- case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+ case ZSWAP_SWAPCACHE_EXIST:
/* page is already in the swap cache, ignore for now */
page_cache_release(page);
ret = -EEXIST;
@@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
zswap_written_back_pages++;

spin_lock(&tree->lock);
-
/* drop local reference */
- zswap_entry_put(entry);
+ refcount = zswap_entry_put(tree, entry);
/* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
-
- /*
- * There are three possible values for refcount here:
- * (1) refcount is 1, load is in progress, unlink from rbtree,
- * load will free
- * (2) refcount is 0, (normal case) entry is valid,
- * remove from rbtree and free entry
- * (3) refcount is -1, invalidate happened during writeback,
- * free entry
- */
- if (refcount >= 0) {
- /* no invalidate yet, remove from rbtree */
+ if (refcount > 0) {
rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);
+ refcount = zswap_entry_put(tree, entry);
}
spin_unlock(&tree->lock);
- if (refcount <= 0) {
- /* free the entry */
- zswap_free_entry(tree, entry);
- return 0;
- }
- return -EAGAIN;
+
+ goto end;

fail:
spin_lock(&tree->lock);
- zswap_entry_put(entry);
+ refcount = zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);
- return ret;
+
+end:
+ if (refcount <= 0)
+ return 0;
+ else
+ return -EAGAIN;
}

/*********************************
@@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
zswap_duplicate_entry++;
/* remove from rbtree */
rb_erase(&dupentry->rbnode, &tree->rbroot);
- if (!zswap_entry_put(dupentry)) {
- /* free */
- zswap_free_entry(tree, dupentry);
- }
+ RB_CLEAR_NODE(&dupentry->rbnode);
+ zswap_entry_put(tree, dupentry);
}
} while (ret == -EEXIST);
spin_unlock(&tree->lock);
@@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,

/* find */
spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_entry_find_get(&tree->rbroot, offset);
if (!entry) {
/* entry was written back */
spin_unlock(&tree->lock);
return -1;
}
- zswap_entry_get(entry);
spin_unlock(&tree->lock);

/* decompress */
@@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
BUG_ON(ret);

spin_lock(&tree->lock);
- refcount = zswap_entry_put(entry);
- if (likely(refcount)) {
- spin_unlock(&tree->lock);
- return 0;
- }
+ zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);

- /*
- * We don't have to unlink from the rbtree because
- * zswap_writeback_entry() or zswap_frontswap_invalidate page()
- * has already done this for us if we are the last reference.
- */
- /* free */
-
- zswap_free_entry(tree, entry);
-
return 0;
}

@@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)

/* remove from rbtree */
rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);

/* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
+ zswap_entry_put(tree, entry);

spin_unlock(&tree->lock);
-
- if (refcount) {
- /* writeback in progress, writeback will free */
- return;
- }
-
- /* free */
- zswap_free_entry(tree, entry);
}

/* frees all zswap entries for the given swap type */

2013-10-10 19:54:08

by Andrew Morton

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Thu, 26 Sep 2013 11:42:17 +0800 Weijie Yang <[email protected]> wrote:

> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > >
> > > Modify:
> > > - check the refcount in fail path, free memory if it is not referenced.
> >
> > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> >
> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > all of caller can use it easily following pattern.
> >
> > find_get_zswap_entry
> > ...
> > ...
> > zswap_entry_put
> >
> > Of course, zswap_entry_put have to check the entry is in the tree or not
> > so if someone already removes it from the tree, it should avoid double remove.
> >
> > One of the concern I can think is that approach extends critical section
> > but I think it would be no problem because more bottleneck would be [de]compress
> > functions. If it were really problem, we can mitigate a problem with moving
> > unnecessary functions out of zswap_free_entry because it seem to be rather
> > over-enginnering.
>
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
>
> To Seth and Bob, would you please review it again?

Yes, please let's re-review this patch. It is very different from its
predecessor.


From: Weijie Yang <[email protected]>
Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Consider the following scenario:

thread 0: reclaim entry x (get refcount, but not call
zswap_get_swap_cache_page)

thread 1: call zswap_frontswap_invalidate_page to invalidate
entry x. finished, entry x and its zbud is not freed as its
refcount != 0 now, the swap_map[x] = 0

thread 0: now call zswap_get_swap_cache_page swapcache_prepare
return -ENOENT because entry x is not used any more
zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
zswap_writeback_entry do nothing except put refcount

Now, the memory of zswap_entry x and its zpage leak.

Modify:
- check the refcount in fail path, free memory if it is not referenced.
- use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
can be not only caused by nomem but also by invalidate.

Signed-off-by: Weijie Yang <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Cc: Minchan Kim <[email protected]>
Acked-by: Seth Jennings <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/zswap.c | 116 ++++++++++++++++++++++-----------------------------
1 file changed, 52 insertions(+), 64 deletions(-)

diff -puN mm/zswap.c~mm-zswap-bugfix-memory-leak-when-invalidate-and-reclaim-occur-concurrently mm/zswap.c
--- a/mm/zswap.c~mm-zswap-bugfix-memory-leak-when-invalidate-and-reclaim-occur-concurrently
+++ a/mm/zswap.c
@@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_c
if (!entry)
return NULL;
entry->refcount = 1;
+ RB_CLEAR_NODE(&entry->rbnode);
return entry;
}

@@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap
}

/* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
+static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
{
- entry->refcount--;
- return entry->refcount;
+ int refcount = --entry->refcount;
+
+ if (refcount <= 0) {
+ if (!RB_EMPTY_NODE(&entry->rbnode)) {
+ rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);
+ }
+
+ zswap_free_entry(tree, entry);
+ }
+
+ return refcount;
}

/*********************************
@@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_sear
return NULL;
}

+static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
+{
+ struct zswap_entry *entry = NULL;
+
+ entry = zswap_rb_search(root, offset);
+ if (entry)
+ zswap_entry_get(entry);
+
+ return entry;
+}
+
/*
* In the case that a entry with the same offset is found, a pointer to
* the existing entry is stored in dupentry and the function returns -EEXIST
@@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswa
enum zswap_get_swap_ret {
ZSWAP_SWAPCACHE_NEW,
ZSWAP_SWAPCACHE_EXIST,
- ZSWAP_SWAPCACHE_NOMEM
+ ZSWAP_SWAPCACHE_FAIL,
};

/*
@@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
* added to the swap cache, and returned in retpage.
*
* If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
*/
static int zswap_get_swap_cache_page(swp_entry_t entry,
struct page **retpage)
@@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp
if (new_page)
page_cache_release(new_page);
if (!found_page)
- return ZSWAP_SWAPCACHE_NOMEM;
+ return ZSWAP_SWAPCACHE_FAIL;
*retpage = found_page;
return ZSWAP_SWAPCACHE_EXIST;
}
@@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct

/* find and ref zswap entry */
spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_entry_find_get(&tree->rbroot, offset);
if (!entry) {
/* entry was invalidated */
spin_unlock(&tree->lock);
return 0;
}
- zswap_entry_get(entry);
spin_unlock(&tree->lock);
BUG_ON(offset != entry->offset);

/* try to allocate swap cache page */
switch (zswap_get_swap_cache_page(swpentry, &page)) {
- case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+ case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
ret = -ENOMEM;
goto fail;

- case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+ case ZSWAP_SWAPCACHE_EXIST:
/* page is already in the swap cache, ignore for now */
page_cache_release(page);
ret = -EEXIST;
@@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct
zswap_written_back_pages++;

spin_lock(&tree->lock);
-
/* drop local reference */
- zswap_entry_put(entry);
+ refcount = zswap_entry_put(tree, entry);
/* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
-
- /*
- * There are three possible values for refcount here:
- * (1) refcount is 1, load is in progress, unlink from rbtree,
- * load will free
- * (2) refcount is 0, (normal case) entry is valid,
- * remove from rbtree and free entry
- * (3) refcount is -1, invalidate happened during writeback,
- * free entry
- */
- if (refcount >= 0) {
- /* no invalidate yet, remove from rbtree */
+ if (refcount > 0) {
rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);
+ refcount = zswap_entry_put(tree, entry);
}
spin_unlock(&tree->lock);
- if (refcount <= 0) {
- /* free the entry */
- zswap_free_entry(tree, entry);
- return 0;
- }
- return -EAGAIN;
+
+ goto end;

fail:
spin_lock(&tree->lock);
- zswap_entry_put(entry);
+ refcount = zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);
- return ret;
+
+end:
+ if (refcount <= 0)
+ return 0;
+ else
+ return -EAGAIN;
}

/*********************************
@@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigne
zswap_duplicate_entry++;
/* remove from rbtree */
rb_erase(&dupentry->rbnode, &tree->rbroot);
- if (!zswap_entry_put(dupentry)) {
- /* free */
- zswap_free_entry(tree, dupentry);
- }
+ RB_CLEAR_NODE(&dupentry->rbnode);
+ zswap_entry_put(tree, dupentry);
}
} while (ret == -EEXIST);
spin_unlock(&tree->lock);
@@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned

/* find */
spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_entry_find_get(&tree->rbroot, offset);
if (!entry) {
/* entry was written back */
spin_unlock(&tree->lock);
return -1;
}
- zswap_entry_get(entry);
spin_unlock(&tree->lock);

/* decompress */
@@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned
BUG_ON(ret);

spin_lock(&tree->lock);
- refcount = zswap_entry_put(entry);
- if (likely(refcount)) {
- spin_unlock(&tree->lock);
- return 0;
- }
+ zswap_entry_put(tree, entry);
spin_unlock(&tree->lock);

- /*
- * We don't have to unlink from the rbtree because
- * zswap_writeback_entry() or zswap_frontswap_invalidate page()
- * has already done this for us if we are the last reference.
- */
- /* free */
-
- zswap_free_entry(tree, entry);
-
return 0;
}

@@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_p

/* remove from rbtree */
rb_erase(&entry->rbnode, &tree->rbroot);
+ RB_CLEAR_NODE(&entry->rbnode);

/* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
+ zswap_entry_put(tree, entry);

spin_unlock(&tree->lock);
-
- if (refcount) {
- /* writeback in progress, writeback will free */
- return;
- }
-
- /* free */
- zswap_free_entry(tree, entry);
}

/* frees all zswap entries for the given swap type */
_

2013-10-11 07:13:00

by Minchan Kim

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > >
> > > Modify:
> > > - check the refcount in fail path, free memory if it is not referenced.
> >
> > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> >
> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > all of caller can use it easily following pattern.
> >
> > find_get_zswap_entry
> > ...
> > ...
> > zswap_entry_put
> >
> > Of course, zswap_entry_put have to check the entry is in the tree or not
> > so if someone already removes it from the tree, it should avoid double remove.
> >
> > One of the concern I can think is that approach extends critical section
> > but I think it would be no problem because more bottleneck would be [de]compress
> > functions. If it were really problem, we can mitigate a problem with moving
> > unnecessary functions out of zswap_free_entry because it seem to be rather
> > over-enginnering.
>
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
>
> To Seth and Bob, would you please review it again?

Yeah, Seth, Bob. You guys are the right people to review this, because this
scheme was suggested by me, so I'm biased and it wouldn't be fair. ;-)
But anyway, I will review the code itself.

>
> mm/zswap.c | 116
> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
> 1 file changed, 52 insertions(+), 64 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> old mode 100644
> new mode 100755
> index deda2b6..bd04910
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
> if (!entry)
> return NULL;
> entry->refcount = 1;
> + RB_CLEAR_NODE(&entry->rbnode);
> return entry;
> }
>
> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
> }
>
> /* caller must hold the tree lock */
> -static int zswap_entry_put(struct zswap_entry *entry)
> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)

Why should we have a return value? If we really need it, it undermines the
whole point of the get/put semantics, so I'd like to just return void.

Let me see.

> {
> - entry->refcount--;
> - return entry->refcount;
> + int refcount = --entry->refcount;
> +
> + if (refcount <= 0) {

Hmm, I really don't like a negative refcount.
I hope we could do the following instead:

BUG_ON(refcount < 0);
if (refcount == 0) {
...
}



> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
> + rb_erase(&entry->rbnode, &tree->rbroot);
> + RB_CLEAR_NODE(&entry->rbnode);

Minor:
you could add a new function, zswap_rb_del or zswap_rb_remove, which detaches the node
from the rbtree and clears it, since we already have zswap_rb_insert.
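
Together with the void return and the BUG_ON above, the put side might then look
roughly like this (only an illustrative sketch combining the suggestions above;
zswap_rb_remove is just a placeholder name):

	/* caller must hold the tree lock */
	static void zswap_rb_remove(struct rb_root *root, struct zswap_entry *entry)
	{
		if (!RB_EMPTY_NODE(&entry->rbnode)) {
			rb_erase(&entry->rbnode, root);
			RB_CLEAR_NODE(&entry->rbnode);
		}
	}

	/* caller must hold the tree lock */
	static void zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
	{
		int refcount = --entry->refcount;

		BUG_ON(refcount < 0);
		if (refcount == 0) {
			zswap_rb_remove(&tree->rbroot, entry);
			zswap_free_entry(tree, entry);
		}
	}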


> + }
> +
> + zswap_free_entry(tree, entry);
> + }
> +
> + return refcount;
> }
>
> /*********************************
> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> return NULL;
> }
>

Add function description.

> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
> +{
> + struct zswap_entry *entry = NULL;
> +
> + entry = zswap_rb_search(root, offset);
> + if (entry)
> + zswap_entry_get(entry);
> +
> + return entry;
> +}
> +
> /*
> * In the case that a entry with the same offset is found, a pointer to
> * the existing entry is stored in dupentry and the function returns -EEXIST
> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> enum zswap_get_swap_ret {
> ZSWAP_SWAPCACHE_NEW,
> ZSWAP_SWAPCACHE_EXIST,
> - ZSWAP_SWAPCACHE_NOMEM
> + ZSWAP_SWAPCACHE_FAIL,
> };
>
> /*
> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
> * added to the swap cache, and returned in retpage.
> *
> * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
> */
> static int zswap_get_swap_cache_page(swp_entry_t entry,
> struct page **retpage)
> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
> if (new_page)
> page_cache_release(new_page);
> if (!found_page)
> - return ZSWAP_SWAPCACHE_NOMEM;
> + return ZSWAP_SWAPCACHE_FAIL;
> *retpage = found_page;
> return ZSWAP_SWAPCACHE_EXIST;
> }
> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>
> /* find and ref zswap entry */
> spin_lock(&tree->lock);
> - entry = zswap_rb_search(&tree->rbroot, offset);
> + entry = zswap_entry_find_get(&tree->rbroot, offset);
> if (!entry) {
> /* entry was invalidated */
> spin_unlock(&tree->lock);
> return 0;
> }
> - zswap_entry_get(entry);
> spin_unlock(&tree->lock);
> BUG_ON(offset != entry->offset);
>
> /* try to allocate swap cache page */
> switch (zswap_get_swap_cache_page(swpentry, &page)) {
> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
> ret = -ENOMEM;
> goto fail;
>
> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> + case ZSWAP_SWAPCACHE_EXIST:

Why did you remove the comment?

> /* page is already in the swap cache, ignore for now */
> page_cache_release(page);
> ret = -EEXIST;
> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
> zswap_written_back_pages++;
>
> spin_lock(&tree->lock);
> -
> /* drop local reference */
> - zswap_entry_put(entry);
> + refcount = zswap_entry_put(tree, entry);
> /* drop the initial reference from entry creation */
> - refcount = zswap_entry_put(entry);
> -
> - /*
> - * There are three possible values for refcount here:
> - * (1) refcount is 1, load is in progress, unlink from rbtree,
> - * load will free
> - * (2) refcount is 0, (normal case) entry is valid,
> - * remove from rbtree and free entry
> - * (3) refcount is -1, invalidate happened during writeback,
> - * free entry
> - */
> - if (refcount >= 0) {
> - /* no invalidate yet, remove from rbtree */
> + if (refcount > 0) {
> rb_erase(&entry->rbnode, &tree->rbroot);
> + RB_CLEAR_NODE(&entry->rbnode);
> + refcount = zswap_entry_put(tree, entry);

Now I see why you need the return value in zswap_entry_put, but let's reconsider,
because it's really messy to me and it hurts the get/put semantics a lot.
How about this?

spin_lock(&tree->lock);
/* drop local reference */
zswap_entry_put(tree, entry);
/*
 * Here we want to free the entry, but invalidation may have freed it
 * under us already, so we should check again.
 */
if (entry == zswap_rb_search(&tree->rbroot, offset))
	/* Yes, it's still there, so free it */
	zswap_entry_put(tree, entry);

/*
 * Whether it gets freed by invalidation or by writeback doesn't
 * matter. The important thing is that it will be freed, so there
 * is no point in returning -EAGAIN.
 */
spin_unlock(&tree->lock);
return 0;

> }
> spin_unlock(&tree->lock);
> - if (refcount <= 0) {
> - /* free the entry */
> - zswap_free_entry(tree, entry);
> - return 0;
> - }
> - return -EAGAIN;
> +
> + goto end;
>
> fail:
> spin_lock(&tree->lock);
> - zswap_entry_put(entry);
> + refcount = zswap_entry_put(tree, entry);
> spin_unlock(&tree->lock);
> - return ret;
> +
> +end:
> + if (refcount <= 0)
> + return 0;
> + else
> + return -EAGAIN;
> }
>
> /*********************************
> @@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> zswap_duplicate_entry++;
> /* remove from rbtree */
> rb_erase(&dupentry->rbnode, &tree->rbroot);
> - if (!zswap_entry_put(dupentry)) {
> - /* free */
> - zswap_free_entry(tree, dupentry);
> - }
> + RB_CLEAR_NODE(&dupentry->rbnode);
> + zswap_entry_put(tree, dupentry);
> }
> } while (ret == -EEXIST);
> spin_unlock(&tree->lock);
> @@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>
> /* find */
> spin_lock(&tree->lock);
> - entry = zswap_rb_search(&tree->rbroot, offset);
> + entry = zswap_entry_find_get(&tree->rbroot, offset);
> if (!entry) {
> /* entry was written back */
> spin_unlock(&tree->lock);
> return -1;
> }
> - zswap_entry_get(entry);
> spin_unlock(&tree->lock);
>
> /* decompress */
> @@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> BUG_ON(ret);
>
> spin_lock(&tree->lock);
> - refcount = zswap_entry_put(entry);
> - if (likely(refcount)) {
> - spin_unlock(&tree->lock);
> - return 0;
> - }
> + zswap_entry_put(tree, entry);
> spin_unlock(&tree->lock);
>
> - /*
> - * We don't have to unlink from the rbtree because
> - * zswap_writeback_entry() or zswap_frontswap_invalidate page()
> - * has already done this for us if we are the last reference.
> - */
> - /* free */
> -
> - zswap_free_entry(tree, entry);
> -
> return 0;
> }
>
> @@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>
> /* remove from rbtree */
> rb_erase(&entry->rbnode, &tree->rbroot);
> + RB_CLEAR_NODE(&entry->rbnode);
>
> /* drop the initial reference from entry creation */
> - refcount = zswap_entry_put(entry);
> + zswap_entry_put(tree, entry);
>
> spin_unlock(&tree->lock);
> -
> - if (refcount) {
> - /* writeback in progress, writeback will free */
> - return;
> - }
> -
> - /* free */
> - zswap_free_entry(tree, entry);
> }
>
> /* frees all zswap entries for the given swap type */
>

--
Kind regards,
Minchan Kim

2013-10-12 02:32:33

by Bob Liu

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang <[email protected]> wrote:
> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> >
>> > Modify:
>> > - check the refcount in fail path, free memory if it is not referenced.
>>
>> Hmm, I don't like this because zswap refcount routine is already mess for me.
>> I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>
>> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> all of caller can use it easily following pattern.
>>
>> find_get_zswap_entry
>> ...
>> ...
>> zswap_entry_put
>>
>> Of course, zswap_entry_put have to check the entry is in the tree or not
>> so if someone already removes it from the tree, it should avoid double remove.
>>
>> One of the concern I can think is that approach extends critical section
>> but I think it would be no problem because more bottleneck would be [de]compress
>> functions. If it were really problem, we can mitigate a problem with moving
>> unnecessary functions out of zswap_free_entry because it seem to be rather
>> over-enginnering.
>
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
>
> To Seth and Bob, would you please review it again?
>

I have nothing to add to Minchan's review.

Since the code is a bit complex, I'd suggest you split it into two patches:
[1/2]: fix the memory leak
[2/2]: clean up the entry_put

And run some testing.

Thanks,
-Bob

2013-10-12 02:50:05

by Bob Liu

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > >
>> > > Modify:
>> > > - check the refcount in fail path, free memory if it is not referenced.
>> >
>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>> >
>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> > all of caller can use it easily following pattern.
>> >
>> > find_get_zswap_entry
>> > ...
>> > ...
>> > zswap_entry_put
>> >
>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > so if someone already removes it from the tree, it should avoid double remove.
>> >
>> > One of the concern I can think is that approach extends critical section
>> > but I think it would be no problem because more bottleneck would be [de]compress
>> > functions. If it were really problem, we can mitigate a problem with moving
>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.
>
>>
>> mm/zswap.c | 116
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>> 1 file changed, 52 insertions(+), 64 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index deda2b6..bd04910
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>> if (!entry)
>> return NULL;
>> entry->refcount = 1;
>> + RB_CLEAR_NODE(&entry->rbnode);
>> return entry;
>> }
>>
>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>> }
>>
>> /* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>
> Why should we have return value? If we really need it, it mitigates
> get/put semantic's whole point so I'd like to just return void.
>
> Let me see.
>
>> {
>> - entry->refcount--;
>> - return entry->refcount;
>> + int refcount = --entry->refcount;
>> +
>> + if (refcount <= 0) {
>
> Hmm, I don't like minus refcount, really.
> I hope we could do following as
>
> BUG_ON(refcount < 0);
> if (refcount == 0) {
> ...
> }
>
>
>
>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>> + rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>
> Minor,
> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
> from rb tree and clear node because we have already zswap_rb_insert.
>
>
>> + }
>> +
>> + zswap_free_entry(tree, entry);
>> + }
>> +
>> + return refcount;
>> }
>>
>> /*********************************
>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>> return NULL;
>> }
>>
>
> Add function description.
>
>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>> +{
>> + struct zswap_entry *entry = NULL;
>> +
>> + entry = zswap_rb_search(root, offset);
>> + if (entry)
>> + zswap_entry_get(entry);
>> +
>> + return entry;
>> +}
>> +
>> /*
>> * In the case that a entry with the same offset is found, a pointer to
>> * the existing entry is stored in dupentry and the function returns -EEXIST
>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> enum zswap_get_swap_ret {
>> ZSWAP_SWAPCACHE_NEW,
>> ZSWAP_SWAPCACHE_EXIST,
>> - ZSWAP_SWAPCACHE_NOMEM
>> + ZSWAP_SWAPCACHE_FAIL,
>> };
>>
>> /*
>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>> * added to the swap cache, and returned in retpage.
>> *
>> * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>> */
>> static int zswap_get_swap_cache_page(swp_entry_t entry,
>> struct page **retpage)
>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>> if (new_page)
>> page_cache_release(new_page);
>> if (!found_page)
>> - return ZSWAP_SWAPCACHE_NOMEM;
>> + return ZSWAP_SWAPCACHE_FAIL;
>> *retpage = found_page;
>> return ZSWAP_SWAPCACHE_EXIST;
>> }
>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>
>> /* find and ref zswap entry */
>> spin_lock(&tree->lock);
>> - entry = zswap_rb_search(&tree->rbroot, offset);
>> + entry = zswap_entry_find_get(&tree->rbroot, offset);
>> if (!entry) {
>> /* entry was invalidated */
>> spin_unlock(&tree->lock);
>> return 0;
>> }
>> - zswap_entry_get(entry);
>> spin_unlock(&tree->lock);
>> BUG_ON(offset != entry->offset);
>>
>> /* try to allocate swap cache page */
>> switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>> ret = -ENOMEM;
>> goto fail;
>>
>> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> + case ZSWAP_SWAPCACHE_EXIST:
>
> Why did you remove comment?
>
>> /* page is already in the swap cache, ignore for now */
>> page_cache_release(page);
>> ret = -EEXIST;
>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>> zswap_written_back_pages++;
>>
>> spin_lock(&tree->lock);
>> -
>> /* drop local reference */
>> - zswap_entry_put(entry);
>> + refcount = zswap_entry_put(tree, entry);
>> /* drop the initial reference from entry creation */
>> - refcount = zswap_entry_put(entry);
>> -
>> - /*
>> - * There are three possible values for refcount here:
>> - * (1) refcount is 1, load is in progress, unlink from rbtree,
>> - * load will free
>> - * (2) refcount is 0, (normal case) entry is valid,
>> - * remove from rbtree and free entry
>> - * (3) refcount is -1, invalidate happened during writeback,
>> - * free entry
>> - */
>> - if (refcount >= 0) {
>> - /* no invalidate yet, remove from rbtree */
>> + if (refcount > 0) {
>> rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>> + refcount = zswap_entry_put(tree, entry);
>
> Now, I see why you need return in zswap_entry_put but let's consider again
> because it's really mess to me and it hurts get/put semantic a lot so
> How about this?
>
> spin_lock(&tree->lock);
> /* drop local reference */
> zswap_entry_put(tree, entry);
> /*
> * In here, we want to free entry but invalidation may free earlier
> * under us so that we should check it again
> */
> if (entry == zswap_rb_search(&tree->rb_root, offset))

Then where is the place that unlinks the entry from the rbtree if a load was in progress?

And in the following fail path, the return value from zswap_entry_put() is
also used.

> /* Yes, it's stable so we should free it */
> zswap_entry_put(tree, entry);
>
> /*
> * Whether it would be freed by invalidation or writeback, it doesn't
> * matter. Important thing is that it will be freed so there
> * is no point to return -EAGAIN.
> */
> spin_unlock(&tree->lock);
> return 0;
>

--
Regards,
--Bob

2013-10-12 08:41:09

by Weijie Yang

Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > >
>> > > Modify:
>> > > - check the refcount in fail path, free memory if it is not referenced.
>> >
>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>> >
>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> > all of caller can use it easily following pattern.
>> >
>> > find_get_zswap_entry
>> > ...
>> > ...
>> > zswap_entry_put
>> >
>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > so if someone already removes it from the tree, it should avoid double remove.
>> >
>> > One of the concern I can think is that approach extends critical section
>> > but I think it would be no problem because more bottleneck would be [de]compress
>> > functions. If it were really problem, we can mitigate a problem with moving
>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.

Thanks for your careful review and suggestions.

>>
>> mm/zswap.c | 116
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>> 1 file changed, 52 insertions(+), 64 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index deda2b6..bd04910
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>> if (!entry)
>> return NULL;
>> entry->refcount = 1;
>> + RB_CLEAR_NODE(&entry->rbnode);
>> return entry;
>> }
>>
>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>> }
>>
>> /* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>
> Why should we have return value? If we really need it, it mitigates
> get/put semantic's whole point so I'd like to just return void.
>
> Let me see.
>
>> {
>> - entry->refcount--;
>> - return entry->refcount;
>> + int refcount = --entry->refcount;
>> +
>> + if (refcount <= 0) {
>
> Hmm, I don't like minus refcount, really.
> I hope we could do following as

It is not like the common get/put semantics.
As invalidate and reclaim can run concurrently,
this refcount can become negative,
so we have to check the refcount and at the same time handle
whether the entry is still on the tree.
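
For reference, the interleaving that the old writeback code's comment describes
as driving the refcount to -1 looks roughly like this (illustrative only):

	writeback:  entry found, zswap_entry_get()            refcount 1 -> 2
	invalidate: rb_erase() + zswap_entry_put()            refcount 2 -> 1
	writeback:  zswap_entry_put()   (local reference)     refcount 1 -> 0
	writeback:  zswap_entry_put()   (initial reference)   refcount 0 -> -1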

> BUG_ON(refcount < 0);
> if (refcount == 0) {
> ...
> }
>
>
>
>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>> + rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>
> Minor,
> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
> from rb tree and clear node because we have already zswap_rb_insert.

yes.

>
>> + }
>> +
>> + zswap_free_entry(tree, entry);
>> + }
>> +
>> + return refcount;
>> }
>>
>> /*********************************
>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>> return NULL;
>> }
>>
>
> Add function description.

ok.

>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>> +{
>> + struct zswap_entry *entry = NULL;
>> +
>> + entry = zswap_rb_search(root, offset);
>> + if (entry)
>> + zswap_entry_get(entry);
>> +
>> + return entry;
>> +}
>> +
>> /*
>> * In the case that a entry with the same offset is found, a pointer to
>> * the existing entry is stored in dupentry and the function returns -EEXIST
>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> enum zswap_get_swap_ret {
>> ZSWAP_SWAPCACHE_NEW,
>> ZSWAP_SWAPCACHE_EXIST,
>> - ZSWAP_SWAPCACHE_NOMEM
>> + ZSWAP_SWAPCACHE_FAIL,
>> };
>>
>> /*
>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>> * added to the swap cache, and returned in retpage.
>> *
>> * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>> */
>> static int zswap_get_swap_cache_page(swp_entry_t entry,
>> struct page **retpage)
>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>> if (new_page)
>> page_cache_release(new_page);
>> if (!found_page)
>> - return ZSWAP_SWAPCACHE_NOMEM;
>> + return ZSWAP_SWAPCACHE_FAIL;
>> *retpage = found_page;
>> return ZSWAP_SWAPCACHE_EXIST;
>> }
>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>
>> /* find and ref zswap entry */
>> spin_lock(&tree->lock);
>> - entry = zswap_rb_search(&tree->rbroot, offset);
>> + entry = zswap_entry_find_get(&tree->rbroot, offset);
>> if (!entry) {
>> /* entry was invalidated */
>> spin_unlock(&tree->lock);
>> return 0;
>> }
>> - zswap_entry_get(entry);
>> spin_unlock(&tree->lock);
>> BUG_ON(offset != entry->offset);
>>
>> /* try to allocate swap cache page */
>> switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>> ret = -ENOMEM;
>> goto fail;
>>
>> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> + case ZSWAP_SWAPCACHE_EXIST:
>
> Why did you remove comment?

Sorry for that, I intended to move them to zswap_entry_put(), but I forgot.

>> /* page is already in the swap cache, ignore for now */
>> page_cache_release(page);
>> ret = -EEXIST;
>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>> zswap_written_back_pages++;
>>
>> spin_lock(&tree->lock);
>> -
>> /* drop local reference */
>> - zswap_entry_put(entry);
>> + refcount = zswap_entry_put(tree, entry);
>> /* drop the initial reference from entry creation */
>> - refcount = zswap_entry_put(entry);
>> -
>> - /*
>> - * There are three possible values for refcount here:
>> - * (1) refcount is 1, load is in progress, unlink from rbtree,
>> - * load will free
>> - * (2) refcount is 0, (normal case) entry is valid,
>> - * remove from rbtree and free entry
>> - * (3) refcount is -1, invalidate happened during writeback,
>> - * free entry
>> - */
>> - if (refcount >= 0) {
>> - /* no invalidate yet, remove from rbtree */
>> + if (refcount > 0) {
>> rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>> + refcount = zswap_entry_put(tree, entry);
>
> Now, I see why you need return in zswap_entry_put but let's consider again
> because it's really mess to me and it hurts get/put semantic a lot so
> How about this?

That is a better way to avoid a negative refcount!

> spin_lock(&tree->lock);
> /* drop local reference */
> zswap_entry_put(tree, entry);
> /*
> * In here, we want to free entry but invalidation may free earlier
> * under us so that we should check it again
> */
> if (entry == zswap_rb_search(&tree->rb_root, offset))
> /* Yes, it's stable so we should free it */
> zswap_entry_put(tree, entry);
>
> /*
> * Whether it would be freed by invalidation or writeback, it doesn't
> * matter. Important thing is that it will be freed so there
> * is no point to return -EAGAIN.
> */
> spin_unlock(&tree->lock);
> return 0;
>
>> }
>> spin_unlock(&tree->lock);
>> - if (refcount <= 0) {
>> - /* free the entry */
>> - zswap_free_entry(tree, entry);
>> - return 0;
>> - }
>> - return -EAGAIN;
>> +
>> + goto end;
>>
>> fail:
>> spin_lock(&tree->lock);
>> - zswap_entry_put(entry);
>> + refcount = zswap_entry_put(tree, entry);
>> spin_unlock(&tree->lock);
>> - return ret;
>> +
>> +end:
>> + if (refcount <= 0)
>> + return 0;
>> + else
>> + return -EAGAIN;
>> }
>>
>> /*********************************
>> @@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>> zswap_duplicate_entry++;
>> /* remove from rbtree */
>> rb_erase(&dupentry->rbnode, &tree->rbroot);
>> - if (!zswap_entry_put(dupentry)) {
>> - /* free */
>> - zswap_free_entry(tree, dupentry);
>> - }
>> + RB_CLEAR_NODE(&dupentry->rbnode);
>> + zswap_entry_put(tree, dupentry);
>> }
>> } while (ret == -EEXIST);
>> spin_unlock(&tree->lock);
>> @@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>
>> /* find */
>> spin_lock(&tree->lock);
>> - entry = zswap_rb_search(&tree->rbroot, offset);
>> + entry = zswap_entry_find_get(&tree->rbroot, offset);
>> if (!entry) {
>> /* entry was written back */
>> spin_unlock(&tree->lock);
>> return -1;
>> }
>> - zswap_entry_get(entry);
>> spin_unlock(&tree->lock);
>>
>> /* decompress */
>> @@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>> BUG_ON(ret);
>>
>> spin_lock(&tree->lock);
>> - refcount = zswap_entry_put(entry);
>> - if (likely(refcount)) {
>> - spin_unlock(&tree->lock);
>> - return 0;
>> - }
>> + zswap_entry_put(tree, entry);
>> spin_unlock(&tree->lock);
>>
>> - /*
>> - * We don't have to unlink from the rbtree because
>> - * zswap_writeback_entry() or zswap_frontswap_invalidate page()
>> - * has already done this for us if we are the last reference.
>> - */
>> - /* free */
>> -
>> - zswap_free_entry(tree, entry);
>> -
>> return 0;
>> }
>>
>> @@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>>
>> /* remove from rbtree */
>> rb_erase(&entry->rbnode, &tree->rbroot);
>> + RB_CLEAR_NODE(&entry->rbnode);
>>
>> /* drop the initial reference from entry creation */
>> - refcount = zswap_entry_put(entry);
>> + zswap_entry_put(tree, entry);
>>
>> spin_unlock(&tree->lock);
>> -
>> - if (refcount) {
>> - /* writeback in progress, writeback will free */
>> - return;
>> - }
>> -
>> - /* free */
>> - zswap_free_entry(tree, entry);
>> }
>>
>> /* frees all zswap entries for the given swap type */
>>
>> --
>
> --
> Kind regards,
> Minchan Kim

2013-10-12 08:45:27

by Weijie Yang

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Sat, Oct 12, 2013 at 10:32 AM, Bob Liu <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang <[email protected]> wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>>> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>> >
>>> > Modify:
>>> > - check the refcount in fail path, free memory if it is not referenced.
>>>
>>> Hmm, I don't like this because zswap refcount routine is already mess for me.
>>> I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>>
>>> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>> the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>> all of caller can use it easily following pattern.
>>>
>>> find_get_zswap_entry
>>> ...
>>> ...
>>> zswap_entry_put
>>>
>>> Of course, zswap_entry_put have to check the entry is in the tree or not
>>> so if someone already removes it from the tree, it should avoid double remove.
>>>
>>> One of the concern I can think is that approach extends critical section
>>> but I think it would be no problem because more bottleneck would be [de]compress
>>> functions. If it were really problem, we can mitigate a problem with moving
>>> unnecessary functions out of zswap_free_entry because it seem to be rather
>>> over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>>
>
> I have nothing in addition to Minchan's review.
>
> Since the code is a bit complex, I'd suggest you split it into two patches.
> [1/2]: fix the memory leak
> [2/2]: clean up the entry_put
>
> And run some testing.

I will split and test it.

Thanks.

> Thanks,
> -Bob

2013-10-12 09:14:15

by Weijie Yang

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu <[email protected]> wrote:
> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>> > >
>>> > > Modify:
>>> > > - check the refcount in fail path, free memory if it is not referenced.
>>> >
>>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>> >
>>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>> > all of caller can use it easily following pattern.
>>> >
>>> > find_get_zswap_entry
>>> > ...
>>> > ...
>>> > zswap_entry_put
>>> >
>>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>>> > so if someone already removes it from the tree, it should avoid double remove.
>>> >
>>> > One of the concern I can think is that approach extends critical section
>>> > but I think it would be no problem because more bottleneck would be [de]compress
>>> > functions. If it were really problem, we can mitigate a problem with moving
>>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>>> > over-enginnering.
>>>
>>> I refactor the zswap refcount routine according to Minchan's idea.
>>> Here is the new patch, Any suggestion is welcomed.
>>>
>>> To Seth and Bob, would you please review it again?
>>
>> Yeah, Seth, Bob. You guys are right persons to review this because this
>> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>> But anyway, I will review code itself.
>>
>>>
>>> mm/zswap.c | 116
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>> 1 file changed, 52 insertions(+), 64 deletions(-)
>>>
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> old mode 100644
>>> new mode 100755
>>> index deda2b6..bd04910
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>> if (!entry)
>>> return NULL;
>>> entry->refcount = 1;
>>> + RB_CLEAR_NODE(&entry->rbnode);
>>> return entry;
>>> }
>>>
>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>> }
>>>
>>> /* caller must hold the tree lock */
>>> -static int zswap_entry_put(struct zswap_entry *entry)
>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>>
>> Why should we have return value? If we really need it, it mitigates
>> get/put semantic's whole point so I'd like to just return void.
>>
>> Let me see.
>>
>>> {
>>> - entry->refcount--;
>>> - return entry->refcount;
>>> + int refcount = --entry->refcount;
>>> +
>>> + if (refcount <= 0) {
>>
>> Hmm, I don't like minus refcount, really.
>> I hope we could do following as
>>
>> BUG_ON(refcount < 0);
>> if (refcount == 0) {
>> ...
>> }
>>
>>
>>
>>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>>> + rb_erase(&entry->rbnode, &tree->rbroot);
>>> + RB_CLEAR_NODE(&entry->rbnode);
>>
>> Minor,
>> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
>> from rb tree and clear node because we have already zswap_rb_insert.
>>
>>
>>> + }
>>> +
>>> + zswap_free_entry(tree, entry);
>>> + }
>>> +
>>> + return refcount;
>>> }
>>>
>>> /*********************************
>>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>> return NULL;
>>> }
>>>
>>
>> Add function description.
>>
>>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>>> +{
>>> + struct zswap_entry *entry = NULL;
>>> +
>>> + entry = zswap_rb_search(root, offset);
>>> + if (entry)
>>> + zswap_entry_get(entry);
>>> +
>>> + return entry;
>>> +}
>>> +
>>> /*
>>> * In the case that a entry with the same offset is found, a pointer to
>>> * the existing entry is stored in dupentry and the function returns -EEXIST
>>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>> enum zswap_get_swap_ret {
>>> ZSWAP_SWAPCACHE_NEW,
>>> ZSWAP_SWAPCACHE_EXIST,
>>> - ZSWAP_SWAPCACHE_NOMEM
>>> + ZSWAP_SWAPCACHE_FAIL,
>>> };
>>>
>>> /*
>>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>> * added to the swap cache, and returned in retpage.
>>> *
>>> * If success, the swap cache page is returned in retpage
>>> - * Returns 0 if page was already in the swap cache, page is not locked
>>> - * Returns 1 if the new page needs to be populated, page is locked
>>> - * Returns <0 on error
>>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>> */
>>> static int zswap_get_swap_cache_page(swp_entry_t entry,
>>> struct page **retpage)
>>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>> if (new_page)
>>> page_cache_release(new_page);
>>> if (!found_page)
>>> - return ZSWAP_SWAPCACHE_NOMEM;
>>> + return ZSWAP_SWAPCACHE_FAIL;
>>> *retpage = found_page;
>>> return ZSWAP_SWAPCACHE_EXIST;
>>> }
>>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>
>>> /* find and ref zswap entry */
>>> spin_lock(&tree->lock);
>>> - entry = zswap_rb_search(&tree->rbroot, offset);
>>> + entry = zswap_entry_find_get(&tree->rbroot, offset);
>>> if (!entry) {
>>> /* entry was invalidated */
>>> spin_unlock(&tree->lock);
>>> return 0;
>>> }
>>> - zswap_entry_get(entry);
>>> spin_unlock(&tree->lock);
>>> BUG_ON(offset != entry->offset);
>>>
>>> /* try to allocate swap cache page */
>>> switch (zswap_get_swap_cache_page(swpentry, &page)) {
>>> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>>> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>> ret = -ENOMEM;
>>> goto fail;
>>>
>>> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>>> + case ZSWAP_SWAPCACHE_EXIST:
>>
>> Why did you remove comment?
>>
>>> /* page is already in the swap cache, ignore for now */
>>> page_cache_release(page);
>>> ret = -EEXIST;
>>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>> zswap_written_back_pages++;
>>>
>>> spin_lock(&tree->lock);
>>> -
>>> /* drop local reference */
>>> - zswap_entry_put(entry);
>>> + refcount = zswap_entry_put(tree, entry);
>>> /* drop the initial reference from entry creation */
>>> - refcount = zswap_entry_put(entry);
>>> -
>>> - /*
>>> - * There are three possible values for refcount here:
>>> - * (1) refcount is 1, load is in progress, unlink from rbtree,
>>> - * load will free
>>> - * (2) refcount is 0, (normal case) entry is valid,
>>> - * remove from rbtree and free entry
>>> - * (3) refcount is -1, invalidate happened during writeback,
>>> - * free entry
>>> - */
>>> - if (refcount >= 0) {
>>> - /* no invalidate yet, remove from rbtree */
>>> + if (refcount > 0) {
>>> rb_erase(&entry->rbnode, &tree->rbroot);
>>> + RB_CLEAR_NODE(&entry->rbnode);
>>> + refcount = zswap_entry_put(tree, entry);
>>
>> Now, I see why you need return in zswap_entry_put but let's consider again
>> because it's really mess to me and it hurts get/put semantic a lot so
>> How about this?
>>
>> spin_lock(&tree->lock);
>> /* drop local reference */
>> zswap_entry_put(tree, entry);
>> /*
>> * In here, we want to free entry but invalidation may free earlier
>> * under us so that we should check it again
>> */
>> if (entry == zswap_rb_search(&tree->rb_root, offset))
>
> Then where is the entry unlinked from the rbtree if a load was in progress?

zswap_entry_put() has the unlink handling logic.
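Just to make that concrete, it would be roughly this shape (untested sketch,
with Minchan's BUG_ON suggestion folded in; the caller holds the tree lock):

/* caller must hold the tree lock */
static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
{
	int refcount = --entry->refcount;

	BUG_ON(refcount < 0);
	if (refcount == 0) {
		/* unlink only if the entry is still on the tree */
		if (!RB_EMPTY_NODE(&entry->rbnode)) {
			rb_erase(&entry->rbnode, &tree->rbroot);
			RB_CLEAR_NODE(&entry->rbnode);
		}
		zswap_free_entry(tree, entry);
	}
	return refcount;
}

So whichever of load, invalidate or writeback drops the last reference also
unlinks the entry, and a double rb_erase() is avoided.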

> And in the following fail path, return value from zswap_entry_put() is
> also used.

It is okay even if we return -EAGAIN in the fail path

>> /* Yes, it's stable so we should free it */
>> zswap_entry_put(tree, entry);
>>
>> /*
>> * Whether it would be freed by invalidation or writeback, it doesn't
>> * matter. Important thing is that it will be freed so there
>> * is no point to return -EAGAIN.
>> */
>> spin_unlock(&tree->lock);
>> return 0;
>>
>
> --
> Regards,
> --Bob

2013-10-12 09:29:09

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

On Sat, Oct 12, 2013 at 5:14 PM, Weijie Yang <[email protected]> wrote:
> On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu <[email protected]> wrote:
>> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
>>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>>> > >
>>>> > > Modify:
>>>> > > - check the refcount in fail path, free memory if it is not referenced.
>>>> >
>>>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>>>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>>> >
>>>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>>> > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>>> > all of caller can use it easily following pattern.
>>>> >
>>>> > find_get_zswap_entry
>>>> > ...
>>>> > ...
>>>> > zswap_entry_put
>>>> >
>>>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>>>> > so if someone already removes it from the tree, it should avoid double remove.
>>>> >
>>>> > One of the concern I can think is that approach extends critical section
>>>> > but I think it would be no problem because more bottleneck would be [de]compress
>>>> > functions. If it were really problem, we can mitigate a problem with moving
>>>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>>>> > over-enginnering.
>>>>
>>>> I refactor the zswap refcount routine according to Minchan's idea.
>>>> Here is the new patch, Any suggestion is welcomed.
>>>>
>>>> To Seth and Bob, would you please review it again?
>>>
>>> Yeah, Seth, Bob. You guys are right persons to review this because this
>>> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>>> But anyway, I will review code itself.
>>>
>>>>
>>>> mm/zswap.c | 116
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>>> 1 file changed, 52 insertions(+), 64 deletions(-)
>>>>
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> old mode 100644
>>>> new mode 100755
>>>> index deda2b6..bd04910
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>>> if (!entry)
>>>> return NULL;
>>>> entry->refcount = 1;
>>>> + RB_CLEAR_NODE(&entry->rbnode);
>>>> return entry;
>>>> }
>>>>
>>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>>> }
>>>>
>>>> /* caller must hold the tree lock */
>>>> -static int zswap_entry_put(struct zswap_entry *entry)
>>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>>>
>>> Why should we have return value? If we really need it, it mitigates
>>> get/put semantic's whole point so I'd like to just return void.
>>>
>>> Let me see.
>>>
>>>> {
>>>> - entry->refcount--;
>>>> - return entry->refcount;
>>>> + int refcount = --entry->refcount;
>>>> +
>>>> + if (refcount <= 0) {
>>>
>>> Hmm, I don't like minus refcount, really.
>>> I hope we could do following as
>>>
>>> BUG_ON(refcount < 0);
>>> if (refcount == 0) {
>>> ...
>>> }
>>>
>>>
>>>
>>>> + if (!RB_EMPTY_NODE(&entry->rbnode)) {
>>>> + rb_erase(&entry->rbnode, &tree->rbroot);
>>>> + RB_CLEAR_NODE(&entry->rbnode);
>>>
>>> Minor,
>>> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
>>> from rb tree and clear node because we have already zswap_rb_insert.
>>>
>>>
>>>> + }
>>>> +
>>>> + zswap_free_entry(tree, entry);
>>>> + }
>>>> +
>>>> + return refcount;
>>>> }
>>>>
>>>> /*********************************
>>>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>>> return NULL;
>>>> }
>>>>
>>>
>>> Add function description.
>>>
>>>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>>>> +{
>>>> + struct zswap_entry *entry = NULL;
>>>> +
>>>> + entry = zswap_rb_search(root, offset);
>>>> + if (entry)
>>>> + zswap_entry_get(entry);
>>>> +
>>>> + return entry;
>>>> +}
>>>> +
>>>> /*
>>>> * In the case that a entry with the same offset is found, a pointer to
>>>> * the existing entry is stored in dupentry and the function returns -EEXIST
>>>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>>> enum zswap_get_swap_ret {
>>>> ZSWAP_SWAPCACHE_NEW,
>>>> ZSWAP_SWAPCACHE_EXIST,
>>>> - ZSWAP_SWAPCACHE_NOMEM
>>>> + ZSWAP_SWAPCACHE_FAIL,
>>>> };
>>>>
>>>> /*
>>>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>>> * added to the swap cache, and returned in retpage.
>>>> *
>>>> * If success, the swap cache page is returned in retpage
>>>> - * Returns 0 if page was already in the swap cache, page is not locked
>>>> - * Returns 1 if the new page needs to be populated, page is locked
>>>> - * Returns <0 on error
>>>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>>>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>>>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>>> */
>>>> static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>> struct page **retpage)
>>>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>> if (new_page)
>>>> page_cache_release(new_page);
>>>> if (!found_page)
>>>> - return ZSWAP_SWAPCACHE_NOMEM;
>>>> + return ZSWAP_SWAPCACHE_FAIL;
>>>> *retpage = found_page;
>>>> return ZSWAP_SWAPCACHE_EXIST;
>>>> }
>>>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>>
>>>> /* find and ref zswap entry */
>>>> spin_lock(&tree->lock);
>>>> - entry = zswap_rb_search(&tree->rbroot, offset);
>>>> + entry = zswap_entry_find_get(&tree->rbroot, offset);
>>>> if (!entry) {
>>>> /* entry was invalidated */
>>>> spin_unlock(&tree->lock);
>>>> return 0;
>>>> }
>>>> - zswap_entry_get(entry);
>>>> spin_unlock(&tree->lock);
>>>> BUG_ON(offset != entry->offset);
>>>>
>>>> /* try to allocate swap cache page */
>>>> switch (zswap_get_swap_cache_page(swpentry, &page)) {
>>>> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>>>> + case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>>> ret = -ENOMEM;
>>>> goto fail;
>>>>
>>>> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>>>> + case ZSWAP_SWAPCACHE_EXIST:
>>>
>>> Why did you remove comment?
>>>
>>>> /* page is already in the swap cache, ignore for now */
>>>> page_cache_release(page);
>>>> ret = -EEXIST;
>>>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>> zswap_written_back_pages++;
>>>>
>>>> spin_lock(&tree->lock);
>>>> -
>>>> /* drop local reference */
>>>> - zswap_entry_put(entry);
>>>> + refcount = zswap_entry_put(tree, entry);
>>>> /* drop the initial reference from entry creation */
>>>> - refcount = zswap_entry_put(entry);
>>>> -
>>>> - /*
>>>> - * There are three possible values for refcount here:
>>>> - * (1) refcount is 1, load is in progress, unlink from rbtree,
>>>> - * load will free
>>>> - * (2) refcount is 0, (normal case) entry is valid,
>>>> - * remove from rbtree and free entry
>>>> - * (3) refcount is -1, invalidate happened during writeback,
>>>> - * free entry
>>>> - */
>>>> - if (refcount >= 0) {
>>>> - /* no invalidate yet, remove from rbtree */
>>>> + if (refcount > 0) {
>>>> rb_erase(&entry->rbnode, &tree->rbroot);
>>>> + RB_CLEAR_NODE(&entry->rbnode);
>>>> + refcount = zswap_entry_put(tree, entry);
>>>
>>> Now, I see why you need return in zswap_entry_put but let's consider again
>>> because it's really mess to me and it hurts get/put semantic a lot so
>>> How about this?
>>>
>>> spin_lock(&tree->lock);
>>> /* drop local reference */
>>> zswap_entry_put(tree, entry);
>>> /*
>>> * In here, we want to free entry but invalidation may free earlier
>>> * under us so that we should check it again
>>> */
>>> if (entry == zswap_rb_search(&tree->rb_root, offset))
>>
>> Then where is the entry unlinked from the rbtree if a load was in progress?
>
> zswap_entry_put() has the unlink handling logic.
>

I see. Then how about using if (!RB_EMPTY_NODE(&entry->rbnode)) to
replace the rbtree search?
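Then the tail of zswap_writeback_entry() could become something like this
(just a rough, untested sketch to show the idea):

	spin_lock(&tree->lock);
	/*
	 * if invalidation has not removed the entry under us, it is still
	 * on the tree and the initial reference from entry creation is
	 * still held, so drop it here; check before the final put, because
	 * that put may free the entry
	 */
	if (!RB_EMPTY_NODE(&entry->rbnode))
		zswap_entry_put(tree, entry);
	/* drop the local reference taken at the start of writeback */
	zswap_entry_put(tree, entry);
	spin_unlock(&tree->lock);

	return 0;

That avoids walking the rbtree a second time while staying safe under the
tree lock.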

>> And in the following fail path, return value from zswap_entry_put() is
>> also used.
>
> It is okay even if we return -EAGAIN in the fail path
>

--
Regards,
--Bob

2013-10-12 09:39:06

by Weijie Yang

[permalink] [raw]
Subject: RE: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Hi, all

I have thought of a new way to resolve this problem: use CAS instead of a refcount.

It not only resolves this problem but also fixes another issue, and it can be
expanded to support a frontswap get_and_free mode easily. I already use it in a
zswap variant which writes back zbud pages directly.

If it is accepted, I will resend it instead of the previous refactor patch.

Please see below; request for comments. Thanks.
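Before the full patch below, the core of the idea in a nutshell: every path
(load, writeback, invalidate) first claims exclusive ownership of the entry
with a single cmpxchg on its state and only then touches it. Something like
the helper below (illustrative only; zswap_entry_hold is just a name for this
sketch, the patch itself open-codes the cmpxchg):

/*
 * try to move @entry from the in-pool state to @newpos;
 * returns true if this thread now owns the entry, false if another
 * thread (load, reclaim or invalidate) already claimed it
 */
static bool zswap_entry_hold(struct zswap_entry *entry, int newpos)
{
	return atomic_cmpxchg(&entry->pos, POS_POOL, newpos) == POS_POOL;
}

A loser of the race either retries (load, invalidate) or gives up and returns
-EAGAIN (writeback); the winner either puts the entry back to POS_POOL when it
is done, or frees it without returning it to the pool.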

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
> > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > > >
> > > > Modify:
> > > > - check the refcount in fail path, free memory if it is not referenced.
> > >
> > > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > >
> > > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > > all of caller can use it easily following pattern.
> > >
> > > find_get_zswap_entry
> > > ...
> > > ...
> > > zswap_entry_put
> > >
> > > Of course, zswap_entry_put have to check the entry is in the tree or not
> > > so if someone already removes it from the tree, it should avoid double remove.
> > >
> > > One of the concern I can think is that approach extends critical section
> > > but I think it would be no problem because more bottleneck would be [de]compress
> > > functions. If it were really problem, we can mitigate a problem with moving
> > > unnecessary functions out of zswap_free_entry because it seem to be rather
> > > over-enginnering.
> >
> > I refactor the zswap refcount routine according to Minchan's idea.
> > Here is the new patch, Any suggestion is welcomed.
> >
> > To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.

Consider the following scenario:
thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
finished, entry x and its zbud is not freed as its refcount != 0
now, the swap_map[x] = 0
thread 0: now call zswap_get_swap_cache_page
swapcache_prepare return -ENOENT because entry x is not used any more
zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
zswap_writeback_entry do nothing except put refcount
Now, the memory of zswap_entry x and its zpage leak.

Consider another scenario:
zswap entry x with offset A is already stored in the zswap backend.
thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
thread 1: store a new page with the same offset A, alloc a new zswap entry y.
This is a duplicate store and it finishes.
shrink_page_list() finishes; now the new page is not in the swap cache
thread 0: zswap_get_swap_cache_page gets called.
old page data is added to the swap cache
Now, the swap cache has old data rather than new data for offset A.
An error will happen if do_swap_page() gets the page from the swap cache.

Modify:
- re-design the zswap_entry concurrent protection by using a CAS-access flag
instead of the refcount get/put semantic

- use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
can be not only caused by nomem but also by invalidate.

Signed-off-by: Weijie Yang <[email protected]>
---
mm/zswap.c | 177 +++++++++++++++++++++++++++---------------------------------
1 file changed, 80 insertions(+), 97 deletions(-)
mode change 100644 => 100755 mm/zswap.c

diff --git a/mm/zswap.c b/mm/zswap.c
old mode 100644
new mode 100755
index cbd9578..fcaaecb
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
* page within zswap.
*
* rbnode - links the entry into red-black tree for the appropriate swap type
- * refcount - the number of outstanding reference to the entry. This is needed
- * to protect against premature freeing of the entry by code
- * concurent calls to load, invalidate, and writeback. The lock
- * for the zswap_tree structure that contains the entry must
- * be held while changing the refcount. Since the lock must
- * be held, there is no reason to also make refcount atomic.
* offset - the swap offset for the entry. Index into the red-black tree.
+ * pos - the position or status of this entry. see below macro definition
+ * change it by CAS, we hold this entry on success or retry if fail
+ * protect against concurrent access by load, invalidate or writeback
+ * even though the probability of concurrent access is very small
* handle - zsmalloc allocation handle that stores the compressed page data
* length - the length in bytes of the compressed page data. Needed during
* decompression
*/
+#define POS_POOL 1 /* stay in pool still */
+#define POS_LOAD 2 /* is loading */
+#define POS_RECLAIM 3 /* is reclaiming */
+#define POS_FLUSH 4 /* is invalidating */
struct zswap_entry {
struct rb_node rbnode;
pgoff_t offset;
- int refcount;
+ atomic_t pos;
unsigned int length;
unsigned long handle;
};
@@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
entry = kmem_cache_alloc(zswap_entry_cache, gfp);
if (!entry)
return NULL;
- entry->refcount = 1;
return entry;
}

@@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
kmem_cache_free(zswap_entry_cache, entry);
}

-/* caller must hold the tree lock */
-static void zswap_entry_get(struct zswap_entry *entry)
-{
- entry->refcount++;
-}
-
-/* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
-{
- entry->refcount--;
- return entry->refcount;
-}

/*********************************
* rbtree functions
@@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
zswap_pool_pages = zbud_get_pool_size(tree->pool);
}

+/* caller must hold the tree lock, entry is on the tree
+* seldom called fortunately
+* return 0: free this entry successfully
+* return < 0: fail
+*/
+static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
+{
+ if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
+ rb_erase(&entry->rbnode, &tree->rbroot);
+ zswap_free_entry(tree, entry);
+ return 0;
+ }
+
+ /* encounter reclaim called by another thread */
+ return -1;
+}
+
/*********************************
* writeback code
**********************************/
@@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
enum zswap_get_swap_ret {
ZSWAP_SWAPCACHE_NEW,
ZSWAP_SWAPCACHE_EXIST,
- ZSWAP_SWAPCACHE_NOMEM
+ ZSWAP_SWAPCACHE_FAIL,
};

/*
@@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
* added to the swap cache, and returned in retpage.
*
* If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
+ * the new page is added to swap cache and locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
*/
static int zswap_get_swap_cache_page(swp_entry_t entry,
struct page **retpage)
@@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
if (new_page)
page_cache_release(new_page);
if (!found_page)
- return ZSWAP_SWAPCACHE_NOMEM;
+ return ZSWAP_SWAPCACHE_FAIL;
*retpage = found_page;
return ZSWAP_SWAPCACHE_EXIST;
}
@@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
struct page *page;
u8 *src, *dst;
unsigned int dlen;
- int ret, refcount;
+ int ret;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
};
@@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
spin_unlock(&tree->lock);
return 0;
}
- zswap_entry_get(entry);
+ /* maybe encounter load or invalidate called by another thread
+ * hold or give up this entry
+ */
+ if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
+ spin_unlock(&tree->lock);
+ return -EAGAIN;
+ }
spin_unlock(&tree->lock);
BUG_ON(offset != entry->offset);

/* try to allocate swap cache page */
switch (zswap_get_swap_cache_page(swpentry, &page)) {
- case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+ case ZSWAP_SWAPCACHE_FAIL:
ret = -ENOMEM;
goto fail;

- case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+ case ZSWAP_SWAPCACHE_EXIST:
/* page is already in the swap cache, ignore for now */
page_cache_release(page);
ret = -EEXIST;
@@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
page_cache_release(page);
zswap_written_back_pages++;

+ /* once we hold this entry and get here, we can free entry safely
+ * during the period which we hold this entry:
+ * 1. if a thread call do_swap_page concurrently
+ * we will get ZSWAP_SWAPCACHE_EXIST returned or
+ * do_swap_page will hit page we added in the swap cache
+ * 2. if a thread call invalidate concurrently
+ * invalidate thread should wait until this function end
+ */
spin_lock(&tree->lock);
-
- /* drop local reference */
- zswap_entry_put(entry);
- /* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
-
- /*
- * There are three possible values for refcount here:
- * (1) refcount is 1, load is in progress, unlink from rbtree,
- * load will free
- * (2) refcount is 0, (normal case) entry is valid,
- * remove from rbtree and free entry
- * (3) refcount is -1, invalidate happened during writeback,
- * free entry
- */
- if (refcount >= 0) {
- /* no invalidate yet, remove from rbtree */
- rb_erase(&entry->rbnode, &tree->rbroot);
- }
+ rb_erase(&entry->rbnode, &tree->rbroot);
spin_unlock(&tree->lock);
- if (refcount <= 0) {
- /* free the entry */
- zswap_free_entry(tree, entry);
- return 0;
- }
- return -EAGAIN;
+ zswap_free_entry(tree, entry);
+
+ return 0;

fail:
- spin_lock(&tree->lock);
- zswap_entry_put(entry);
- spin_unlock(&tree->lock);
+ BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
return ret;
}

@@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
entry->offset = offset;
entry->handle = handle;
entry->length = dlen;
+ atomic_set(&entry->pos, POS_POOL);

/* map */
spin_lock(&tree->lock);
do {
ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
- if (ret == -EEXIST) {
- zswap_duplicate_entry++;
- /* remove from rbtree */
- rb_erase(&dupentry->rbnode, &tree->rbroot);
- if (!zswap_entry_put(dupentry)) {
- /* free */
- zswap_free_entry(tree, dupentry);
- }
+ if (unlikely(ret == -EEXIST)) {
+ if (zswap_invalidate_entry(tree, dupentry) == 0)
+ zswap_duplicate_entry++;
}
} while (ret == -EEXIST);
spin_unlock(&tree->lock);
@@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
struct zswap_entry *entry;
u8 *src, *dst;
unsigned int dlen;
- int refcount, ret;
+ int ret;

/* find */
+find:
spin_lock(&tree->lock);
entry = zswap_rb_search(&tree->rbroot, offset);
if (!entry) {
@@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
spin_unlock(&tree->lock);
return -1;
}
- zswap_entry_get(entry);
+ /* encounter entry reclaimed by another thread
+ * just retry until that thread get ZSWAP_SWAPCACHE_EXIST returned
+ * as we hold the entry, we can do anything: get or get_and_free
+ */
+ if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
+ spin_unlock(&tree->lock);
+ goto find;
+ }
spin_unlock(&tree->lock);

/* decompress */
@@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
zbud_unmap(tree->pool, entry->handle);
BUG_ON(ret);

- spin_lock(&tree->lock);
- refcount = zswap_entry_put(entry);
- if (likely(refcount)) {
- spin_unlock(&tree->lock);
- return 0;
- }
- spin_unlock(&tree->lock);
-
- /*
- * We don't have to unlink from the rbtree because
- * zswap_writeback_entry() or zswap_frontswap_invalidate page()
- * has already done this for us if we are the last reference.
- */
- /* free */
-
- zswap_free_entry(tree, entry);
+ BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);

return 0;
}
@@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
{
struct zswap_tree *tree = zswap_trees[type];
struct zswap_entry *entry;
- int refcount;

/* find */
+find:
spin_lock(&tree->lock);
entry = zswap_rb_search(&tree->rbroot, offset);
if (!entry) {
@@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
return;
}

+ /* encounter entry reclaimed by another thread
+ * just retry until that reclaim end.
+ */
+ if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
+ spin_unlock(&tree->lock);
+ goto find;
+ }
+
/* remove from rbtree */
rb_erase(&entry->rbnode, &tree->rbroot);
-
- /* drop the initial reference from entry creation */
- refcount = zswap_entry_put(entry);
-
spin_unlock(&tree->lock);

- if (refcount) {
- /* writeback in progress, writeback will free */
- return;
- }
-
/* free */
zswap_free_entry(tree, entry);
}
@@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
*/
while ((node = rb_first(&tree->rbroot))) {
entry = rb_entry(node, struct zswap_entry, rbnode);
- rb_erase(&entry->rbnode, &tree->rbroot);
- zbud_free(tree->pool, entry->handle);
- zswap_entry_cache_free(entry);
- atomic_dec(&zswap_stored_pages);
+ zswap_invalidate_entry(tree, entry);
}
tree->rbroot = RB_ROOT;
spin_unlock(&tree->lock);
--
1.7.10.4

2013-10-14 00:31:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Hello,

On Sat, Oct 12, 2013 at 05:37:55PM +0800, Weijie Yang wrote:
> Hi, all
>
> I thought out a new way to resolve this problem: use CAS instead of refcount.
>
> It not only resolve this problem, but also fix another issue,
> can be expanded to support frontswap get_and_free mode easily.
> And I use it in an zswap variant which writeback zbud page directly.
>
> If it is accepted, I will resent it instead of the previous refactor patch

You're adding atomic operations and making the code more complicated to me,
rather than the simple get/put scheme. Please write a good description of why
it's better, not vague words like "fix another issue, get_and_free and zswap variant".

Please convince us.

>
> please see below, Request For Comments, Thanks.
>
> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
> > On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> > > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
> > > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > > > >
> > > > > Modify:
> > > > > - check the refcount in fail path, free memory if it is not referenced.
> > > >
> > > > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > > > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > > >
> > > > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > > > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > > > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > > > all of caller can use it easily following pattern.
> > > >
> > > > find_get_zswap_entry
> > > > ...
> > > > ...
> > > > zswap_entry_put
> > > >
> > > > Of course, zswap_entry_put have to check the entry is in the tree or not
> > > > so if someone already removes it from the tree, it should avoid double remove.
> > > >
> > > > One of the concern I can think is that approach extends critical section
> > > > but I think it would be no problem because more bottleneck would be [de]compress
> > > > functions. If it were really problem, we can mitigate a problem with moving
> > > > unnecessary functions out of zswap_free_entry because it seem to be rather
> > > > over-enginnering.
> > >
> > > I refactor the zswap refcount routine according to Minchan's idea.
> > > Here is the new patch, Any suggestion is welcomed.
> > >
> > > To Seth and Bob, would you please review it again?
> >
> > Yeah, Seth, Bob. You guys are right persons to review this because this
> > scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> > But anyway, I will review code itself.
>
> Consider the following scenario:
> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
> finished, entry x and its zbud is not freed as its refcount != 0
> now, the swap_map[x] = 0
> thread 0: now call zswap_get_swap_cache_page
> swapcache_prepare return -ENOENT because entry x is not used any more
> zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
> zswap_writeback_entry do nothing except put refcount
> Now, the memory of zswap_entry x and its zpage leak.
>
> Consider the another scenario:
> zswap entry x with offset A is already stored in zswap backend.
> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> This is a duplicate store and finished.
> shrink_page_list() finish, now the new page is not in swap_cache
> thread 0: zswap_get_swap_cache_page get called.
> old page data is added to swap_cache
> Now, swap cache has old data rather than new data for offset A.
> Error will happen If do_swap_page() get page from swap_cache.
>
> Modify:
> - re-design the zswap_entry concurrent protection by using a CAS-access flag
> instead of the refcount get/put semantic
>
> - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
> can be not only caused by nomem but also by invalidate.
>
> Signed-off-by: Weijie Yang <[email protected]>
> ---
> mm/zswap.c | 177 +++++++++++++++++++++++++++---------------------------------
> 1 file changed, 80 insertions(+), 97 deletions(-)
> mode change 100644 => 100755 mm/zswap.c
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> old mode 100644
> new mode 100755
> index cbd9578..fcaaecb
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
> * page within zswap.
> *
> * rbnode - links the entry into red-black tree for the appropriate swap type
> - * refcount - the number of outstanding reference to the entry. This is needed
> - * to protect against premature freeing of the entry by code
> - * concurent calls to load, invalidate, and writeback. The lock
> - * for the zswap_tree structure that contains the entry must
> - * be held while changing the refcount. Since the lock must
> - * be held, there is no reason to also make refcount atomic.
> * offset - the swap offset for the entry. Index into the red-black tree.
> + * pos - the position or status of this entry. see below macro definition
> + * change it by CAS, we hold this entry on success or retry if fail
> + * protect against concurrent access by load, invalidate or writeback
> + * even though the probability of concurrent access is very small
> * handle - zsmalloc allocation handle that stores the compressed page data
> * length - the length in bytes of the compressed page data. Needed during
> * decompression
> */
> +#define POS_POOL 1 /* stay in pool still */
> +#define POS_LOAD 2 /* is loading */
> +#define POS_RECLAIM 3 /* is reclaiming */
> +#define POS_FLUSH 4 /* is invalidating */
> struct zswap_entry {
> struct rb_node rbnode;
> pgoff_t offset;
> - int refcount;
> + atomic_t pos;
> unsigned int length;
> unsigned long handle;
> };
> @@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
> entry = kmem_cache_alloc(zswap_entry_cache, gfp);
> if (!entry)
> return NULL;
> - entry->refcount = 1;
> return entry;
> }
>
> @@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> kmem_cache_free(zswap_entry_cache, entry);
> }
>
> -/* caller must hold the tree lock */
> -static void zswap_entry_get(struct zswap_entry *entry)
> -{
> - entry->refcount++;
> -}
> -
> -/* caller must hold the tree lock */
> -static int zswap_entry_put(struct zswap_entry *entry)
> -{
> - entry->refcount--;
> - return entry->refcount;
> -}
>
> /*********************************
> * rbtree functions
> @@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> zswap_pool_pages = zbud_get_pool_size(tree->pool);
> }
>
> +/* caller must hold the tree lock, entry is on the tree
> +* seldom called fortunately
> +* return 0: free this entry successfully
> +* return < 0: fail
> +*/
> +static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> +{
> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
> + rb_erase(&entry->rbnode, &tree->rbroot);
> + zswap_free_entry(tree, entry);
> + return 0;
> + }
> +
> + /* encounter reclaim called by another thread */
> + return -1;
> +}
> +
> /*********************************
> * writeback code
> **********************************/
> @@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> enum zswap_get_swap_ret {
> ZSWAP_SWAPCACHE_NEW,
> ZSWAP_SWAPCACHE_EXIST,
> - ZSWAP_SWAPCACHE_NOMEM
> + ZSWAP_SWAPCACHE_FAIL,
> };
>
> /*
> @@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
> * added to the swap cache, and returned in retpage.
> *
> * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
> + * the new page is added to swap cache and locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
> */
> static int zswap_get_swap_cache_page(swp_entry_t entry,
> struct page **retpage)
> @@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
> if (new_page)
> page_cache_release(new_page);
> if (!found_page)
> - return ZSWAP_SWAPCACHE_NOMEM;
> + return ZSWAP_SWAPCACHE_FAIL;
> *retpage = found_page;
> return ZSWAP_SWAPCACHE_EXIST;
> }
> @@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
> struct page *page;
> u8 *src, *dst;
> unsigned int dlen;
> - int ret, refcount;
> + int ret;
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_NONE,
> };
> @@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
> spin_unlock(&tree->lock);
> return 0;
> }
> - zswap_entry_get(entry);
> + /* maybe encounter load or invalidate called by another thread
> + * hold or give up this entry
> + */
> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
> + spin_unlock(&tree->lock);
> + return -EAGAIN;
> + }
> spin_unlock(&tree->lock);
> BUG_ON(offset != entry->offset);
>
> /* try to allocate swap cache page */
> switch (zswap_get_swap_cache_page(swpentry, &page)) {
> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> + case ZSWAP_SWAPCACHE_FAIL:
> ret = -ENOMEM;
> goto fail;
>
> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> + case ZSWAP_SWAPCACHE_EXIST:
> /* page is already in the swap cache, ignore for now */
> page_cache_release(page);
> ret = -EEXIST;
> @@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
> page_cache_release(page);
> zswap_written_back_pages++;
>
> + /* once we hold this entry and get here, we can free entry safely
> + * during the period which we hold this entry:
> + * 1. if a thread call do_swap_page concurrently
> + * we will get ZSWAP_SWAPCACHE_EXIST returned or
> + * do_swap_page will hit page we added in the swap cache
> + * 2. if a thread call invalidate concurrently
> + * invalidate thread should wait until this function end
> + */
> spin_lock(&tree->lock);
> -
> - /* drop local reference */
> - zswap_entry_put(entry);
> - /* drop the initial reference from entry creation */
> - refcount = zswap_entry_put(entry);
> -
> - /*
> - * There are three possible values for refcount here:
> - * (1) refcount is 1, load is in progress, unlink from rbtree,
> - * load will free
> - * (2) refcount is 0, (normal case) entry is valid,
> - * remove from rbtree and free entry
> - * (3) refcount is -1, invalidate happened during writeback,
> - * free entry
> - */
> - if (refcount >= 0) {
> - /* no invalidate yet, remove from rbtree */
> - rb_erase(&entry->rbnode, &tree->rbroot);
> - }
> + rb_erase(&entry->rbnode, &tree->rbroot);
> spin_unlock(&tree->lock);
> - if (refcount <= 0) {
> - /* free the entry */
> - zswap_free_entry(tree, entry);
> - return 0;
> - }
> - return -EAGAIN;
> + zswap_free_entry(tree, entry);
> +
> + return 0;
>
> fail:
> - spin_lock(&tree->lock);
> - zswap_entry_put(entry);
> - spin_unlock(&tree->lock);
> + BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
> return ret;
> }
>
> @@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> entry->offset = offset;
> entry->handle = handle;
> entry->length = dlen;
> + atomic_set(&entry->pos, POS_POOL);
>
> /* map */
> spin_lock(&tree->lock);
> do {
> ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
> - if (ret == -EEXIST) {
> - zswap_duplicate_entry++;
> - /* remove from rbtree */
> - rb_erase(&dupentry->rbnode, &tree->rbroot);
> - if (!zswap_entry_put(dupentry)) {
> - /* free */
> - zswap_free_entry(tree, dupentry);
> - }
> + if (unlikely(ret == -EEXIST)) {
> + if (zswap_invalidate_entry(tree, dupentry) == 0)
> + zswap_duplicate_entry++;
> }
> } while (ret == -EEXIST);
> spin_unlock(&tree->lock);
> @@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> struct zswap_entry *entry;
> u8 *src, *dst;
> unsigned int dlen;
> - int refcount, ret;
> + int ret;
>
> /* find */
> +find:
> spin_lock(&tree->lock);
> entry = zswap_rb_search(&tree->rbroot, offset);
> if (!entry) {
> @@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> spin_unlock(&tree->lock);
> return -1;
> }
> - zswap_entry_get(entry);
> + /* encounter entry reclaimed by another thread
> + * just retry until that thread get ZSWAP_SWAPCACHE_EXIST returned
> + * as we hold the entry, we can do anything: get or get_and_free
> + */
> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
> + spin_unlock(&tree->lock);
> + goto find;
> + }
> spin_unlock(&tree->lock);
>
> /* decompress */
> @@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
> zbud_unmap(tree->pool, entry->handle);
> BUG_ON(ret);
>
> - spin_lock(&tree->lock);
> - refcount = zswap_entry_put(entry);
> - if (likely(refcount)) {
> - spin_unlock(&tree->lock);
> - return 0;
> - }
> - spin_unlock(&tree->lock);
> -
> - /*
> - * We don't have to unlink from the rbtree because
> - * zswap_writeback_entry() or zswap_frontswap_invalidate page()
> - * has already done this for us if we are the last reference.
> - */
> - /* free */
> -
> - zswap_free_entry(tree, entry);
> + BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);
>
> return 0;
> }
> @@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
> {
> struct zswap_tree *tree = zswap_trees[type];
> struct zswap_entry *entry;
> - int refcount;
>
> /* find */
> +find:
> spin_lock(&tree->lock);
> entry = zswap_rb_search(&tree->rbroot, offset);
> if (!entry) {
> @@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
> return;
> }
>
> + /* encounter entry reclaimed by another thread
> + * just retry until that reclaim end.
> + */
> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
> + spin_unlock(&tree->lock);
> + goto find;
> + }
> +
> /* remove from rbtree */
> rb_erase(&entry->rbnode, &tree->rbroot);
> -
> - /* drop the initial reference from entry creation */
> - refcount = zswap_entry_put(entry);
> -
> spin_unlock(&tree->lock);
>
> - if (refcount) {
> - /* writeback in progress, writeback will free */
> - return;
> - }
> -
> /* free */
> zswap_free_entry(tree, entry);
> }
> @@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
> */
> while ((node = rb_first(&tree->rbroot))) {
> entry = rb_entry(node, struct zswap_entry, rbnode);
> - rb_erase(&entry->rbnode, &tree->rbroot);
> - zbud_free(tree->pool, entry->handle);
> - zswap_entry_cache_free(entry);
> - atomic_dec(&zswap_stored_pages);
> + zswap_invalidate_entry(tree, entry);
> }
> tree->rbroot = RB_ROOT;
> spin_unlock(&tree->lock);
> --
> 1.7.10.4
>
> --

--
Kind regards,
Minchan Kim

2013-10-14 08:05:22

by Weijie Yang

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Hi,

In order to make things clear and simple,
I will re-send this v3 2/3 patch, split into two patches:

1. just fix the bug, the same as the first v3 2/3 patch of this email thread

2. refactor the get/put routine based on Minchan's suggestion; a rough sketch of the resulting caller pattern is below.
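For example, the load path would look roughly like this after the refactor
(untested sketch):

	/* find the entry and take a reference */
	spin_lock(&tree->lock);
	entry = zswap_entry_find_get(&tree->rbroot, offset);
	spin_unlock(&tree->lock);
	if (!entry)
		return -1;	/* entry was written back or invalidated */

	/* ... decompress into the page outside the tree lock ... */

	spin_lock(&tree->lock);
	zswap_entry_put(tree, entry);	/* may unlink and free the entry */
	spin_unlock(&tree->lock);
	return 0;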

As for the last CAS patch, I will add more description and resend it
as an RFC patch.

Thanks,

On Mon, Oct 14, 2013 at 8:31 AM, Minchan Kim <[email protected]> wrote:
> Hello,
>
> On Sat, Oct 12, 2013 at 05:37:55PM +0800, Weijie Yang wrote:
>> Hi, all
>>
>> I thought out a new way to resolve this problem: use CAS instead of refcount.
>>
>> It not only resolve this problem, but also fix another issue,
>> can be expanded to support frontswap get_and_free mode easily.
>> And I use it in an zswap variant which writeback zbud page directly.
>>
>> If it is accepted, I will resent it instead of the previous refactor patch
>
> You're adding atomic operations and making the code more complicated to me,
> rather than the simple get/put scheme. Please write a good description of why
> it's better, not vague words like "fix another issue, get_and_free and zswap variant".
>
> Please convince us.
>
>>
>> please see below, Request For Comments, Thanks.
>>
>> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <[email protected]> wrote:
>> > On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> > > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <[email protected]> wrote:
>> > > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > > > >
>> > > > > Modify:
>> > > > > - check the refcount in fail path, free memory if it is not referenced.
>> > > >
>> > > > Hmm, I don't like this because zswap refcount routine is already mess for me.
>> > > > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>> > > >
>> > > > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> > > > the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> > > > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> > > > all of caller can use it easily following pattern.
>> > > >
>> > > > find_get_zswap_entry
>> > > > ...
>> > > > ...
>> > > > zswap_entry_put
>> > > >
>> > > > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > > > so if someone already removes it from the tree, it should avoid double remove.
>> > > >
>> > > > One of the concern I can think is that approach extends critical section
>> > > > but I think it would be no problem because more bottleneck would be [de]compress
>> > > > functions. If it were really problem, we can mitigate a problem with moving
>> > > > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > > > over-enginnering.
>> > >
>> > > I refactor the zswap refcount routine according to Minchan's idea.
>> > > Here is the new patch, Any suggestion is welcomed.
>> > >
>> > > To Seth and Bob, would you please review it again?
>> >
>> > Yeah, Seth, Bob. You guys are right persons to review this because this
>> > scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>> > But anyway, I will review code itself.
>>
>> Consider the following scenario:
>> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
>> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
>> finished, entry x and its zbud is not freed as its refcount != 0
>> now, the swap_map[x] = 0
>> thread 0: now call zswap_get_swap_cache_page
>> swapcache_prepare return -ENOENT because entry x is not used any more
>> zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
>> zswap_writeback_entry do nothing except put refcount
>> Now, the memory of zswap_entry x and its zpage leak.
>>
>> Consider the another scenario:
>> zswap entry x with offset A is already stored in zswap backend.
>> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>> This is a duplicate store and finished.
>> shrink_page_list() finish, now the new page is not in swap_cache
>> thread 0: zswap_get_swap_cache_page get called.
>> old page data is added to swap_cache
>> Now, swap cache has old data rather than new data for offset A.
>> Error will happen If do_swap_page() get page from swap_cache.
>>
>> Modify:
>> - re-design the zswap_entry concurrent protection by using a CAS-access flag
>> instead of the refcount get/put semantic
>>
>> - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
>> can be not only caused by nomem but also by invalidate.
>>
>> Signed-off-by: Weijie Yang <[email protected]>
>> ---
>> mm/zswap.c | 177 +++++++++++++++++++++++++++---------------------------------
>> 1 file changed, 80 insertions(+), 97 deletions(-)
>> mode change 100644 => 100755 mm/zswap.c
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index cbd9578..fcaaecb
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
>> * page within zswap.
>> *
>> * rbnode - links the entry into red-black tree for the appropriate swap type
>> - * refcount - the number of outstanding reference to the entry. This is needed
>> - * to protect against premature freeing of the entry by code
>> - * concurent calls to load, invalidate, and writeback. The lock
>> - * for the zswap_tree structure that contains the entry must
>> - * be held while changing the refcount. Since the lock must
>> - * be held, there is no reason to also make refcount atomic.
>> * offset - the swap offset for the entry. Index into the red-black tree.
>> + * pos - the position or status of this entry. see below macro definition
>> + * change it by CAS, we hold this entry on success or retry if fail
>> + * protect against concurrent access by load, invalidate or writeback
>> + * even though the probability of concurrent access is very small
>> * handle - zsmalloc allocation handle that stores the compressed page data
>> * length - the length in bytes of the compressed page data. Needed during
>> * decompression
>> */
>> +#define POS_POOL 1 /* stay in pool still */
>> +#define POS_LOAD 2 /* is loading */
>> +#define POS_RECLAIM 3 /* is reclaiming */
>> +#define POS_FLUSH 4 /* is invalidating */
>> struct zswap_entry {
>> struct rb_node rbnode;
>> pgoff_t offset;
>> - int refcount;
>> + atomic_t pos;
>> unsigned int length;
>> unsigned long handle;
>> };
>> @@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>> entry = kmem_cache_alloc(zswap_entry_cache, gfp);
>> if (!entry)
>> return NULL;
>> - entry->refcount = 1;
>> return entry;
>> }
>>
>> @@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>> kmem_cache_free(zswap_entry_cache, entry);
>> }
>>
>> -/* caller must hold the tree lock */
>> -static void zswap_entry_get(struct zswap_entry *entry)
>> -{
>> - entry->refcount++;
>> -}
>> -
>> -/* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> -{
>> - entry->refcount--;
>> - return entry->refcount;
>> -}
>>
>> /*********************************
>> * rbtree functions
>> @@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> zswap_pool_pages = zbud_get_pool_size(tree->pool);
>> }
>>
>> +/* Caller must hold the tree lock and the entry must be on the tree.
>> + * Fortunately this is rarely called.
>> + * Returns 0 if the entry was freed,
>> + * or < 0 if the entry is currently held and cannot be freed.
>> + */
>> +static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> +{
>> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
>> + rb_erase(&entry->rbnode, &tree->rbroot);
>> + zswap_free_entry(tree, entry);
>> + return 0;
>> + }
>> +
>> + /* the entry is held by a reclaim running in another thread */
>> + return -1;
>> +}
>> +
>> /*********************************
>> * writeback code
>> **********************************/
>> @@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> enum zswap_get_swap_ret {
>> ZSWAP_SWAPCACHE_NEW,
>> ZSWAP_SWAPCACHE_EXIST,
>> - ZSWAP_SWAPCACHE_NOMEM
>> + ZSWAP_SWAPCACHE_FAIL,
>> };
>>
>> /*
>> @@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
>> * added to the swap cache, and returned in retpage.
>> *
>> * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
>> + * the new page is added to swap cache and locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>> */
>> static int zswap_get_swap_cache_page(swp_entry_t entry,
>> struct page **retpage)
>> @@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>> if (new_page)
>> page_cache_release(new_page);
>> if (!found_page)
>> - return ZSWAP_SWAPCACHE_NOMEM;
>> + return ZSWAP_SWAPCACHE_FAIL;
>> *retpage = found_page;
>> return ZSWAP_SWAPCACHE_EXIST;
>> }
>> @@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>> struct page *page;
>> u8 *src, *dst;
>> unsigned int dlen;
>> - int ret, refcount;
>> + int ret;
>> struct writeback_control wbc = {
>> .sync_mode = WB_SYNC_NONE,
>> };
>> @@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>> spin_unlock(&tree->lock);
>> return 0;
>> }
>> - zswap_entry_get(entry);
>> + /* a concurrent load or invalidate may already hold this entry:
>> + * either we win the CAS and hold it, or we give up
>> + */
>> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
>> + spin_unlock(&tree->lock);
>> + return -EAGAIN;
>> + }
>> spin_unlock(&tree->lock);
>> BUG_ON(offset != entry->offset);
>>
>> /* try to allocate swap cache page */
>> switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> - case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> + case ZSWAP_SWAPCACHE_FAIL:
>> ret = -ENOMEM;
>> goto fail;
>>
>> - case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> + case ZSWAP_SWAPCACHE_EXIST:
>> /* page is already in the swap cache, ignore for now */
>> page_cache_release(page);
>> ret = -EEXIST;
>> @@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>> page_cache_release(page);
>> zswap_written_back_pages++;
>>
>> + /* Once we hold this entry and reach this point, we can free it safely;
>> + * for as long as we hold the entry:
>> + * 1. if a thread calls do_swap_page() concurrently, either we got
>> + * ZSWAP_SWAPCACHE_EXIST returned above, or do_swap_page() will hit
>> + * the page we added to the swap cache
>> + * 2. if a thread calls invalidate concurrently, the invalidating
>> + * thread has to wait until this function ends
>> + */
>> spin_lock(&tree->lock);
>> -
>> - /* drop local reference */
>> - zswap_entry_put(entry);
>> - /* drop the initial reference from entry creation */
>> - refcount = zswap_entry_put(entry);
>> -
>> - /*
>> - * There are three possible values for refcount here:
>> - * (1) refcount is 1, load is in progress, unlink from rbtree,
>> - * load will free
>> - * (2) refcount is 0, (normal case) entry is valid,
>> - * remove from rbtree and free entry
>> - * (3) refcount is -1, invalidate happened during writeback,
>> - * free entry
>> - */
>> - if (refcount >= 0) {
>> - /* no invalidate yet, remove from rbtree */
>> - rb_erase(&entry->rbnode, &tree->rbroot);
>> - }
>> + rb_erase(&entry->rbnode, &tree->rbroot);
>> spin_unlock(&tree->lock);
>> - if (refcount <= 0) {
>> - /* free the entry */
>> - zswap_free_entry(tree, entry);
>> - return 0;
>> - }
>> - return -EAGAIN;
>> + zswap_free_entry(tree, entry);
>> +
>> + return 0;
>>
>> fail:
>> - spin_lock(&tree->lock);
>> - zswap_entry_put(entry);
>> - spin_unlock(&tree->lock);
>> + BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
>> return ret;
>> }
>>
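The fail path above hands the entry back by flipping POS_RECLAIM to POS_POOL, and while reclaim holds the entry a concurrent invalidate or load simply keeps retrying its cmpxchg (the "goto find" loops further down). A small userspace sketch of that wait-for-release interaction, again with C11 atomics and threads; the thread body and the usleep() are illustrative only and not part of the patch:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>
#include <unistd.h>

#define POS_POOL    1
#define POS_RECLAIM 3
#define POS_FLUSH   4

static atomic_int pos = POS_RECLAIM;	/* reclaim already holds the entry */

static int reclaim_thread(void *arg)
{
	(void)arg;
	usleep(1000);			/* stand-in for the writeback work */
	atomic_store(&pos, POS_POOL);	/* fail path: give the entry back */
	return 0;
}

int main(void)
{
	thrd_t t;
	int expected;

	thrd_create(&t, reclaim_thread, NULL);

	/* invalidate: retry until the reclaimer lets go of the entry */
	do {
		expected = POS_POOL;
	} while (!atomic_compare_exchange_weak(&pos, &expected, POS_FLUSH));

	printf("invalidate won the entry, safe to erase and free it\n");
	thrd_join(t, NULL);
	return 0;
}
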
>> @@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>> entry->offset = offset;
>> entry->handle = handle;
>> entry->length = dlen;
>> + atomic_set(&entry->pos, POS_POOL);
>>
>> /* map */
>> spin_lock(&tree->lock);
>> do {
>> ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
>> - if (ret == -EEXIST) {
>> - zswap_duplicate_entry++;
>> - /* remove from rbtree */
>> - rb_erase(&dupentry->rbnode, &tree->rbroot);
>> - if (!zswap_entry_put(dupentry)) {
>> - /* free */
>> - zswap_free_entry(tree, dupentry);
>> - }
>> + if (unlikely(ret == -EEXIST)) {
>> + if (zswap_invalidate_entry(tree, dupentry) == 0)
>> + zswap_duplicate_entry++;
>> }
>> } while (ret == -EEXIST);
>> spin_unlock(&tree->lock);
>> @@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>> struct zswap_entry *entry;
>> u8 *src, *dst;
>> unsigned int dlen;
>> - int refcount, ret;
>> + int ret;
>>
>> /* find */
>> +find:
>> spin_lock(&tree->lock);
>> entry = zswap_rb_search(&tree->rbroot, offset);
>> if (!entry) {
>> @@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>> spin_unlock(&tree->lock);
>> return -1;
>> }
>> - zswap_entry_get(entry);
>> + /* the entry is held by a reclaim running in another thread;
>> + * retry until that thread gets ZSWAP_SWAPCACHE_EXIST and releases it.
>> + * Once we hold the entry, we can do anything with it: load it,
>> + * or load and free it.
>> + */
>> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
>> + spin_unlock(&tree->lock);
>> + goto find;
>> + }
>> spin_unlock(&tree->lock);
>>
>> /* decompress */
>> @@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>> zbud_unmap(tree->pool, entry->handle);
>> BUG_ON(ret);
>>
>> - spin_lock(&tree->lock);
>> - refcount = zswap_entry_put(entry);
>> - if (likely(refcount)) {
>> - spin_unlock(&tree->lock);
>> - return 0;
>> - }
>> - spin_unlock(&tree->lock);
>> -
>> - /*
>> - * We don't have to unlink from the rbtree because
>> - * zswap_writeback_entry() or zswap_frontswap_invalidate page()
>> - * has already done this for us if we are the last reference.
>> - */
>> - /* free */
>> -
>> - zswap_free_entry(tree, entry);
>> + BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);
>>
>> return 0;
>> }
>> @@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>> {
>> struct zswap_tree *tree = zswap_trees[type];
>> struct zswap_entry *entry;
>> - int refcount;
>>
>> /* find */
>> +find:
>> spin_lock(&tree->lock);
>> entry = zswap_rb_search(&tree->rbroot, offset);
>> if (!entry) {
>> @@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>> return;
>> }
>>
>> + /* the entry is held by a reclaim running in another thread;
>> + * retry until that reclaim ends.
>> + */
>> + if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
>> + spin_unlock(&tree->lock);
>> + goto find;
>> + }
>> +
>> /* remove from rbtree */
>> rb_erase(&entry->rbnode, &tree->rbroot);
>> -
>> - /* drop the initial reference from entry creation */
>> - refcount = zswap_entry_put(entry);
>> -
>> spin_unlock(&tree->lock);
>>
>> - if (refcount) {
>> - /* writeback in progress, writeback will free */
>> - return;
>> - }
>> -
>> /* free */
>> zswap_free_entry(tree, entry);
>> }
>> @@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
>> */
>> while ((node = rb_first(&tree->rbroot))) {
>> entry = rb_entry(node, struct zswap_entry, rbnode);
>> - rb_erase(&entry->rbnode, &tree->rbroot);
>> - zbud_free(tree->pool, entry->handle);
>> - zswap_entry_cache_free(entry);
>> - atomic_dec(&zswap_stored_pages);
>> + zswap_invalidate_entry(tree, entry);
>> }
>> tree->rbroot = RB_ROOT;
>> spin_unlock(&tree->lock);
>> --
>> 1.7.10.4
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> --
> Kind regards,
> Minchan Kim