2013-08-18 16:14:47

by Weijie Yang

[permalink] [raw]
Subject: [BUG REPORT] ZSWAP: theoretical race condition issues

I found a few bugs in zswap while reviewing Linux 3.11-rc5, and I also
have some questions about it, described as follows:

BUG:
1. A race condition when reclaiming a page
When a handle is allocated from zbud, zbud considers that handle validly
in use by the upper layer (zswap) and therefore a candidate for reclaim.
But zswap still has to initialize it, e.g. set the swap entry and add
it to the rbtree, so there is a race condition, such as:
thread 0: obtain handle x from zbud_alloc
thread 1: zbud_reclaim_page is called
thread 1: calls back into zswap_writeback_entry to reclaim handle x
thread 1: gets swpentry from handle x (it is a random value at this point)
thread 1: bad things may happen
thread 0: initializes handle x with the swap entry
Of course, this situation almost never happens; it is a "theoretical
race condition" issue.
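To make the interleaving concrete, here is a minimal single-threaded model of the two orderings (illustrative structures only, not the real zbud code): if the handle becomes visible to reclaim before zswap writes the swap entry into it, reclaim reads garbage; fully initializing before publication closes the window.

```c
#include <assert.h>

/* Toy model of a zbud handle: the first word is where zswap stores the
 * swap entry; on_lru models visibility to zbud_reclaim_page(). */
struct handle {
    unsigned long swpentry;
    int on_lru;
};

/* Racy ordering: handle published to the LRU before initialization.
 * A reclaim running in the gap reads whatever garbage is in memory. */
static unsigned long racy_interleaving(void)
{
    struct handle h = { .swpentry = 0xdeadbeefUL, .on_lru = 0 };
    h.on_lru = 1;                        /* thread 0: zbud_alloc() */
    unsigned long seen = h.swpentry;     /* thread 1: reclaim reads it */
    h.swpentry = 42;                     /* thread 0: init, but too late */
    return seen;
}

/* Safe ordering: initialize the handle before it can be reclaimed. */
static unsigned long safe_interleaving(void)
{
    struct handle h = { .swpentry = 0xdeadbeefUL, .on_lru = 0 };
    h.swpentry = 42;                     /* thread 0: init first */
    h.on_lru = 1;                        /* thread 0: publish to LRU */
    return h.swpentry;                   /* thread 1: reads a valid entry */
}
```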

2. Polluting swapcache data by adding a pre-invalidated swap page
When a swap_entry is invalidated, it can be reused by another anonymous
page. If zswap is reclaiming the old page at the same time, it pollutes
the swapcache of the new page, because the old page and the new page use
the same swap_entry, such as:
thread 1: zswap reclaims entry x
thread 0: zswap_frontswap_invalidate_page entry x
thread 0: entry x is reused by another anonymous page
thread 1: adds the old data to the swapcache of entry x
thread 0: the swapcache of entry x is polluted
Of course, this situation almost never happens; it is another
"theoretical race condition" issue.

3. Frontswap uses the frontswap_map bitmap to track pages in the "backend"
implementation; when zswap reclaims a page, the corresponding bitmap
record is not cleared.

4. zswap_tree is not freed at swapoff and is re-kzalloc'ed at the next
swapon, so a memory leak occurs.
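A user-space sketch of this leak (structure names modeled on zswap but simplified): swapon allocates a fresh tree, and unless the swapoff path frees the tree and clears the slot, the old allocation is lost.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_SWAPFILES 32

struct zswap_tree { int dummy; };          /* stand-in for the real tree */
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];

/* Called at swapon: kzalloc modeled with calloc. */
static void zswap_frontswap_init(unsigned type)
{
    zswap_trees[type] = calloc(1, sizeof(struct zswap_tree));
}

/* Leaky swapoff path: entries are freed, but the tree itself is not,
 * and the slot still points at it when the next swapon reallocates. */
static void invalidate_area_leaky(unsigned type)
{
    (void)type;                            /* tree left allocated: leak */
}

/* Fixed path: release the tree and clear the slot. */
static void invalidate_area_fixed(unsigned type)
{
    free(zswap_trees[type]);
    zswap_trees[type] = NULL;
}
```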

Questions:
1. How about calling SetPageReclaim before __swap_writepage, so that the
page is moved to the tail of the inactive list?
2. zswap uses the GFP_KERNEL flag for allocations in its store and reclaim
functions; can this lead to these functions being called recursively?
3. To reclaim one zbud page that contains two buddies, zswap needs to
allocate two pages. Is this reclaim cost-efficient?


2013-08-19 02:17:56

by Bob Liu

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

Hi Weijie,

On 08/19/2013 12:14 AM, Weijie Yang wrote:
> I found a few bugs in zswap when I review Linux-3.11-rc5, and I have
> also some questions about it, described as following:
>
> BUG:
> 1. A race condition when reclaim a page
> when a handle alloced from zbud, zbud considers this handle is used
> validly by upper(zswap) and can be a candidate for reclaim.
> But zswap has to initialize it such as setting swapentry and addding
> it to rbtree. so there is a race condition, such as:
> thread 0: obtain handle x from zbud_alloc
> thread 1: zbud_reclaim_page is called
> thread 1: callback zswap_writeback_entry to reclaim handle x
> thread 1: get swpentry from handle x (it is random value now)
> thread 1: bad thing may happen
> thread 0: initialize handle x with swapentry

Yes, this may potentially happen, but only in a rare case.
Because we have an LRU list for page frames: after thread 0 calls
zbud_alloc, the corresponding page is added to the head of the LRU
list, while zbud_reclaim_page (called by thread 1) starts from the tail
of the LRU list.

> Of course, this situation almost never happen, it is a "theoretical
> race condition" issue.
>
> 2. Pollute swapcache data by add a pre-invalided swap page
> when a swap_entry is invalidated, it will be reused by other anon
> page. At the same time, zswap is reclaiming old page, pollute
> swapcache of new page as a result, because old page and new page use
> the same swap_entry, such as:
> thread 1: zswap reclaim entry x
> thread 0: zswap_frontswap_invalidate_page entry x
> thread 0: entry x reused by other anon page
> thread 1: add old data to swapcache of entry x

I didn't get your idea here; why would thread 1 add old data to entry x?

> thread 0: swapcache of entry x is polluted
> Of course, this situation almost never happen, it is another
> "theoretical race condition" issue.
>
> 3. Frontswap uses frontswap_map bitmap to track page in "backend"
> implementation, when zswap reclaim a
> page, the corresponding bitmap record is not cleared.
>

That's true, but I don't think it's a big problem.
It only wastes a little time searching the rbtree during zswap_frontswap_load().

> 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in
> swapon, memory leak occurs.

Nice catch! I think it should be freed in zswap_frontswap_invalidate_area().

>
> questions:
> 1. How about SetPageReclaim befor __swap_writepage, so that move it to
> the tail of the inactive list?

It will be added to the inactive list now.

> 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim
> function, does this lead to these function called recursively?

Yes, that's a potential problem.

> 3. for reclaiming one zbud page which contains two buddies, zswap
> needs to alloc two pages. Does this reclaim cost-efficient?
>

Yes, that's a problem too. And that's why we use zbud as the default
allocator instead of zsmalloc.
I think improving the writeback path is the next important step for zswap.

--
Regards,
-Bob

2013-08-19 05:47:21

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote:
> Hi Weijie,
>
> On 08/19/2013 12:14 AM, Weijie Yang wrote:
> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have
> > also some questions about it, described as following:
> >
> > BUG:
> > 1. A race condition when reclaim a page
> > when a handle alloced from zbud, zbud considers this handle is used
> > validly by upper(zswap) and can be a candidate for reclaim.
> > But zswap has to initialize it such as setting swapentry and addding
> > it to rbtree. so there is a race condition, such as:
> > thread 0: obtain handle x from zbud_alloc
> > thread 1: zbud_reclaim_page is called
> > thread 1: callback zswap_writeback_entry to reclaim handle x
> > thread 1: get swpentry from handle x (it is random value now)
> > thread 1: bad thing may happen
> > thread 0: initialize handle x with swapentry

Nice catch!

>
> Yes, this may happen potentially but in rare case.
> Because we have a LRU list for page frames, after Thread 0 called
> zbud_alloc the corresponding page will be add to the head of LRU
> list,While zbud_reclaim_page(Thread 1 called) is started from the tail
> of LRU list.
>
> > Of course, this situation almost never happen, it is a "theoretical
> > race condition" issue.

But it's doable and we should prevent it even if you feel it's rare,
because the system could hang. When I look at the code: why should zbud
have the LRU logic instead of zswap? If I missed some history, sorry about that.
But at least to me, zbud is just an allocator, so its role should be to
handle allocating/freeing objects; how a client of the allocator uses the
objects is up to the upper layer, so zswap should handle the LRU. If so,
we wouldn't encounter this problem, either.

> >
> > 2. Pollute swapcache data by add a pre-invalided swap page
> > when a swap_entry is invalidated, it will be reused by other anon
> > page. At the same time, zswap is reclaiming old page, pollute
> > swapcache of new page as a result, because old page and new page use
> > the same swap_entry, such as:
> > thread 1: zswap reclaim entry x
> > thread 0: zswap_frontswap_invalidate_page entry x
> > thread 0: entry x reused by other anon page
> > thread 1: add old data to swapcache of entry x
>
> I didn't get your idea here, why thread1 will add old data to entry x?
>
> > thread 0: swapcache of entry x is polluted
> > Of course, this situation almost never happen, it is another
> > "theoretical race condition" issue.

Doesn't swapcache_prepare close the race?

> >
> > 3. Frontswap uses frontswap_map bitmap to track page in "backend"
> > implementation, when zswap reclaim a
> > page, the corresponding bitmap record is not cleared.
> >
>
> That's true, but I don't think it's a big problem.
> Only waste little time to search rbtree during zswap_frontswap_load().
>
> > 4. zswap_tree is not freed when swapoff, and it got re-kzalloc in
> > swapon, memory leak occurs.
>
> Nice catch! I think it should be freed in zswap_frontswap_invalidate_area().
>
> >
> > questions:
> > 1. How about SetPageReclaim befor __swap_writepage, so that move it to
> > the tail of the inactive list?

It's a good idea to avoid unnecessary page scanning.

>
> It will be added to inactive now.
>
> > 2. zswap uses GFP_KERNEL flag to alloc things in store and reclaim
> > function, does this lead to these function called recursively?
>
> Yes, that's a potential problem.

It should use GFP_NOIO.

>
> > 3. for reclaiming one zbud page which contains two buddies, zswap
> > needs to alloc two pages. Does this reclaim cost-efficient?

It would be better to evict the zpage, i.e. the compressed sequence of
bytes, rather than the decompressed PAGE_SIZE page, when we are about to
reclaim it, but that is the hard part given the frontswap API.

> >
>
> Yes, that's a problem too. And that's why we use zbud as the default
> allocator instead of zsmalloc.
> I think improving the write back path of zswap is the next important
> step for zswap.
>
> --
> Regards,
> -Bob
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

--
Kind regards,
Minchan Kim

2013-08-20 15:30:32

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

2013/8/19 Minchan Kim <[email protected]>:
> On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote:
>> Hi Weijie,
>>
>> On 08/19/2013 12:14 AM, Weijie Yang wrote:
>> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have
>> > also some questions about it, described as following:
>> >
>> > BUG:
>> > 1. A race condition when reclaim a page
>> > when a handle alloced from zbud, zbud considers this handle is used
>> > validly by upper(zswap) and can be a candidate for reclaim.
>> > But zswap has to initialize it such as setting swapentry and addding
>> > it to rbtree. so there is a race condition, such as:
>> > thread 0: obtain handle x from zbud_alloc
>> > thread 1: zbud_reclaim_page is called
>> > thread 1: callback zswap_writeback_entry to reclaim handle x
>> > thread 1: get swpentry from handle x (it is random value now)
>> > thread 1: bad thing may happen
>> > thread 0: initialize handle x with swapentry
>
> Nice catch!
>
>>
>> Yes, this may happen potentially but in rare case.
>> Because we have a LRU list for page frames, after Thread 0 called
>> zbud_alloc the corresponding page will be add to the head of LRU
>> list,While zbud_reclaim_page(Thread 1 called) is started from the tail
>> of LRU list.
>>
>> > Of course, this situation almost never happen, it is a "theoretical
>> > race condition" issue.
>
> But it's doable and we should prevent that although you feel it's rare
> because system could go hang. When I look at the code, Why should zbud
> have LRU logic instead of zswap? If I missed some history, sorry about that.
> But at least to me, zbud is just allocator so it should have a role
> to handle alloc/free object and how client of the allocator uses objects
> depends on the upper layer so zbud should handle LRU. If so, we wouldn't
> encounter this problem, either.
>
>> >
>> > 2. Pollute swapcache data by add a pre-invalided swap page
>> > when a swap_entry is invalidated, it will be reused by other anon
>> > page. At the same time, zswap is reclaiming old page, pollute
>> > swapcache of new page as a result, because old page and new page use
>> > the same swap_entry, such as:
>> > thread 1: zswap reclaim entry x
>> > thread 0: zswap_frontswap_invalidate_page entry x
>> > thread 0: entry x reused by other anon page
>> > thread 1: add old data to swapcache of entry x
>>
>> I didn't get your idea here, why thread1 will add old data to entry x?
>>
>> > thread 0: swapcache of entry x is polluted
>> > Of course, this situation almost never happen, it is another
>> > "theoretical race condition" issue.
>
> Don't swapcache_prepare close the race?

Yes, I made a mistake; there is no race here.
However, after a more careful review I found another bug here. It is
not only "theoretical", it will really happen, as:
thread 1: zswap reclaims entry x (gets the refcount, but does not call
zswap_get_swap_cache_page yet)
thread 0: zswap_frontswap_invalidate_page entry x (finishes; entry x
and its zbud are not freed because its refcount != 0)
now, swap_map[x] = 0
thread 1: zswap_get_swap_cache_page is called; swapcache_prepare returns
-ENOENT because entry x is no longer in use
zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM
zswap_writeback_entry does nothing except put the refcount
now, the memory of zswap_entry x leaks and its zpage becomes a zombie
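The leak can be replayed with a toy refcount model (illustrative, not the real zswap code): with a put() that only decrements, the writeback thread's final put leaves the entry at refcount 0 but never freed; a put() that frees on the last reference closes the leak.

```c
#include <assert.h>

struct zswap_entry { int refcount; int freed; };

/* Old-style put: only decrements; freeing was done elsewhere. */
static void entry_put_buggy(struct zswap_entry *e)
{
    e->refcount--;
}

/* Fixed put: free the entry when the last reference is dropped. */
static void entry_put_fixed(struct zswap_entry *e)
{
    if (--e->refcount == 0)
        e->freed = 1;                    /* models zswap_free_entry() */
}

/* Replay the interleaving from the mail with a given put(). */
static int replay(void (*put)(struct zswap_entry *))
{
    struct zswap_entry x = { .refcount = 1, .freed = 0 };
    x.refcount++;   /* thread 1: writeback takes a reference           */
    put(&x);        /* thread 0: invalidate_page drops the tree's ref;
                     * refcount != 0, so it cannot free x here         */
    put(&x);        /* thread 1: swapcache_prepare returns -ENOENT,
                     * writeback bails out and only puts its refcount  */
    return x.freed;
}
```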

Best Regards,
Weijie Yang

2013-08-21 07:49:12

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

Hello,

On Tue, Aug 20, 2013 at 11:30:29PM +0800, Weijie Yang wrote:
> 2013/8/19 Minchan Kim <[email protected]>:
> > On Mon, Aug 19, 2013 at 10:17:38AM +0800, Bob Liu wrote:
> >> Hi Weijie,
> >>
> >> On 08/19/2013 12:14 AM, Weijie Yang wrote:
> >> > I found a few bugs in zswap when I review Linux-3.11-rc5, and I have
> >> > also some questions about it, described as following:
> >> >
> >> > BUG:
> >> > 1. A race condition when reclaim a page
> >> > when a handle alloced from zbud, zbud considers this handle is used
> >> > validly by upper(zswap) and can be a candidate for reclaim.
> >> > But zswap has to initialize it such as setting swapentry and addding
> >> > it to rbtree. so there is a race condition, such as:
> >> > thread 0: obtain handle x from zbud_alloc
> >> > thread 1: zbud_reclaim_page is called
> >> > thread 1: callback zswap_writeback_entry to reclaim handle x
> >> > thread 1: get swpentry from handle x (it is random value now)
> >> > thread 1: bad thing may happen
> >> > thread 0: initialize handle x with swapentry
> >
> > Nice catch!
> >
> >>
> >> Yes, this may happen potentially but in rare case.
> >> Because we have a LRU list for page frames, after Thread 0 called
> >> zbud_alloc the corresponding page will be add to the head of LRU
> >> list,While zbud_reclaim_page(Thread 1 called) is started from the tail
> >> of LRU list.
> >>
> >> > Of course, this situation almost never happen, it is a "theoretical
> >> > race condition" issue.
> >
> > But it's doable and we should prevent that although you feel it's rare
> > because system could go hang. When I look at the code, Why should zbud
> > have LRU logic instead of zswap? If I missed some history, sorry about that.
> > But at least to me, zbud is just allocator so it should have a role
> > to handle alloc/free object and how client of the allocator uses objects
> > depends on the upper layer so zbud should handle LRU. If so, we wouldn't
> > encounter this problem, either.
> >
> >> >
> >> > 2. Pollute swapcache data by add a pre-invalided swap page
> >> > when a swap_entry is invalidated, it will be reused by other anon
> >> > page. At the same time, zswap is reclaiming old page, pollute
> >> > swapcache of new page as a result, because old page and new page use
> >> > the same swap_entry, such as:
> >> > thread 1: zswap reclaim entry x
> >> > thread 0: zswap_frontswap_invalidate_page entry x
> >> > thread 0: entry x reused by other anon page
> >> > thread 1: add old data to swapcache of entry x
> >>
> >> I didn't get your idea here, why thread1 will add old data to entry x?
> >>
> >> > thread 0: swapcache of entry x is polluted
> >> > Of course, this situation almost never happen, it is another
> >> > "theoretical race condition" issue.
> >
> > Don't swapcache_prepare close the race?
>
> Yes, I made a mistake, there is not a race here.
> However, I find another bug here after my more careful review. It is
> not only "theoretical", it will happen really. as:
> thread 1: zswap reclaim entry x (get the refcount, but not call
> zswap_get_swap_cache_page yet)
> thread 0: zswap_frontswap_invalidate_page entry x (finished, entry x
> and its zbud is not freed as its refcount != 0)
> now, the swap_map[x] = 0
> thread 1: zswap_get_swap_cache_page called, swapcache_prepare return
> -ENOENT because entry x is not used any more
> zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
> zswap_writeback_entry do nothing except put refcount
> now, the memory of zswap_entry x leaks and its zpage become a zombie

It makes sense to me.
Maybe we should introduce ZSWAP_SWAPCACHE_NOENT and free the entry in
that case. Would you mind sending patches for the problems you found?

>
> Best Regards,
> Weijie Yang
>

--
Kind regards,
Minchan Kim

2013-09-25 08:09:09

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

I think I have found a new issue; for the integrity of this mail thread,
I reply to this mail.

It is also a concurrency issue, occurring when a duplicate store and a
reclaim run concurrently.

zswap entry x with offset A is already stored in the zswap backend.
Consider the following scenario:

thread 0: reclaims entry x (gets the refcount, but has not called
zswap_get_swap_cache_page yet)

thread 1: stores a new page with the same offset A, allocating a new
zswap entry y. The store finishes; shrink_page_list() calls
__remove_mapping(), and the page is now no longer in the swap cache

thread 0: zswap_get_swap_cache_page is called; the old page data is
added to the swap cache

Now, the swap cache has the old data rather than the new data for offset A.
An error will happen if do_swap_page() gets the page from the swap cache.

2013-09-25 08:31:11

by Bob Liu

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
> I think I find a new issue, for integrity of this mail thread, I reply
> to this mail.
>
> It is a concurrence issue either, when duplicate store and reclaim
> concurrentlly.
>
> zswap entry x with offset A is already stored in zswap backend.
> Consider the following scenario:
>
> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>
> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> store finished. shrink_page_list() call __remove_mapping(), and now
> it is not in swap_cache
>

But I don't think the swap layer will call zswap with the same offset A.

> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
>
> Now, swap cache has old data rather than new data for offset A.
> error will happen If do_swap_page() get page from swap_cache.
>

--
Regards,
--Bob

2013-09-25 09:33:46

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>> I think I find a new issue, for integrity of this mail thread, I reply
>> to this mail.
>>
>> It is a concurrence issue either, when duplicate store and reclaim
>> concurrentlly.
>>
>> zswap entry x with offset A is already stored in zswap backend.
>> Consider the following scenario:
>>
>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>>
>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>> store finished. shrink_page_list() call __remove_mapping(), and now
>> it is not in swap_cache
>>
>
> But I don't think swap layer will call zswap with the same offset A.

1. store a page at offset A in zswap
2. some time later, a page fault occurs and the page data is loaded from
zswap. But notice that zswap entry x is still in zswap, because
frontswap_tmem_exclusive_gets is not enabled.
This page has PageSwapCache(page) set and page_private(page) == entry.val
3. the page data is changed and the page becomes dirty
4. some time later again, the page is swapped out to the same offset A.

So, a duplicate store happens.

What I can think of is to use flags and CAS to protect against a store
and a reclaim on the same offset happening concurrently.
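One way to sketch the "flags and CAS" idea in user space (hypothetical names; C11 atomics stand in for the kernel's cmpxchg): each offset carries a state word, and a store or a reclaim must win a compare-and-swap from IDLE to BUSY before touching the entry, so the two can never operate on the same offset at once.

```c
#include <assert.h>
#include <stdatomic.h>

/* Per-offset state word guarding the zswap entry for that offset. */
enum { IDLE = 0, BUSY = 1 };

static _Atomic int offset_state = IDLE;

/* Returns 1 if this thread won the CAS and may proceed with its
 * store/reclaim; 0 if another operation already holds the offset. */
static int try_begin_op(void)
{
    int expected = IDLE;
    return atomic_compare_exchange_strong(&offset_state, &expected, BUSY);
}

/* Release the offset once the store/reclaim is finished. */
static void end_op(void)
{
    atomic_store(&offset_state, IDLE);
}
```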

>> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
>>
>> Now, swap cache has old data rather than new data for offset A.
>> error will happen If do_swap_page() get page from swap_cache.
>>
>
> --
> Regards,
> --Bob

2013-09-25 10:02:17

by Bob Liu

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang <[email protected]> wrote:
> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>>> I think I find a new issue, for integrity of this mail thread, I reply
>>> to this mail.
>>>
>>> It is a concurrence issue either, when duplicate store and reclaim
>>> concurrentlly.
>>>
>>> zswap entry x with offset A is already stored in zswap backend.
>>> Consider the following scenario:
>>>
>>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>>>
>>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>>> store finished. shrink_page_list() call __remove_mapping(), and now
>>> it is not in swap_cache
>>>
>>
>> But I don't think swap layer will call zswap with the same offset A.
>
> 1. store page of offset A in zswap
> 2. some time later, pagefault occur, load page data from zswap.
> But notice that zswap entry x is still in zswap because it is not

Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase().

> frontswap_tmem_exclusive_gets_enabled.
> this page is with PageSwapCache(page) and page_private(page) = entry.val
> 3. change this page data, and it become dirty
> 4. some time later again, swap this page on the same offset A.
>
> so, a duplicate store happens.
>

Then I think we should erase the entry from the rbtree in zswap_frontswap_load().
After the page is decompressed and loaded from zswap, still storing
the compressed data in zswap is meaningless.
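This suggestion amounts to an exclusive load. A minimal model (an array instead of an rbtree, hypothetical names): the default load leaves the entry in place, so a later swapout of the same offset becomes a duplicate store, while an exclusive load erases it, so the next swapout is a plain store.

```c
#include <assert.h>

#define NOFFS 8
static int stored[NOFFS];     /* 1 if a zswap entry exists for the offset */

static void zswap_store(unsigned off)
{
    stored[off] = 1;
}

/* Default load: data is copied out, the entry stays behind. */
static int load_default(unsigned off)
{
    return stored[off];
}

/* Exclusive load: erase the entry once the page is back in memory,
 * so the next swapout of this offset is a fresh store, not a duplicate. */
static int load_exclusive(unsigned off)
{
    int hit = stored[off];
    stored[off] = 0;          /* models rb_erase() + free in zswap */
    return hit;
}
```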

--
Regards,
--Bob

2013-09-26 02:06:43

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Wed, Sep 25, 2013 at 6:02 PM, Bob Liu <[email protected]> wrote:
> On Wed, Sep 25, 2013 at 5:33 PM, Weijie Yang <[email protected]> wrote:
>> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
>>> On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>>>> I think I find a new issue, for integrity of this mail thread, I reply
>>>> to this mail.
>>>>
>>>> It is a concurrence issue either, when duplicate store and reclaim
>>>> concurrentlly.
>>>>
>>>> zswap entry x with offset A is already stored in zswap backend.
>>>> Consider the following scenario:
>>>>
>>>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>>>>
>>>> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>>>> store finished. shrink_page_list() call __remove_mapping(), and now
>>>> it is not in swap_cache
>>>>
>>>
>>> But I don't think swap layer will call zswap with the same offset A.
>>
>> 1. store page of offset A in zswap
>> 2. some time later, pagefault occur, load page data from zswap.
>> But notice that zswap entry x is still in zswap because it is not
>
> Sorry I didn't notice that zswap_frontswap_load() doesn't call rb_erase().
>
>> frontswap_tmem_exclusive_gets_enabled.
>> this page is with PageSwapCache(page) and page_private(page) = entry.val
>> 3. change this page data, and it become dirty
>> 4. some time later again, swap this page on the same offset A.
>>
>> so, a duplicate store happens.
>>
>
> Then I think we should erase the entry from rbtree in zswap_frontswap_load().
> After the page is decompressed and loaded from zswap, still storing
> the compressed data in zswap is meanless.

Of course, erasing the entry after load() can resolve this.
However, this problem is not simple, and it is interesting.

If we drop the entry after load(), we should SetPageDirty on the page,
which will generate more swapout even if we never change the page data.
I think that is why frontswap has two load modes: default (used now) and
exclusive_gets.

I don't have test data, but I think which mode is better depends on the
workload. It would be better to implement both modes; I will try that.

> --
> Regards,
> --Bob

2013-09-26 05:57:30

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

Hello Weijie,

On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
> >> I think I find a new issue, for integrity of this mail thread, I reply
> >> to this mail.
> >>
> >> It is a concurrence issue either, when duplicate store and reclaim
> >> concurrentlly.
> >>
> >> zswap entry x with offset A is already stored in zswap backend.
> >> Consider the following scenario:
> >>
> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
> >>
> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> >> store finished. shrink_page_list() call __remove_mapping(), and now
> >> it is not in swap_cache
> >>
> >
> > But I don't think swap layer will call zswap with the same offset A.
>
> 1. store page of offset A in zswap
> 2. some time later, pagefault occur, load page data from zswap.
> But notice that zswap entry x is still in zswap because it is not
> frontswap_tmem_exclusive_gets_enabled.

frontswap_tmem_exclusive_gets_enabled is just an option to trade off CPU
burning from frequent swapout against the memory footprint of a duplicate
copy in the swap cache and the frontswap backend, so it shouldn't affect
stability.

> this page is with PageSwapCache(page) and page_private(page) = entry.val
> 3. change this page data, and it become dirty

If a non-shared swapped-in page is redirtied, it is removed from the
swapcache. If a shared swapped-in page is redirtied, CoW kicks in, so it
is a new page that doesn't live in the swap cache; that means it gets a
new offset, different from the old one, for swapout.

What's wrong with that?


> 4. some time later again, swap this page on the same offset A.
>
> so, a duplicate store happens.
>
> what I can think is that use flags and CAS to protect store and reclaim on
> the same offset happens concurrentlly.
>
> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
> >>
> >> Now, swap cache has old data rather than new data for offset A.
> >> error will happen If do_swap_page() get page from swap_cache.
> >>
> >
> > --
> > Regards,
> > --Bob
>

--
Kind regards,
Minchan Kim

2013-09-26 07:26:35

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim <[email protected]> wrote:
> Hello Weigie,
>
> On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
>> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
>> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>> >> I think I find a new issue, for integrity of this mail thread, I reply
>> >> to this mail.
>> >>
>> >> It is a concurrence issue either, when duplicate store and reclaim
>> >> concurrentlly.
>> >>
>> >> zswap entry x with offset A is already stored in zswap backend.
>> >> Consider the following scenario:
>> >>
>> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>> >>
>> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>> >> store finished. shrink_page_list() call __remove_mapping(), and now
>> >> it is not in swap_cache
>> >>
>> >
>> > But I don't think swap layer will call zswap with the same offset A.
>>
>> 1. store page of offset A in zswap
>> 2. some time later, pagefault occur, load page data from zswap.
>> But notice that zswap entry x is still in zswap because it is not
>> frontswap_tmem_exclusive_gets_enabled.
>
> frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff
> between CPU burining by frequent swapout and memory footprint by duplicate
> copy in swap cache and frontswap backend so it shouldn't affect the stability.

Thanks for explaining this.
I don't mean to say this option affects the stability, but that zswap
only implements one of the options. Maybe it's better to implement both
options for different workloads.

>> this page is with PageSwapCache(page) and page_private(page) = entry.val
>> 3. change this page data, and it become dirty
>
> If non-shared swapin page become redirty, it should remove the page from
> swapcache. If shared swapin page become redirty, it should do CoW so it's a
> new page so that it doesn't live in swap cache. It means it should have new
> offset which is different with old's one for swap out.
>
> What's wrong with that?

It is really not a valid scenario for a duplicate store, and I cannot
think of one. If a duplicate store is impossible, how about deleting the
handling code in zswap? If it does exist, I think there is a potential
issue as I described.

>> 4. some time later again, swap this page on the same offset A.
>>
>> so, a duplicate store happens.
>>
>> what I can think is that use flags and CAS to protect store and reclaim on
>> the same offset happens concurrentlly.
>>
>> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
>> >>
>> >> Now, swap cache has old data rather than new data for offset A.
>> >> error will happen If do_swap_page() get page from swap_cache.
>> >>
>> >
>> > --
>> > Regards,
>> > --Bob
>>
>
> --
> Kind regards,
> Minchan Kim

2013-09-26 07:56:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim <[email protected]> wrote:
> > Hello Weigie,
> >
> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
> >> >> I think I find a new issue, for integrity of this mail thread, I reply
> >> >> to this mail.
> >> >>
> >> >> It is a concurrence issue either, when duplicate store and reclaim
> >> >> concurrentlly.
> >> >>
> >> >> zswap entry x with offset A is already stored in zswap backend.
> >> >> Consider the following scenario:
> >> >>
> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
> >> >>
> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> >> >> store finished. shrink_page_list() call __remove_mapping(), and now
> >> >> it is not in swap_cache
> >> >>
> >> >
> >> > But I don't think swap layer will call zswap with the same offset A.
> >>
> >> 1. store a page of offset A in zswap
> >> 2. some time later, a page fault occurs and loads the page data from zswap.
> >> But notice that zswap entry x is still in zswap because
> >> frontswap_tmem_exclusive_gets is not enabled.
> >
> > frontswap_tmem_exclusive_gets_enabled is just an option to trade off CPU
> > burning from frequent swapout against the memory footprint of a duplicate
> > copy in the swap cache and the frontswap backend, so it shouldn't affect stability.
>
> Thanks for explaining this.
> I don't mean to say this option affects the stability, but that zswap
> only realize one option. Maybe it's better to realize both options for
> different workloads.

"zswap only relize one option"
What does it mena? Sorry. I couldn't parse your intention. :)
You mean zswap should do something special to support frontswap_tmem_exclusive_gets?

>
> >> this page has PageSwapCache(page) set and page_private(page) = entry.val
> >> 3. this page's data is changed, and it becomes dirty
> >
> > If a non-shared swapin page is redirtied, the page should be removed from the
> > swap cache. If a shared swapin page is redirtied, it should do CoW, so it's a
> > new page that doesn't live in the swap cache. That means it should have a new
> > offset, different from the old one, for swap out.
> >
> > What's wrong with that?
>
> It is really not a valid scenario for a duplicate store, and I cannot think of one.
> If a duplicate store is impossible, how about deleting the handling code in zswap?
> If it does exist, I think there is a potential issue as I described.

You mean "zswap_duplicate_entry"?
AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
he said that he didn't know exact reason but he saw that case during
experiement so copy the code peice from zcache.

Do you see the case, too?

Anyway, we need to dive into that to know what happens and then open
our eyes for clear solution before dumping meaningless patch.

I hope Seth or Bob already know it.

>
> >> 4. some time later again, swap this page on the same offset A.
> >>
> >> so, a duplicate store happens.
> >>
> >> what I can think of is to use flags and CAS to prevent a store and a reclaim
> >> on the same offset from happening concurrently.
> >>
> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
> >> >>
> >> >> Now, swap cache has old data rather than new data for offset A.
> >> >> An error will happen if do_swap_page() gets the page from swap_cache.
> >> >>
> >> >
> >> > --
> >> > Regards,
> >> > --Bob
> >>
> >
> > --
> > Kind regards,
> > Minchan Kim
>

--
Kind regards,
Minchan Kim

2013-09-26 08:48:09

by Weijie Yang

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim <[email protected]> wrote:
>> > Hello Weigie,
>> >
>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>> >> >> I think I find a new issue, for integrity of this mail thread, I reply
>> >> >> to this mail.
>> >> >>
>> >> >> It is a concurrence issue either, when duplicate store and reclaim
>> >> >> concurrentlly.
>> >> >>
>> >> >> zswap entry x with offset A is already stored in zswap backend.
>> >> >> Consider the following scenario:
>> >> >>
>> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>> >> >>
>> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>> >> >> store finished. shrink_page_list() call __remove_mapping(), and now
>> >> >> it is not in swap_cache
>> >> >>
>> >> >
>> >> > But I don't think swap layer will call zswap with the same offset A.
>> >>
>> >> 1. store page of offset A in zswap
>> >> 2. some time later, pagefault occur, load page data from zswap.
>> >> But notice that zswap entry x is still in zswap because it is not
>> >> frontswap_tmem_exclusive_gets_enabled.
>> >
>> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff
>> > between CPU burining by frequent swapout and memory footprint by duplicate
>> > copy in swap cache and frontswap backend so it shouldn't affect the stability.
>>
>> Thanks for explain this.
>> I don't mean to say this option affects the stability, but that zswap
>> only realize
>> one option. Maybe it's better to realize both options for different workloads.
>
> "zswap only relize one option"
> What does it mena? Sorry. I couldn't parse your intention. :)
> You mean zswap should do something special to support frontswap_tmem_exclusive_gets?

Yes. But I am not sure whether it is worth it.

>>
>> >> this page is with PageSwapCache(page) and page_private(page) = entry.val
>> >> 3. change this page data, and it become dirty
>> >
>> > If non-shared swapin page become redirty, it should remove the page from
>> > swapcache. If shared swapin page become redirty, it should do CoW so it's a
>> > new page so that it doesn't live in swap cache. It means it should have new
>> > offset which is different with old's one for swap out.
>> >
>> > What's wrong with that?
>>
>> It is really not a right scene for duplicate store. And I can not think out one.
>> If duplicate store is impossible, How about delete the handle code in zswap?
>> If it does exist, I think there is a potential issue as I described.
>
> You mean "zswap_duplicate_entry"?
> AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
> he said that he didn't know exact reason but he saw that case during
> experiement so copy the code peice from zcache.
>
> Do you see the case, too?

Yes, I mean duplicate store.
I checked /Documentation/vm/frontswap.txt; it mentions "duplicate stores",
but I am still confused.

I wrote a zcache variant which swaps compressed pages out to a swapfile.
I did see that case when I tested it on an Android smartphone (ARMv7),
and it happens rarely and occasionally.
In one test, only 1 duplicate store occurred in about 3,157,162 stores.

> Anyway, we need to dive into that to know what happens and then open
> our eyes for clear solution before dumping meaningless patch.
>
> I hope Seth or Bob already know it.
>
>>
>> >> 4. some time later again, swap this page on the same offset A.
>> >>
>> >> so, a duplicate store happens.
>> >>
>> >> what I can think is that use flags and CAS to protect store and reclaim on
>> >> the same offset happens concurrentlly.
>> >>
>> >> >> thread 0: zswap_get_swap_cache_page called. old page data is added to swap_cache
>> >> >>
>> >> >> Now, swap cache has old data rather than new data for offset A.
>> >> >> error will happen If do_swap_page() get page from swap_cache.
>> >> >>
>> >> >
>> >> > --
>> >> > Regards,
>> >> > --Bob
>> >>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>
> --
> Kind regards,
> Minchan Kim

2013-09-26 09:56:56

by Bob Liu

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

On Thu, Sep 26, 2013 at 4:48 PM, Weijie Yang <[email protected]> wrote:
> On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim <[email protected]> wrote:
>> On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
>>> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim <[email protected]> wrote:
>>> > Hello Weigie,
>>> >
>>> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
>>> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
>>> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
>>> >> >> I think I find a new issue, for integrity of this mail thread, I reply
>>> >> >> to this mail.
>>> >> >>
>>> >> >> It is a concurrence issue either, when duplicate store and reclaim
>>> >> >> concurrentlly.
>>> >> >>
>>> >> >> zswap entry x with offset A is already stored in zswap backend.
>>> >> >> Consider the following scenario:
>>> >> >>
>>> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>>> >> >>
>>> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
>>> >> >> store finished. shrink_page_list() call __remove_mapping(), and now
>>> >> >> it is not in swap_cache
>>> >> >>
>>> >> >
>>> >> > But I don't think swap layer will call zswap with the same offset A.
>>> >>
>>> >> 1. store page of offset A in zswap
>>> >> 2. some time later, pagefault occur, load page data from zswap.
>>> >> But notice that zswap entry x is still in zswap because it is not
>>> >> frontswap_tmem_exclusive_gets_enabled.
>>> >
>>> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff
>>> > between CPU burining by frequent swapout and memory footprint by duplicate
>>> > copy in swap cache and frontswap backend so it shouldn't affect the stability.
>>>
>>> Thanks for explain this.
>>> I don't mean to say this option affects the stability, but that zswap
>>> only realize
>>> one option. Maybe it's better to realize both options for different workloads.
>>
>> "zswap only relize one option"
>> What does it mena? Sorry. I couldn't parse your intention. :)
>> You mean zswap should do something special to support frontswap_tmem_exclusive_gets?
>
> Yes. But I am not sure whether it is worth.
>
>>>
>>> >> this page is with PageSwapCache(page) and page_private(page) = entry.val
>>> >> 3. change this page data, and it become dirty
>>> >
>>> > If non-shared swapin page become redirty, it should remove the page from
>>> > swapcache. If shared swapin page become redirty, it should do CoW so it's a
>>> > new page so that it doesn't live in swap cache. It means it should have new
>>> > offset which is different with old's one for swap out.
>>> >
>>> > What's wrong with that?
>>>
>>> It is really not a right scene for duplicate store. And I can not think out one.
>>> If duplicate store is impossible, How about delete the handle code in zswap?
>>> If it does exist, I think there is a potential issue as I described.
>>
>> You mean "zswap_duplicate_entry"?
>> AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
>> he said that he didn't know exact reason but he saw that case during
>> experiement so copy the code peice from zcache.
>>
>> Do you see the case, too?
>
> Yes, I mean duplicate store.
> I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores",
> but I am still confused.
>
> I wrote a zcache variant which swaps compressed pages out to a swapfile.
> I did see that case when I tested it on an Android smartphone (ARMv7),

Why not test it based on zswap directly?
My suggestion is to do more and more stress testing against the current
zswap; if any bug is really triggered, or an unusual performance issue
shows up, then we dig it out and fix it.

--
Regards,
--Bob

2013-09-27 04:58:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG REPORT] ZSWAP: theoretical race condition issues

Hello Weijie,

On Thu, Sep 26, 2013 at 04:48:03PM +0800, Weijie Yang wrote:
> On Thu, Sep 26, 2013 at 3:57 PM, Minchan Kim <[email protected]> wrote:
> > On Thu, Sep 26, 2013 at 03:26:33PM +0800, Weijie Yang wrote:
> >> On Thu, Sep 26, 2013 at 1:58 PM, Minchan Kim <[email protected]> wrote:
> >> > Hello Weigie,
> >> >
> >> > On Wed, Sep 25, 2013 at 05:33:43PM +0800, Weijie Yang wrote:
> >> >> On Wed, Sep 25, 2013 at 4:31 PM, Bob Liu <[email protected]> wrote:
> >> >> > On Wed, Sep 25, 2013 at 4:09 PM, Weijie Yang <[email protected]> wrote:
> >> >> >> I think I find a new issue, for integrity of this mail thread, I reply
> >> >> >> to this mail.
> >> >> >>
> >> >> >> It is a concurrence issue either, when duplicate store and reclaim
> >> >> >> concurrentlly.
> >> >> >>
> >> >> >> zswap entry x with offset A is already stored in zswap backend.
> >> >> >> Consider the following scenario:
> >> >> >>
> >> >> >> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
> >> >> >>
> >> >> >> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> >> >> >> store finished. shrink_page_list() call __remove_mapping(), and now
> >> >> >> it is not in swap_cache
> >> >> >>
> >> >> >
> >> >> > But I don't think swap layer will call zswap with the same offset A.
> >> >>
> >> >> 1. store page of offset A in zswap
> >> >> 2. some time later, pagefault occur, load page data from zswap.
> >> >> But notice that zswap entry x is still in zswap because it is not
> >> >> frontswap_tmem_exclusive_gets_enabled.
> >> >
> >> > frontswap_tmem_exclusive_gets_enabled is just option to see tradeoff
> >> > between CPU burining by frequent swapout and memory footprint by duplicate
> >> > copy in swap cache and frontswap backend so it shouldn't affect the stability.
> >>
> >> Thanks for explain this.
> >> I don't mean to say this option affects the stability, but that zswap
> >> only realize
> >> one option. Maybe it's better to realize both options for different workloads.
> >
> > "zswap only relize one option"
> > What does it mena? Sorry. I couldn't parse your intention. :)
> > You mean zswap should do something special to support frontswap_tmem_exclusive_gets?
>
> Yes. But I am not sure whether it is worth.
>
> >>
> >> >> this page is with PageSwapCache(page) and page_private(page) = entry.val
> >> >> 3. change this page data, and it become dirty
> >> >
> >> > If non-shared swapin page become redirty, it should remove the page from
> >> > swapcache. If shared swapin page become redirty, it should do CoW so it's a
> >> > new page so that it doesn't live in swap cache. It means it should have new
> >> > offset which is different with old's one for swap out.
> >> >
> >> > What's wrong with that?
> >>
> >> It is really not a right scene for duplicate store. And I can not think out one.
> >> If duplicate store is impossible, How about delete the handle code in zswap?
> >> If it does exist, I think there is a potential issue as I described.
> >
> > You mean "zswap_duplicate_entry"?
> > AFAIR, I already had a question to Seth when zswap was born but AFAIRC,
> > he said that he didn't know exact reason but he saw that case during
> > experiement so copy the code peice from zcache.
> >
> > Do you see the case, too?
>
> Yes, I mean duplicate store.
> I check the /Documentation/vm/frontswap.txt, it mentions "duplicate stores",
> but I am still confused.

It seems that there are two Minchans on LKML.
The other Minchan, not me with my horrible memory, was the first to
figure it out a few months ago.

https://lkml.org/lkml/2013/1/31/3

/me slaps self.
I'd like to look into that issue more, but I don't have time right now.
Just FYI. ;-)

--
Kind regards,
Minchan Kim