On Fri, Nov 18, 2016 at 01:26:59PM -0800, Hugh Dickins wrote:
> On Fri, 18 Nov 2016, Minchan Kim wrote:
> > On Thu, Nov 17, 2016 at 08:35:10PM -0800, Hugh Dickins wrote:
> > >
> > > Maybe add SWP_STABLE_WRITES in include/linux/swap.h, and set that
> > > in swap_info->flags according to bdi_cap_stable_pages_required(),
> > > leaving mapping->host itself NULL as before?
> >
> > The problem with the approach is that we need to get swap_info_struct
> > in reuse_swap_page so maybe, every caller should pass swp_entry_t
> > into reuse_swap_page. It would be no problem if swap slot is really
> > referenced the page(IOW, pte is real swp_entry_t) but some cases
> > where swap slot is already empty but the page remains in only
> > swap cache, we cannot pass swp_entry_t which means that we cannot
> > get swap_info_struct.
>
> I don't see the problem: if the page is PageSwapCache (and page
> lock is held), then the swp_entry_t is there in page->private:
> see page_swapcount(), which reuse_swap_page() just called.
Right you are!!
>
> >
> > So, if I didn't miss, another option I can imagine is to move
> > SWP_STABLE_WRITES to address_space->flags as AS_STABLE_WRITES.
> > With that, we can always get the information without passing
> > swp_entry_t. Is there any better idea?
>
> I think what you suggest below would work fine (and be quicker
> than looking up the swap_info again): but is horribly misleading
> for anyone else interested in stable writes, for whom the info is
> kept in the bdi, and not in this new mapping flag.
>
> So I'd much prefer that you keep the swap special case inside the
> world of swap, with a SWP_STABLE_WRITES flag. Maybe unfold
> page_swapcount() inside reuse_swap_page(), so that it only
> needs a single lookup (or perhaps I'm optimizing prematurely).
>
I toally agree.
Here is new one.
>From 90b43e41a9fd8091e39246add583886c23360cdd Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Fri, 11 Nov 2016 15:02:57 +0900
Subject: [PATCH v2] mm: support anonymous stable page
For developemnt for zram-swap asynchronous writeback, I found
strange corruption of compressed page. With investigation, it
reveals currently stable page doesn't support anonymous page.
IOW, reuse_swap_page can reuse the page without waiting
writeback completion so that it can corrupt data during
zram compression. It can affect every swap device which supports
asynchronous writeback and CRC checking as well as zRAM.
Unfortunately, reuse_swap_page should be atomic so that we
cannot wait on writeback in there so the approach in this patch
is simply return false if we found it needs stable page.
Although it increases memory footprint temporarily, it happens
rarely and it should be reclaimed easily althoug it happened.
Also, It would be better than waiting of IO completion, which
is critial path for application latency.
Cc: Hugh Dickins <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
* from v1
* use swap_info_struct instead of swapper_space->host inode - Hugh
include/linux/swap.h | 3 ++-
mm/swapfile.c | 20 +++++++++++++++++++-
2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523cefb9b..55ff5593c193 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -150,8 +150,9 @@ enum {
SWP_FILE = (1 << 7), /* set after swap_activate success */
SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
+ SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */
/* add others here before... */
- SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */
+ SWP_SCANNING = (1 << 11), /* refcount in scan_swap_map */
};
#define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2210de290b54..66bc330c0b65 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -943,11 +943,25 @@ bool reuse_swap_page(struct page *page, int *total_mapcount)
count = page_trans_huge_mapcount(page, total_mapcount);
if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page);
- if (count == 1 && !PageWriteback(page)) {
+ if (count != 1)
+ goto out;
+ if (!PageWriteback(page)) {
delete_from_swap_cache(page);
SetPageDirty(page);
+ } else {
+ swp_entry_t entry;
+ struct swap_info_struct *p;
+
+ entry.val = page_private(page);
+ p = swap_info_get(entry);
+ if (p->flags & SWP_STABLE_WRITES) {
+ spin_unlock(&p->lock);
+ return false;
+ }
+ spin_unlock(&p->lock);
}
}
+out:
return count <= 1;
}
@@ -2447,6 +2461,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = -ENOMEM;
goto bad_swap;
}
+
+ if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
+ p->flags |= SWP_STABLE_WRITES;
+
if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
int cpu;
--
2.7.4
On Mon, 21 Nov 2016, Minchan Kim wrote:
> From: Minchan Kim <[email protected]>
> Date: Fri, 11 Nov 2016 15:02:57 +0900
> Subject: [PATCH v2] mm: support anonymous stable page
>
> For developemnt for zram-swap asynchronous writeback, I found
> strange corruption of compressed page. With investigation, it
> reveals currently stable page doesn't support anonymous page.
> IOW, reuse_swap_page can reuse the page without waiting
> writeback completion so that it can corrupt data during
> zram compression. It can affect every swap device which supports
> asynchronous writeback and CRC checking as well as zRAM.
>
> Unfortunately, reuse_swap_page should be atomic so that we
> cannot wait on writeback in there so the approach in this patch
> is simply return false if we found it needs stable page.
> Although it increases memory footprint temporarily, it happens
> rarely and it should be reclaimed easily althoug it happened.
> Also, It would be better than waiting of IO completion, which
> is critial path for application latency.
>
> Cc: Hugh Dickins <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Looks good, thanks: we can always optimize away that little overhead
in the PageWriteback case, if it ever shows up in someone's testing.
Andrew might ask if we should Cc stable (haha): I think we agree
that it's a defect we've been aware of ever since stable pages were
first proposed, but nobody has actually been troubled by it before
your async zram development: so, you're right to be fixing it ahead
of your zram changes, but we don't see a call for backporting.
> ---
> * from v1
> * use swap_info_struct instead of swapper_space->host inode - Hugh
>
> include/linux/swap.h | 3 ++-
> mm/swapfile.c | 20 +++++++++++++++++++-
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a56523cefb9b..55ff5593c193 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -150,8 +150,9 @@ enum {
> SWP_FILE = (1 << 7), /* set after swap_activate success */
> SWP_AREA_DISCARD = (1 << 8), /* single-time swap area discards */
> SWP_PAGE_DISCARD = (1 << 9), /* freed swap page-cluster discards */
> + SWP_STABLE_WRITES = (1 << 10), /* no overwrite PG_writeback pages */
> /* add others here before... */
> - SWP_SCANNING = (1 << 10), /* refcount in scan_swap_map */
> + SWP_SCANNING = (1 << 11), /* refcount in scan_swap_map */
> };
>
> #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2210de290b54..66bc330c0b65 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -943,11 +943,25 @@ bool reuse_swap_page(struct page *page, int *total_mapcount)
> count = page_trans_huge_mapcount(page, total_mapcount);
> if (count <= 1 && PageSwapCache(page)) {
> count += page_swapcount(page);
> - if (count == 1 && !PageWriteback(page)) {
> + if (count != 1)
> + goto out;
> + if (!PageWriteback(page)) {
> delete_from_swap_cache(page);
> SetPageDirty(page);
> + } else {
> + swp_entry_t entry;
> + struct swap_info_struct *p;
> +
> + entry.val = page_private(page);
> + p = swap_info_get(entry);
> + if (p->flags & SWP_STABLE_WRITES) {
> + spin_unlock(&p->lock);
> + return false;
> + }
> + spin_unlock(&p->lock);
> }
> }
> +out:
> return count <= 1;
> }
>
> @@ -2447,6 +2461,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> error = -ENOMEM;
> goto bad_swap;
> }
> +
> + if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
> + p->flags |= SWP_STABLE_WRITES;
> +
> if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
> int cpu;
>
> --
> 2.7.4
Hi Hugh,
On Mon, Nov 21, 2016 at 07:46:28PM -0800, Hugh Dickins wrote:
> On Mon, 21 Nov 2016, Minchan Kim wrote:
> > From: Minchan Kim <[email protected]>
> > Date: Fri, 11 Nov 2016 15:02:57 +0900
> > Subject: [PATCH v2] mm: support anonymous stable page
> >
> > For developemnt for zram-swap asynchronous writeback, I found
> > strange corruption of compressed page. With investigation, it
> > reveals currently stable page doesn't support anonymous page.
> > IOW, reuse_swap_page can reuse the page without waiting
> > writeback completion so that it can corrupt data during
> > zram compression. It can affect every swap device which supports
> > asynchronous writeback and CRC checking as well as zRAM.
> >
> > Unfortunately, reuse_swap_page should be atomic so that we
> > cannot wait on writeback in there so the approach in this patch
> > is simply return false if we found it needs stable page.
> > Although it increases memory footprint temporarily, it happens
> > rarely and it should be reclaimed easily althoug it happened.
> > Also, It would be better than waiting of IO completion, which
> > is critial path for application latency.
> >
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Darrick J. Wong <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Acked-by: Hugh Dickins <[email protected]>
Thanks!
>
> Looks good, thanks: we can always optimize away that little overhead
> in the PageWriteback case, if it ever shows up in someone's testing.
Yeb.
>
> Andrew might ask if we should Cc stable (haha): I think we agree
> that it's a defect we've been aware of ever since stable pages were
> first proposed, but nobody has actually been troubled by it before
> your async zram development: so, you're right to be fixing it ahead
> of your zram changes, but we don't see a call for backporting.
I thought so until I see your comment. However, I checked again
and found it seems a ancient bug since zram birth.
swap_writepage unlock the page right before submitting bio while
it keeps the lock during rw_page operation during bdev_write_page.
So, if zram_rw_page fails(e.g, -ENOMEM) and then fallback to
submit_bio in __swap_writepage, the problem can occur.
Hmm, I will resend patchset with zram fix part with marking
the stable.
Thanks, Hugh!
On Tue, 22 Nov 2016, Minchan Kim wrote:
> On Mon, Nov 21, 2016 at 07:46:28PM -0800, Hugh Dickins wrote:
> >
> > Andrew might ask if we should Cc stable (haha): I think we agree
> > that it's a defect we've been aware of ever since stable pages were
> > first proposed, but nobody has actually been troubled by it before
> > your async zram development: so, you're right to be fixing it ahead
> > of your zram changes, but we don't see a call for backporting.
>
> I thought so until I see your comment. However, I checked again
> and found it seems a ancient bug since zram birth.
> swap_writepage unlock the page right before submitting bio while
> it keeps the lock during rw_page operation during bdev_write_page.
> So, if zram_rw_page fails(e.g, -ENOMEM) and then fallback to
> submit_bio in __swap_writepage, the problem can occur.
It's not clear to me why that matters. If it drives zram mad
to the point of crashing the kernel, yes, that would matter. But
if it just places incomprehensible or mis-CRCed data on the device,
who cares? The reused swap page is marked dirty, and nobody should
be reading the stale data back off swap. If you do resend with a
stable tag, please make clear why it matters.
Hugh
>
> Hmm, I will resend patchset with zram fix part with marking
> the stable.
>
> Thanks, Hugh!
Hi Hugh,
On Tue, Nov 22, 2016 at 08:43:54PM -0800, Hugh Dickins wrote:
> On Tue, 22 Nov 2016, Minchan Kim wrote:
> > On Mon, Nov 21, 2016 at 07:46:28PM -0800, Hugh Dickins wrote:
> > >
> > > Andrew might ask if we should Cc stable (haha): I think we agree
> > > that it's a defect we've been aware of ever since stable pages were
> > > first proposed, but nobody has actually been troubled by it before
> > > your async zram development: so, you're right to be fixing it ahead
> > > of your zram changes, but we don't see a call for backporting.
> >
> > I thought so until I see your comment. However, I checked again
> > and found it seems a ancient bug since zram birth.
> > swap_writepage unlock the page right before submitting bio while
> > it keeps the lock during rw_page operation during bdev_write_page.
> > So, if zram_rw_page fails(e.g, -ENOMEM) and then fallback to
> > submit_bio in __swap_writepage, the problem can occur.
>
> It's not clear to me why that matters. If it drives zram mad
> to the point of crashing the kernel, yes, that would matter. But
> if it just places incomprehensible or mis-CRCed data on the device,
> who cares? The reused swap page is marked dirty, and nobody should
> be reading the stale data back off swap. If you do resend with a
> stable tag, please make clear why it matters.
Your comment makes me think again. For old zram, it would be not a
problem. Thanks for the hint, Hugh!
However, it makes 4.7 kernel crash with per-cpu stream feature
introduced in zram to increase cache hit ratio.
The problem is it tries to compress page and then get compressed size.
With that size, it allocates buffer via zsmalloc but it could be
failed easily due to limited gfp_flag in per-cpu context so zram
retry to allocate buffer out of per-cpu context with more soft gfp
flag. If it get successfully, it retry to compress the page again
and copy the compressed data to the buffer allocated in advance.
During the operations, if the content is changed, it means
compressed size could be different so that buffer size allocated
in first trial is not vaild any more. It ends up buffer-overrun
so that zsmalloc free object chaining will be broken and go crash.
So, if we want to fix it really, it should go stable for v4.7,
at least.
Unfortunately, zram has reset feature which means zram shrink
the disksize to zero and then it should revalidate disk where
it will reset BDI_CAP_STABLE_WRITES. now zRAM cannot do it atomically
so someone can miss BDI_CAP_STABLE_WRITES of /dev/zram0.
I need to redesign the locking before supporting stable page of
zram so now I realize it's hard to reach stable tree, maybe.
I will think over with more time.
Thanks for always the helpful comment!