2022-12-01 16:21:45

by Andreas Gruenbacher

Subject: [RFC 0/3] Turn iomap_page_ops into iomap_folio_ops

Hello,

we're seeing a race between journaled data writes and the shrinker on
gfs2. What's happening is that gfs2_iomap_page_done() is called after
the page has been unlocked, so try_to_free_buffers() can come in and
free the buffers while gfs2_iomap_page_done() is trying to add them to
the transaction. Not good.

This is a proposal to change iomap_page_ops so that page_prepare()
prepares the write and grabs the locked page, and page_done() unlocks
and puts that page again. While at it, this also converts the hooks
from pages to folios.
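
Roughly, the new hooks end up looking like this (just a sketch here;
patch 2 has the actual definition):

    struct iomap_folio_ops {
            /* prepare the write and return the locked folio to write to */
            struct folio *(*folio_prepare)(struct inode *inode, unsigned fgp,
                                           loff_t pos, unsigned len);
            /* unlock and put the folio once the write is done */
            void (*folio_done)(struct inode *inode, struct folio *folio,
                               loff_t pos, unsigned copied);
    };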

To move the pagecache_isize_extended() call in iomap_write_end() out of
the way, a new folio_may_straddle_isize() helper is introduced that
takes a locked folio. That is then used when the inode size is updated,
before the folio is unlocked.

I've also converted the other applicable pagecache_isize_extended()
users over to folio_may_straddle_isize(), namely generic_write_end(),
ext4_write_end(), and
ext4_journalled_write_end().

Any thoughts?

Thanks,
Andreas

Andreas Gruenbacher (3):
fs: Add folio_may_straddle_isize helper
iomap: Turn iomap_page_ops into iomap_folio_ops
gfs2: Fix race between shrinker and gfs2_iomap_folio_done

fs/buffer.c | 5 ++---
fs/ext4/inode.c | 13 +++++------
fs/gfs2/bmap.c | 39 +++++++++++++++++++++++---------
fs/iomap/buffered-io.c | 51 +++++++++++++++++++++---------------------
include/linux/iomap.h | 24 ++++++++++----------
include/linux/mm.h | 2 ++
mm/truncate.c | 34 ++++++++++++++++++++++++++++
7 files changed, 110 insertions(+), 58 deletions(-)

--
2.38.1


2022-12-01 16:23:00

by Andreas Gruenbacher

Subject: [RFC 3/3] gfs2: Fix race between shrinker and gfs2_iomap_folio_done

In gfs2_iomap_folio_done(), add the modified buffer heads to the current
transaction while the folio is still locked. Otherwise, the shrinker
can come in and free them before we get to gfs2_page_add_databufs().

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/bmap.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 18dcaa95408e..d8d9ee843ac9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -990,18 +990,17 @@ gfs2_iomap_folio_done(struct inode *inode, struct folio *folio,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);

- folio_unlock(folio);
-
if (!gfs2_is_stuffed(ip))
gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
copied);

+ folio_unlock(folio);
+ folio_put(folio);
+
if (tr->tr_num_buf_new)
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);

gfs2_trans_end(sdp);
-
- folio_put(folio);
}

static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
--
2.38.1

2022-12-01 18:12:54

by Andreas Gruenbacher

Subject: [RFC v2 2/3] iomap: Turn iomap_page_ops into iomap_folio_ops

Rename the iomap page_ops into folio_ops, and rename the operations
accordingly. Move looking up the folio into ->folio_prepare(), and
unlocking and putting the folio into ->folio_done(). We'll need the
added flexibility in gfs2.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/bmap.c | 40 +++++++++++++++++++++++++----------
fs/iomap/buffered-io.c | 48 +++++++++++++++++++++++-------------------
include/linux/iomap.h | 24 ++++++++++-----------
3 files changed, 67 insertions(+), 45 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 3bdb2c668a71..18dcaa95408e 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -959,36 +959,54 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
goto out;
}

-static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
- unsigned len)
+static struct folio *
+gfs2_iomap_folio_prepare(struct inode *inode, unsigned fgp,
+ loff_t pos, unsigned len)
{
unsigned int blockmask = i_blocksize(inode) - 1;
struct gfs2_sbd *sdp = GFS2_SB(inode);
unsigned int blocks;
+ struct folio *folio;
+ int ret;

blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
- return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
+ ret = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
+ if (ret)
+ return ERR_PTR(ret);
+
+ folio = __filemap_get_folio(inode->i_mapping, pos >> PAGE_SHIFT, fgp,
+ mapping_gfp_mask(inode->i_mapping));
+ if (!folio)
+ gfs2_trans_end(sdp);
+
+ return folio;
}

-static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
- unsigned copied, struct page *page)
+static void
+gfs2_iomap_folio_done(struct inode *inode, struct folio *folio,
+ loff_t pos, unsigned copied)
{
struct gfs2_trans *tr = current->journal_info;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);

- if (page && !gfs2_is_stuffed(ip))
- gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
+ folio_unlock(folio);
+
+ if (!gfs2_is_stuffed(ip))
+ gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
+ copied);

if (tr->tr_num_buf_new)
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);

gfs2_trans_end(sdp);
+
+ folio_put(folio);
}

-static const struct iomap_page_ops gfs2_iomap_page_ops = {
- .page_prepare = gfs2_iomap_page_prepare,
- .page_done = gfs2_iomap_page_done,
+static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
+ .folio_prepare = gfs2_iomap_folio_prepare,
+ .folio_done = gfs2_iomap_folio_done,
};

static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
@@ -1064,7 +1082,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
}

if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip))
- iomap->page_ops = &gfs2_iomap_page_ops;
+ iomap->folio_ops = &gfs2_iomap_folio_ops;
return 0;

out_trans_end:
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 96e643de32a0..9f1656f3df17 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -587,7 +587,7 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t len, struct folio **foliop)
{
- const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+ const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
struct folio *folio;
unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
@@ -606,17 +606,18 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (!mapping_large_folio_support(iter->inode->i_mapping))
len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));

- if (page_ops && page_ops->page_prepare) {
- status = page_ops->page_prepare(iter->inode, pos, len);
- if (status)
- return status;
+ if (folio_ops && folio_ops->folio_prepare) {
+ folio = folio_ops->folio_prepare(iter->inode, fgp, pos, len);
+ } else {
+ folio = __filemap_get_folio(iter->inode->i_mapping,
+ pos >> PAGE_SHIFT, fgp,
+ mapping_gfp_mask(iter->inode->i_mapping));
}
-
- folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
- fgp, mapping_gfp_mask(iter->inode->i_mapping));
- if (!folio) {
- status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
- goto out_no_page;
+ if (IS_ERR_OR_NULL(folio)) {
+ status = PTR_ERR(folio);
+ if (folio == NULL)
+ status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
+ return status;
}
if (pos + len > folio_pos(folio) + folio_size(folio))
len = folio_pos(folio) + folio_size(folio) - pos;
@@ -635,13 +636,13 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return 0;

out_unlock:
- folio_unlock(folio);
- folio_put(folio);
+ if (folio_ops && folio_ops->folio_done) {
+ folio_ops->folio_done(iter->inode, folio, pos, 0);
+ } else {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
iomap_write_failed(iter->inode, pos, len);
-
-out_no_page:
- if (page_ops && page_ops->page_done)
- page_ops->page_done(iter->inode, pos, 0, NULL);
return status;
}

@@ -691,7 +692,7 @@ static size_t iomap_write_end_inline(const struct iomap_iter *iter,
static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
size_t copied, struct folio *folio)
{
- const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
+ const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t old_size = iter->inode->i_size;
size_t ret;
@@ -715,10 +716,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
folio_may_straddle_isize(iter->inode, folio, old_size, pos);
}
- folio_unlock(folio);
- if (page_ops && page_ops->page_done)
- page_ops->page_done(iter->inode, pos, ret, &folio->page);
- folio_put(folio);
+ if (folio_ops && folio_ops->folio_done) {
+ folio_ops->folio_done(iter->inode, folio, pos, ret);
+ } else {
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
if (ret < len)
iomap_write_failed(iter->inode, pos + ret, len - ret);
return ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..9d3a6ad222cc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -76,7 +76,7 @@ struct vm_fault;
*/
#define IOMAP_NULL_ADDR -1ULL /* addr is not valid */

-struct iomap_page_ops;
+struct iomap_folio_ops;

struct iomap {
u64 addr; /* disk offset of mapping, bytes */
@@ -88,7 +88,7 @@ struct iomap {
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
void *private; /* filesystem private */
- const struct iomap_page_ops *page_ops;
+ const struct iomap_folio_ops *folio_ops;
};

static inline sector_t iomap_sector(const struct iomap *iomap, loff_t pos)
@@ -115,19 +115,19 @@ static inline bool iomap_inline_data_valid(const struct iomap *iomap)
}

/*
- * When a filesystem sets page_ops in an iomap mapping it returns, page_prepare
- * and page_done will be called for each page written to. This only applies to
- * buffered writes as unbuffered writes will not typically have pages
+ * When a filesystem sets folio_ops in an iomap mapping it returns, folio_prepare
+ * and folio_done will be called for each folio written to. This only applies to
+ * buffered writes as unbuffered writes will not typically have folios
* associated with them.
*
- * When page_prepare succeeds, page_done will always be called to do any
- * cleanup work necessary. In that page_done call, @page will be NULL if the
- * associated page could not be obtained.
+ * When folio_prepare succeeds, folio_done will always be called to do any
+ * cleanup work necessary.
*/
-struct iomap_page_ops {
- int (*page_prepare)(struct inode *inode, loff_t pos, unsigned len);
- void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
- struct page *page);
+struct iomap_folio_ops {
+ struct folio *(*folio_prepare)(struct inode *inode, unsigned fgp,
+ loff_t pos, unsigned len);
+ void (*folio_done)(struct inode *inode, struct folio *folio,
+ loff_t pos, unsigned copied);
};

/*
--
2.38.1

2022-12-01 18:12:54

by Andreas Gruenbacher

Subject: [RFC v2 1/3] fs: Add folio_may_straddle_isize helper

Add a folio_may_straddle_isize() helper as a replacement for
pagecache_isize_extended() when we have a locked folio.

Use the new helper in generic_write_end(), iomap_write_end(),
ext4_write_end(), and ext4_journalled_write_end().

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/buffer.c | 5 ++---
fs/ext4/inode.c | 13 ++++++-------
fs/iomap/buffered-io.c | 5 +----
include/linux/mm.h | 2 ++
mm/truncate.c | 34 ++++++++++++++++++++++++++++++++++
5 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d9c6d1fbb6dd..bbae1437994b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2164,16 +2164,15 @@ int generic_write_end(struct file *file, struct address_space *mapping,
* But it's important to update i_size while still holding page lock:
* page writeout could otherwise come in and zero beyond i_size.
*/
- if (pos + copied > inode->i_size) {
+ if (pos + copied > old_size) {
i_size_write(inode, pos + copied);
i_size_changed = true;
+ folio_may_straddle_isize(inode, page_folio(page), old_size, pos);
}

unlock_page(page);
put_page(page);

- if (old_size < pos)
- pagecache_isize_extended(inode, old_size, pos);
/*
* Don't mark the inode dirty under page lock. First, it unnecessarily
* makes the holding time of page lock longer. Second, it forces lock
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2b5ef1b64249..48e6b4716415 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1320,13 +1320,13 @@ static int ext4_write_end(struct file *file,
* If FS_IOC_ENABLE_VERITY is running on this inode, then Merkle tree
* blocks are being written past EOF, so skip the i_size update.
*/
- if (!verity)
+ if (!verity) {
i_size_changed = ext4_update_inode_size(inode, pos + copied);
+ folio_may_straddle_isize(inode, page_folio(page), old_size, pos);
+ }
unlock_page(page);
put_page(page);

- if (old_size < pos && !verity)
- pagecache_isize_extended(inode, old_size, pos);
/*
* Don't mark the inode dirty under page lock. First, it unnecessarily
* makes the holding time of page lock longer. Second, it forces lock
@@ -1432,16 +1432,15 @@ static int ext4_journalled_write_end(struct file *file,
if (!partial)
SetPageUptodate(page);
}
- if (!verity)
+ if (!verity) {
size_changed = ext4_update_inode_size(inode, pos + copied);
+ folio_may_straddle_isize(inode, page_folio(page), old_size, pos);
+ }
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
unlock_page(page);
put_page(page);

- if (old_size < pos && !verity)
- pagecache_isize_extended(inode, old_size, pos);
-
if (size_changed) {
ret2 = ext4_mark_inode_dirty(handle, inode);
if (!ret)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..96e643de32a0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -713,15 +713,12 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
if (pos + ret > old_size) {
i_size_write(iter->inode, pos + ret);
iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+ folio_may_straddle_isize(iter->inode, folio, old_size, pos);
}
folio_unlock(folio);
-
- if (old_size < pos)
- pagecache_isize_extended(iter->inode, old_size, pos);
if (page_ops && page_ops->page_done)
page_ops->page_done(iter->inode, pos, ret, &folio->page);
folio_put(folio);
-
if (ret < len)
iomap_write_failed(iter->inode, pos + ret, len - ret);
return ret;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..e1f03c9ed8df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1890,6 +1890,8 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,

extern void truncate_pagecache(struct inode *inode, loff_t new);
extern void truncate_setsize(struct inode *inode, loff_t newsize);
+void folio_may_straddle_isize(struct inode *inode, struct folio *folio,
+ loff_t old_size, loff_t start);
void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index c0be77e5c008..9a5d3c3c12d0 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -779,6 +779,40 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
}
EXPORT_SYMBOL(truncate_setsize);

+/**
+ * folio_may_straddle_isize - update pagecache after extending i_size
+ * @inode: inode for which i_size was extended
+ * @folio: folio to maybe mark read-only
+ * @old_size: original inode size
+ * @start: start of the write
+ *
+ * Handle extending an inode by a write that starts beyond the old inode size.
+ * If a block-aligned hole exists between the old inode size and the start of
+ * the write, we mark the folio read-only so that page_mkwrite() is called on
+ * the next write access to the page. That way, the filesystem can be sure
+ * that page_mkwrite() is called on the page before a user writes to the page
+ * via mmap.
+ *
+ * This function must be called while we still hold i_rwsem - this not only
+ * makes sure i_size is stable but also that userspace cannot observe the new
+ * i_size value before we are prepared to handle mmap writes there.
+ */
+void folio_may_straddle_isize(struct inode *inode, struct folio *folio,
+ loff_t old_size, loff_t start)
+{
+ unsigned int blocksize = i_blocksize(inode);
+
+ if (round_up(old_size, blocksize) >= round_down(start, blocksize))
+ return;
+
+ /*
+ * See clear_page_dirty_for_io() for details why folio_set_dirty()
+ * is needed.
+ */
+ if (folio_mkclean(folio))
+ folio_set_dirty(folio);
+}
+
/**
* pagecache_isize_extended - update pagecache after extension of i_size
* @inode: inode for which i_size was extended
--
2.38.1

2022-12-01 18:12:57

by Andreas Gruenbacher

Subject: [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops

Hi again,

[Same thing, but with the patches split correctly this time.]

we're seeing a race between journaled data writes and the shrinker on
gfs2. What's happening is that gfs2_iomap_page_done() is called after
the page has been unlocked, so try_to_free_buffers() can come in and
free the buffers while gfs2_iomap_page_done() is trying to add them to
the transaction. Not good.

This is a proposal to change iomap_page_ops so that page_prepare()
prepares the write and grabs the locked page, and page_done() unlocks
and puts that page again. While at it, this also converts the hooks
from pages to folios.

To move the pagecache_isize_extended() call in iomap_write_end() out of
the way, a new folio_may_straddle_isize() helper is introduced that
takes a locked folio. That is then used when the inode size is updated,
before the folio is unlocked.

I've also converted the other applicable pagecache_isize_extended()
users over to folio_may_straddle_isize(), namely generic_write_end(),
ext4_write_end(), and
ext4_journalled_write_end().

Any thoughts?

Thanks,
Andreas

Andreas Gruenbacher (3):
fs: Add folio_may_straddle_isize helper
iomap: Turn iomap_page_ops into iomap_folio_ops
gfs2: Fix race between shrinker and gfs2_iomap_folio_done

fs/buffer.c | 5 ++---
fs/ext4/inode.c | 13 +++++------
fs/gfs2/bmap.c | 39 +++++++++++++++++++++++---------
fs/iomap/buffered-io.c | 51 +++++++++++++++++++++---------------------
include/linux/iomap.h | 24 ++++++++++----------
include/linux/mm.h | 2 ++
mm/truncate.c | 34 ++++++++++++++++++++++++++++
7 files changed, 110 insertions(+), 58 deletions(-)

--
2.38.1

2022-12-01 18:12:57

by Andreas Gruenbacher

Subject: [RFC v2 3/3] gfs2: Fix race between shrinker and gfs2_iomap_folio_done

In gfs2_iomap_folio_done(), add the modified buffer heads to the current
transaction while the folio is still locked. Otherwise, the shrinker
can come in and free them before we get to gfs2_page_add_databufs().

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/bmap.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 18dcaa95408e..d8d9ee843ac9 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -990,18 +990,17 @@ gfs2_iomap_folio_done(struct inode *inode, struct folio *folio,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);

- folio_unlock(folio);
-
if (!gfs2_is_stuffed(ip))
gfs2_page_add_databufs(ip, &folio->page, offset_in_page(pos),
copied);

+ folio_unlock(folio);
+ folio_put(folio);
+
if (tr->tr_num_buf_new)
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);

gfs2_trans_end(sdp);
-
- folio_put(folio);
}

static const struct iomap_folio_ops gfs2_iomap_folio_ops = {
--
2.38.1

2022-12-01 21:32:50

by Dave Chinner

Subject: Re: [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops

On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote:
> Hi again,
>
> [Same thing, but with the patches split correctly this time.]
>
> we're seeing a race between journaled data writes and the shrinker on
> gfs2. What's happening is that gfs2_iomap_page_done() is called after
> the page has been unlocked, so try_to_free_buffers() can come in and
> free the buffers while gfs2_iomap_page_done() is trying to add them to
> the transaction. Not good.
>
> This is a proposal to change iomap_page_ops so that page_prepare()
> prepares the write and grabs the locked page, and page_done() unlocks
> and puts that page again. While at it, this also converts the hooks
> from pages to folios.
>
> To move the pagecache_isize_extended() call in iomap_write_end() out of
> the way, a new folio_may_straddle_isize() helper is introduced that
> takes a locked folio. That is then used when the inode size is updated,
> before the folio is unlocked.
>
> I've also converted the other applicable folio_may_straddle_isize()
> users, namely generic_write_end(), ext4_write_end(), and
> ext4_journalled_write_end().
>
> Any thoughts?

I doubt that moving page cache operations from the iomap core to
filesystem-specific callouts will be acceptable. I recently proposed
patches that added page cache walking to an XFS iomap callout to fix
a data corruption, but they were NAKd on the basis that iomap is
supposed to completely abstract away the folio and page cache
manipulations from the filesystem.

This patchset seems to be doing the same thing - moving page cache
and folio management directly into filesystem-specific callouts. Hence
I'm going to assume that the same architectural demarcation is
going to apply here, too...

FYI, there is already significant change committed to the iomap
write path in the current XFS tree as a result of the changes I
mention - there is stale IOMAP detection which adds a new page ops
method and adds new error paths with a locked folio in
iomap_write_begin().
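
(Specifically, that's the new ->iomap_valid() hook, roughly:

    bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);

which iomap_write_begin() now calls with the folio locked so that a
stale mapping can be detected and remapped.)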

And this other data corruption (and performance) fix for handling
zeroing over unwritten extents properly:

https://lore.kernel.org/linux-xfs/[email protected]/

changes the way folios are looked up and instantiated in the page
cache in iomap_write_begin(). It also adds new error conditions that
need to be returned to callers so as to implement conditional "folio
must be present and dirty" page cache zeroing from
iomap_zero_iter(). Those semantics would also have to be supported
by gfs2, and that greatly complicates modifying and testing iomap
core changes.

To avoid all this, can we simply move the ->page_done() callout in
the error path and iomap_write_end() to before we unlock the folio?
You've already done that for pagecache_isize_extended(), and I can't
see anything obvious in the gfs2 ->page_done callout that
would cause issues if it is called with a locked dirty folio...
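
i.e. something like this at the end of iomap_write_end() (untested,
just to illustrate the ordering):

        if (pos + ret > old_size) {
                i_size_write(iter->inode, pos + ret);
                iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
                folio_may_straddle_isize(iter->inode, folio, old_size, pos);
        }
        if (page_ops && page_ops->page_done)
                page_ops->page_done(iter->inode, pos, ret, &folio->page);
        folio_unlock(folio);
        folio_put(folio);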

Cheers,

Dave.
--
Dave Chinner
[email protected]

2022-12-02 01:59:17

by Andreas Gruenbacher

Subject: Re: [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops

On Thu, Dec 1, 2022 at 10:30 PM Dave Chinner <[email protected]> wrote:
> On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote:
> > Hi again,
> >
> > [Same thing, but with the patches split correctly this time.]
> >
> > we're seeing a race between journaled data writes and the shrinker on
> > gfs2. What's happening is that gfs2_iomap_page_done() is called after
> > the page has been unlocked, so try_to_free_buffers() can come in and
> > free the buffers while gfs2_iomap_page_done() is trying to add them to
> > the transaction. Not good.
> >
> > This is a proposal to change iomap_page_ops so that page_prepare()
> > prepares the write and grabs the locked page, and page_done() unlocks
> > and puts that page again. While at it, this also converts the hooks
> > from pages to folios.
> >
> > To move the pagecache_isize_extended() call in iomap_write_end() out of
> > the way, a new folio_may_straddle_isize() helper is introduced that
> > takes a locked folio. That is then used when the inode size is updated,
> > before the folio is unlocked.
> >
> > I've also converted the other applicable folio_may_straddle_isize()
> > users, namely generic_write_end(), ext4_write_end(), and
> > ext4_journalled_write_end().
> >
> > Any thoughts?
>
> I doubt that moving page cache operations from the iomap core to
> filesystem-specific callouts will be acceptable. I recently proposed
> patches that added page cache walking to an XFS iomap callout to fix
> a data corruption, but they were NAKd on the basis that iomap is
> supposed to completely abstract away the folio and page cache
> manipulations from the filesystem.

Right. The resulting code is really quite disgusting, for a
fundamentalist dream of abstraction.

> This patchset seems to be doing the same thing - moving page cache
> and folio management directly into filesystem-specific callouts. Hence
> I'm going to assume that the same architectural demarcation is
> going to apply here, too...
>
> FYI, there is already significant change committed to the iomap
> write path in the current XFS tree as a result of the changes I
> mention - there is stale IOMAP detection which adds a new page ops
> method and adds new error paths with a locked folio in
> iomap_write_begin().

That would have belonged on the iomap-for-next branch rather than in
the middle of a bunch of xfs commits.

> And this other data corruption (and performance) fix for handling
> zeroing over unwritten extents properly:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> changes the way folios are looked up and instantiated in the page
> cache in iomap_write_begin(). It also adds new error conditions that
> need to be returned to callers so as to implement conditional "folio
> must be present and dirty" page cache zeroing from
> iomap_zero_iter(). Those semantics would also have to be supported
> by gfs2, and that greatly complicates modifying and testing iomap
> core changes.
>
> To avoid all this, can we simply move the ->page_done() callout in
> the error path and iomap_write_end() to before we unlock the folio?
> You've already done that for pagecache_isize_extended(), and I can't
> see anything obvious in the gfs2 ->page_done callout that
> would cause issues if it is called with a locked dirty folio...

Yes, I guess we can do that once pagecache_isize_extended() is
replaced by folio_may_straddle_isize().

Can people please scrutinize the math in folio_may_straddle_isize() in
particular?
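
Here's how I expect the check to behave, with a block size of 4096:

  old_size = 5000, start = 7000:
    round_up(5000, 4096) = 8192, round_down(7000, 4096) = 4096
    => 8192 >= 4096, so there is no block-aligned hole and the folio is
       left alone

  old_size = 5000, start = 13000:
    round_up(5000, 4096) = 8192, round_down(13000, 4096) = 12288
    => 8192 < 12288, so the block from 8192 to 12288 is a hole; the
       folio gets write-protected and page_mkwrite() will be called on
       the next write to it via mmap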

Thanks,
Andreas

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2022-12-05 23:08:47

by Dave Chinner

Subject: Re: [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops

On Fri, Dec 02, 2022 at 02:54:00AM +0100, Andreas Gruenbacher wrote:
> On Thu, Dec 1, 2022 at 10:30 PM Dave Chinner <[email protected]> wrote:
> > On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote:
> > > Hi again,
> > >
> > > [Same thing, but with the patches split correctly this time.]
> > >
> > > we're seeing a race between journaled data writes and the shrinker on
> > > gfs2. What's happening is that gfs2_iomap_page_done() is called after
> > > the page has been unlocked, so try_to_free_buffers() can come in and
> > > free the buffers while gfs2_iomap_page_done() is trying to add them to
> > > the transaction. Not good.
> > >
> > > This is a proposal to change iomap_page_ops so that page_prepare()
> > > prepares the write and grabs the locked page, and page_done() unlocks
> > > and puts that page again. While at it, this also converts the hooks
> > > from pages to folios.
> > >
> > > To move the pagecache_isize_extended() call in iomap_write_end() out of
> > > the way, a new folio_may_straddle_isize() helper is introduced that
> > > takes a locked folio. That is then used when the inode size is updated,
> > > before the folio is unlocked.
> > >
> > > I've also converted the other applicable folio_may_straddle_isize()
> > > users, namely generic_write_end(), ext4_write_end(), and
> > > ext4_journalled_write_end().
> > >
> > > Any thoughts?
> >
> > I doubt that moving page cache operations from the iomap core to
> > filesystem-specific callouts will be acceptable. I recently proposed
> > patches that added page cache walking to an XFS iomap callout to fix
> > a data corruption, but they were NAKd on the basis that iomap is
> > supposed to completely abstract away the folio and page cache
> > manipulations from the filesystem.
>
> Right. The resulting code is really quite disgusting, for a
> fundamentalist dream of abstraction.
>
> > This patchset seems to be doing the same thing - moving page cache
> > and folio management directly into filesystem-specific callouts. Hence
> > I'm going to assume that the same architectural demarcation is
> > going to apply here, too...
> >
> > FYI, there is already significant change committed to the iomap
> > write path in the current XFS tree as a result of the changes I
> > mention - there is stale IOMAP detection which adds a new page ops
> > method and adds new error paths with a locked folio in
> > iomap_write_begin().
>
> That would have belonged on the iomap-for-next branch rather than in
> the middle of a bunch of xfs commits.

Damned if you do, damned if you don't.

There were non-trivial cross dependencies between XFS and iomap in
that patch set. The initial IOMAP_F_STALE infrastructure needed XFS
changes first, otherwise it could deadlock at ENOSPC on write page
faults. i.e. the iomap change in isolation broke stuff, so we're
forced to either carry XFS changes in iomap or iomap changes in XFS
so that there are no regressions in a given tree.

Then we had to move XFS functionality to iomap to fix another data
corruption that the IOMAP_F_STALE infrastructure exposed in XFS via
generic/346. Once the code was moved, we could build it up into
the page cache scanning functionality in iomap. And only then could
we add the XFS IOMAP_F_STALE validation to XFS to solve the original
data corruption that started all this off.

IOWs, there were so many cross dependencies between XFS and iomap
that it was largely impossible to break it up into two separate sets
of independent patches that didn't cause regressions in one or the
other tree. And in the end, we'd still have to merge the iomap tree
into XFS or vice versa to actually test that the data corruption fix
worked.

In situations like this, we commonly take the entire series into one
of the two trees rather than make a whole lot more work for
ourselves by trying to separate them out. And in this case, because
it was XFS data corruption and race conditions that needed fixing,
it made sense to take it through the XFS tree so that it gets
coverage from all the XFS testing that happens - the iomap tree gets
a lot less early coverage than the XFS tree...

> > And this other data corruption (and performance) fix for handling
> > zeroing over unwritten extents properly:
> >
> > https://lore.kernel.org/linux-xfs/[email protected]/
> >
> > changes the way folios are looked up and instantiated in the page
> > cache in iomap_write_begin(). It also adds new error conditions that
> > need to be returned to callers so as to implement conditional "folio
> > must be present and dirty" page cache zeroing from
> > iomap_zero_iter(). Those semantics would also have to be supported
> > by gfs2, and that greatly complicates modifying and testing iomap
> > core changes.
> >
> > To avoid all this, can we simply move the ->page_done() callout in
> > the error path and iomap_write_end() to before we unlock the folio?
> > You've already done that for pagecache_isize_extended(), and I can't
> > see anything obvious in the gfs2 ->page_done callout that
> > would cause issues if it is called with a locked dirty folio...
>
> Yes, I guess we can do that once pagecache_isize_extended() is
> replaced by folio_may_straddle_isize().
>
> Can people please scrutinize the math in folio_may_straddle_isize() in
> particular?

I'll look at it more closely in the next couple of days.

-Dave.
--
Dave Chinner
[email protected]