2018-06-19 16:41:30

by Christoph Hellwig

[permalink] [raw]
Subject: iomap preparations for GFS2 v3

Hi all,

this is a slight rework of the patches from Andreas to prepare for gfs2
using the iomap code.

Note: I'd like to start with an immutable branch for iomap patches in either
the XFS tree or a tree of mine at the beginning of the merge window so that
we have a common base for the GFS2 and XFS iomap work.

Changes since v2:
- add another patch from Andreas to provide private data for the file
system
- revert the unioning of bdev and dax_dev for now, needs further work

Changes since v1:
- move code to a different spot in iomap.c to prefer for readpages
inline data support
- add a forward declaration for struct page to iomap.h
- fix a typo in the dax_dev/bdev unioning
- fix a comment typo


2018-06-19 16:41:32

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 2/6] iomap: mark newly allocated buffer heads as new

From: Andreas Gruenbacher <[email protected]>

In iomap_to_bh, not only mark buffer heads in IOMAP_UNWRITTEN maps as
new, but also buffer heads in IOMAP_MAPPED maps with the IOMAP_F_NEW
flag set. This will be used by filesystems like gfs2, which allocate
blocks in iomap->begin.

Minor corrections to the comment for IOMAP_UNWRITTEN maps.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/buffer.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index aba2a948b235..c8c2b7d8b8d6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1900,15 +1900,16 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
break;
case IOMAP_UNWRITTEN:
/*
- * For unwritten regions, we always need to ensure that
- * sub-block writes cause the regions in the block we are not
- * writing to are zeroed. Set the buffer as new to ensure this.
+ * For unwritten regions, we always need to ensure that regions
+ * in the block we are not writing to are zeroed. Mark the
+ * buffer as new to ensure this.
*/
set_buffer_new(bh);
set_buffer_unwritten(bh);
/* FALLTHRU */
case IOMAP_MAPPED:
- if (offset >= i_size_read(inode))
+ if ((iomap->flags & IOMAP_F_NEW) ||
+ offset >= i_size_read(inode))
set_buffer_new(bh);
bh->b_blocknr = (iomap->addr + offset - iomap->offset) >>
inode->i_blkbits;
--
2.17.1

2018-06-19 16:41:35

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 5/6] iomap: add a page_done callback

This will be used by gfs2 to attach data to transactions for the journaled
data mode. But the concept is generic enough that we might be able to
use it for other purposes like encryption/integrity post-processing in the
future.

Based on a patch from Andreas Gruenbacher.

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/iomap.c | 3 +++
include/linux/iomap.h | 9 +++++++++
2 files changed, 12 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 4aecd7c5dbd8..a1f71e64ea49 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -201,6 +201,9 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
copied, page, NULL);
}

+ if (iomap->page_done)
+ iomap->page_done(inode, pos, copied, page, iomap);
+
if (ret < len)
iomap_write_failed(inode, pos, len);
return ret;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 10d6cff7f69a..45f43865b0f0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -9,6 +9,7 @@ struct fiemap_extent_info;
struct inode;
struct iov_iter;
struct kiocb;
+struct page;
struct vm_area_struct;
struct vm_fault;

@@ -56,6 +57,14 @@ struct iomap {
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
+
+ /*
+ * Called when finished processing a page in the mapping returned in
+ * this iomap. At least for now this is only supported in the buffered
+ * write path.
+ */
+ void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
+ struct page *page, struct iomap *iomap);
};

/*
--
2.17.1

2018-06-19 16:41:36

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 6/6] iomap: add private pointer to struct iomap

From: Andreas Gruenbacher <[email protected]>

Add a private pointer to struct iomap to allow filesystems to pass data
from iomap_begin to iomap_end. Will be used by gfs2 for passing on the
on-disk inode buffer head.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
include/linux/iomap.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 45f43865b0f0..1f36523d448a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -57,6 +57,7 @@ struct iomap {
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
void *inline_data;
+ void *private; /* filesystem private */

/*
* Called when finished processing a page in the mapping returned in
--
2.17.1

2018-06-19 16:41:33

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 3/6] iomap: complete partial direct I/O writes synchronously

From: Andreas Gruenbacher <[email protected]>

According to xfstest generic/240, applications seem to expect direct I/O
writes to either complete as a whole or to fail; short direct I/O writes
are apparently not appreciated. This means that when only part of an
asynchronous direct I/O write succeeds, we can either fail the entire
write, or we can wait for the partial write to complete and retry the
remaining write as buffered I/O. The old __blockdev_direct_IO helper
has code for waiting for partial writes to complete; the new
iomap_dio_rw iomap helper does not.

The above mentioned fallback mode is needed for gfs2, which doesn't
allow block allocations under direct I/O to avoid taking cluster-wide
exclusive locks. As a consequence, an asynchronous direct I/O write to
a file range that contains a hole will result in a short write. In that
case, wait for the short write to complete to allow gfs2 to recover.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/iomap.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 77397b5a96ef..9c454459a1e9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -811,6 +811,7 @@ struct iomap_dio {
atomic_t ref;
unsigned flags;
int error;
+ bool wait_for_completion;

union {
/* used during submission and for synchronous completion: */
@@ -914,9 +915,8 @@ static void iomap_dio_bio_end_io(struct bio *bio)
iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));

if (atomic_dec_and_test(&dio->ref)) {
- if (is_sync_kiocb(dio->iocb)) {
+ if (dio->wait_for_completion) {
struct task_struct *waiter = dio->submit.waiter;
-
WRITE_ONCE(dio->submit.waiter, NULL);
wake_up_process(waiter);
} else if (dio->flags & IOMAP_DIO_WRITE) {
@@ -1131,13 +1131,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio->end_io = end_io;
dio->error = 0;
dio->flags = 0;
+ dio->wait_for_completion = is_sync_kiocb(iocb);

dio->submit.iter = iter;
- if (is_sync_kiocb(iocb)) {
- dio->submit.waiter = current;
- dio->submit.cookie = BLK_QC_T_NONE;
- dio->submit.last_queue = NULL;
- }
+ dio->submit.waiter = current;
+ dio->submit.cookie = BLK_QC_T_NONE;
+ dio->submit.last_queue = NULL;

if (iov_iter_rw(iter) == READ) {
if (pos >= dio->i_size)
@@ -1187,7 +1186,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio_warn_stale_pagecache(iocb->ki_filp);
ret = 0;

- if (iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb) &&
+ if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
!inode->i_sb->s_dio_done_wq) {
ret = sb_init_dio_done_wq(inode->i_sb);
if (ret < 0)
@@ -1202,8 +1201,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
iomap_dio_actor);
if (ret <= 0) {
/* magic error code to fall back to buffered I/O */
- if (ret == -ENOTBLK)
+ if (ret == -ENOTBLK) {
+ dio->wait_for_completion = true;
ret = 0;
+ }
break;
}
pos += ret;
@@ -1224,7 +1225,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio->flags &= ~IOMAP_DIO_NEED_SYNC;

if (!atomic_dec_and_test(&dio->ref)) {
- if (!is_sync_kiocb(iocb))
+ if (!dio->wait_for_completion)
return -EIOCBQUEUED;

for (;;) {
--
2.17.1

2018-06-19 16:41:34

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 4/6] iomap: generic inline data handling

From: Andreas Gruenbacher <[email protected]>

Add generic inline data handling by adding a pointer to the inline data
region to struct iomap. When handling a buffered IOMAP_INLINE write,
iomap_write_begin will copy the current inline data from the inline data
region into the page cache, and iomap_write_end will copy the changes in
the page cache back to the inline data region.

This doesn't cover inline data reads and direct I/O yet because so far,
we have no users.

Signed-off-by: Andreas Gruenbacher <[email protected]>
[hch: small cleanups to better fit in with other iomap work]
Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/iomap.c | 62 ++++++++++++++++++++++++++++++++++++++-----
include/linux/iomap.h | 1 +
2 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 9c454459a1e9..4aecd7c5dbd8 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -103,6 +103,26 @@ iomap_sector(struct iomap *iomap, loff_t pos)
return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
}

+static void
+iomap_read_inline_data(struct inode *inode, struct page *page,
+ struct iomap *iomap)
+{
+ size_t size = i_size_read(inode);
+ void *addr;
+
+ if (PageUptodate(page))
+ return;
+
+ BUG_ON(page->index);
+ BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
+
+ addr = kmap_atomic(page);
+ memcpy(addr, iomap->inline_data, size);
+ memset(addr + size, 0, PAGE_SIZE - size);
+ kunmap_atomic(addr);
+ SetPageUptodate(page);
+}
+
static void
iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
{
@@ -133,7 +153,11 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
if (!page)
return -ENOMEM;

- status = __block_write_begin_int(page, pos, len, NULL, iomap);
+ if (iomap->type == IOMAP_INLINE)
+ iomap_read_inline_data(inode, page, iomap);
+ else
+ status = __block_write_begin_int(page, pos, len, NULL, iomap);
+
if (unlikely(status)) {
unlock_page(page);
put_page(page);
@@ -146,14 +170,37 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
return status;
}

+static int
+iomap_write_end_inline(struct inode *inode, struct page *page,
+ struct iomap *iomap, loff_t pos, unsigned copied)
+{
+ void *addr;
+
+ WARN_ON_ONCE(!PageUptodate(page));
+ BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
+
+ addr = kmap_atomic(page);
+ memcpy(iomap->inline_data + pos, addr + pos, copied);
+ kunmap_atomic(addr);
+
+ mark_inode_dirty(inode);
+ __generic_write_end(inode, pos, copied, page);
+ return copied;
+}
+
static int
iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
- unsigned copied, struct page *page)
+ unsigned copied, struct page *page, struct iomap *iomap)
{
int ret;

- ret = generic_write_end(NULL, inode->i_mapping, pos, len,
- copied, page, NULL);
+ if (iomap->type == IOMAP_INLINE) {
+ ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
+ } else {
+ ret = generic_write_end(NULL, inode->i_mapping, pos, len,
+ copied, page, NULL);
+ }
+
if (ret < len)
iomap_write_failed(inode, pos, len);
return ret;
@@ -208,7 +255,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,

flush_dcache_page(page);

- status = iomap_write_end(inode, pos, bytes, copied, page);
+ status = iomap_write_end(inode, pos, bytes, copied, page,
+ iomap);
if (unlikely(status < 0))
break;
copied = status;
@@ -302,7 +350,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t length, void *data,

WARN_ON_ONCE(!PageUptodate(page));

- status = iomap_write_end(inode, pos, bytes, bytes, page);
+ status = iomap_write_end(inode, pos, bytes, bytes, page, iomap);
if (unlikely(status <= 0)) {
if (WARN_ON_ONCE(status == 0))
return -EIO;
@@ -354,7 +402,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
zero_user(page, offset, bytes);
mark_page_accessed(page);

- return iomap_write_end(inode, pos, bytes, bytes, page);
+ return iomap_write_end(inode, pos, bytes, bytes, page, iomap);
}

static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index a044a824da85..10d6cff7f69a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -55,6 +55,7 @@ struct iomap {
u16 flags; /* flags for mapping */
struct block_device *bdev; /* block device for I/O */
struct dax_device *dax_dev; /* dax_dev for dax operations */
+ void *inline_data;
};

/*
--
2.17.1

2018-06-19 16:41:31

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 1/6] fs: factor out a __generic_write_end helper

Bits of the buffer.c based write_end implementations that don't know
about buffer_heads and can be reused by other implementations.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 67 +++++++++++++++++++++++++++------------------------
fs/internal.h | 2 ++
2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index cabc045f483d..aba2a948b235 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2076,6 +2076,40 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
}
EXPORT_SYMBOL(block_write_begin);

+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+ struct page *page)
+{
+ loff_t old_size = inode->i_size;
+ bool i_size_changed = false;
+
+ /*
+ * No need to use i_size_read() here, the i_size cannot change under us
+ * because we hold i_rwsem.
+ *
+ * But it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ if (pos + copied > inode->i_size) {
+ i_size_write(inode, pos + copied);
+ i_size_changed = true;
+ }
+
+ unlock_page(page);
+ put_page(page);
+
+ if (old_size < pos)
+ pagecache_isize_extended(inode, old_size, pos);
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (i_size_changed)
+ mark_inode_dirty(inode);
+ return copied;
+}
+
int block_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
@@ -2116,39 +2150,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
{
- struct inode *inode = mapping->host;
- loff_t old_size = inode->i_size;
- int i_size_changed = 0;
-
copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- *
- * But it's important to update i_size while still holding page lock:
- * page writeout could otherwise come in and zero beyond i_size.
- */
- if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
- i_size_changed = 1;
- }
-
- unlock_page(page);
- put_page(page);
-
- if (old_size < pos)
- pagecache_isize_extended(inode, old_size, pos);
- /*
- * Don't mark the inode dirty under page lock. First, it unnecessarily
- * makes the holding time of page lock longer. Second, it forces lock
- * ordering of page lock and transaction start for journaling
- * filesystems.
- */
- if (i_size_changed)
- mark_inode_dirty(inode);
-
- return copied;
+ return __generic_write_end(mapping->host, pos, copied, page);
}
EXPORT_SYMBOL(generic_write_end);

diff --git a/fs/internal.h b/fs/internal.h
index 980d005b21b4..4a18bdbd2214 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -43,6 +43,8 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
extern void guard_bio_eod(int rw, struct bio *bio);
extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap);
+int __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
+ struct page *page);

/*
* char_dev.c
--
2.17.1

2018-07-13 02:58:53

by Goldwyn Rodrigues

[permalink] [raw]
Subject: Re: [PATCH 5/6] iomap: add a page_done callback

On 18:41 19/06, Christoph Hellwig wrote:
> This will be used by gfs2 to attach data to transactions for the journaled
> data mode. But the concept is generic enough that we might be able to
> use it for other purposes like encryption/integrity post-processing in the
> future.
>
> Based on a patch from Andreas Gruenbacher.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/iomap.c | 3 +++
> include/linux/iomap.h | 9 +++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 4aecd7c5dbd8..a1f71e64ea49 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -201,6 +201,9 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> copied, page, NULL);
> }
>
> + if (iomap->page_done)
> + iomap->page_done(inode, pos, copied, page, iomap);
> +
> if (ret < len)
> iomap_write_failed(inode, pos, len);
> return ret;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 10d6cff7f69a..45f43865b0f0 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -9,6 +9,7 @@ struct fiemap_extent_info;
> struct inode;
> struct iov_iter;
> struct kiocb;
> +struct page;
> struct vm_area_struct;
> struct vm_fault;
>
> @@ -56,6 +57,14 @@ struct iomap {
> struct block_device *bdev; /* block device for I/O */
> struct dax_device *dax_dev; /* dax_dev for dax operations */
> void *inline_data;
> +
> + /*
> + * Called when finished processing a page in the mapping returned in
> + * this iomap. At least for now this is only supported in the buffered
> + * write path.
> + */
> + void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
> + struct page *page, struct iomap *iomap);
> };
>

Would it be better if we keep this function as a part of iomap_ops? I
know gfs2 sets this conditionally, but we can always check the condition
in the function assigned to page_done(). It would keep all the functions
in one place.

--
Goldwyn

2018-07-17 14:02:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 5/6] iomap: add a page_done callback

On Thu, Jul 12, 2018 at 09:58:53PM -0500, Goldwyn Rodrigues wrote:
> Would it be better if we keep this function as a part of iomap_ops? I
> know gfs2 sets this conditionally, but we can always check the condition
> in the function assigned to page_done(). It would keep all the functions
> in one place.

It works at a very different level than the other iomap ops. We had
extensive discussions on previous versions, it should all be on
the list.