2012-11-20 07:41:44

by Darrick J. Wong

Subject: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Hi everybody,

On March 29th, Jeff Moyer posted to lkml a patchset with this note:

> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> are sync'd before the I/O is actually issued (everybody else). The following
> patch series fixes this in two parts. First, for the file systems that use
> the generic routines, Jan has provided some generic infrastructure to perform
> the syncs after the I/O is completed. Second, for those file systems which
> require some endio processing of their own for O_DIRECT writes (xfs and
> ext4), [Jeff] implemented file system specific syncing. This passes the
> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> in the aio group. [Jeff] tested ext3, ext4, xfs, and btrfs only.

Since the original post a few months ago, this patchset doesn't seem to have
made any progress. An internal testing team here discovered that the issue
also affects O_SYNC+AIO+DIO writes to block devices. Worse yet, since the
flushes were being issued (and waited upon) directly in the io_submit call
graph, the io_submit calls themselves would take a very long time to complete.
Therefore, I added another patch to move the flush to the io_end processing.
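
For reference, the I/O pattern in question is an O_SYNC|O_DIRECT file driven
through io_submit(). Below is a minimal reproducer sketch (assuming libaio;
the device path, buffer size, and 4096-byte alignment are placeholders):

/* Sketch: one AIO DIO write on an O_SYNC fd; build with -laio. */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	fd = open("/dev/sdX", O_WRONLY | O_DIRECT | O_SYNC); /* placeholder */
	if (fd < 0 || io_setup(1, &ctx) < 0)
		return 1;
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0x5a, 4096);

	io_prep_pwrite(&cb, fd, buf, 4096, 0);
	if (io_submit(ctx, 1, cbs) != 1) /* should return quickly... */
		return 1;
	/* ...with the cache flush done before this completion is reaped */
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;
	printf("res=%ld\n", (long)ev.res);
	return 0;
}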

The blockdev patch was written by me. The ext4 patch had to be updated to
accommodate a rework of the ext4 endio code that landed since March. Everything
else has been passed through from Jeff's March 30th resend, with few changes.

This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.

Comments and questions are, as always, welcome.

--D


2012-11-20 07:41:38

by Darrick J. Wong

Subject: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

Hi,

Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
function that can be used from the I/O post-processing path.

From: Jeff Moyer <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
---
fs/xfs/xfs_file.c | 44 +++++++++++++++++++++++++++++---------------
fs/xfs/xfs_inode.h | 1 +
2 files changed, 30 insertions(+), 15 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index aa473fa..507f446 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -152,25 +152,18 @@ xfs_dir_fsync(
return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
}

-STATIC int
-xfs_file_fsync(
- struct file *file,
- loff_t start,
- loff_t end,
+/*
+ * Returns 0 on success, -errno on failure.
+ */
+int
+do_xfs_file_fsync(
+ struct xfs_inode *ip,
+ struct xfs_mount *mp,
int datasync)
{
- struct inode *inode = file->f_mapping->host;
- struct xfs_inode *ip = XFS_I(inode);
- struct xfs_mount *mp = ip->i_mount;
- int error = 0;
int log_flushed = 0;
xfs_lsn_t lsn = 0;
-
- trace_xfs_file_fsync(ip);
-
- error = filemap_write_and_wait_range(inode->i_mapping, start, end);
- if (error)
- return error;
+ int error = 0;

if (XFS_FORCED_SHUTDOWN(mp))
return -XFS_ERROR(EIO);
@@ -222,6 +215,27 @@ xfs_file_fsync(
return -error;
}

+STATIC int
+xfs_file_fsync(
+ struct file *file,
+ loff_t start,
+ loff_t end,
+ int datasync)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ int error = 0;
+
+ trace_xfs_file_fsync(ip);
+
+ error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+ if (error)
+ return error;
+
+ return do_xfs_file_fsync(ip, mp, datasync);
+}
+
STATIC ssize_t
xfs_file_aio_read(
struct kiocb *iocb,
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 94b32f9..d5bf105 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -546,6 +546,7 @@ do { \
iput(VFS_I(ip)); \
} while (0)

+int do_xfs_file_fsync(struct xfs_inode *, struct xfs_mount *, int);
#endif /* __KERNEL__ */

/*

2012-11-20 07:41:23

by Darrick J. Wong

Subject: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystems wanting to
use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO. If the
filesystem doesn't provide its own direct IO end_io handler, the generic code
will take care of issuing the flush. Otherwise, the filesystem's custom end_io
handler is passed a struct dio_sync_io_work pointer as its 'private' argument, and it
must call generic_dio_end_io() to finish the AIO DIO. The generic code then
takes care to call generic_write_sync() from a workqueue context when AIO DIO
is complete.

Since all filesystems using blockdev_direct_IO() need O_SYNC AIO DIO handling
and the generic code suffices for them, make blockdev_direct_IO() pass the new
DIO_SYNC_WRITES flag.
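
As an illustration of the calling convention (a sketch only, not part of the
patch; the foo_* names and foo_get_block are placeholder identifiers), a
filesystem with its own end_io handler would opt in like this:

/* Sketch of how a filesystem might wire this up; foo_* are placeholders. */
static void foo_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
		       void *private, int ret, bool is_async)
{
	/* filesystem-specific completion work (e.g. unwritten extent
	 * conversion) would go here */

	/* 'private' is the struct dio_sync_io_work pointer when the
	 * generic code decided this write needs a post-I/O sync */
	generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
}

static ssize_t foo_direct_IO(int rw, struct kiocb *iocb,
			     const struct iovec *iov, loff_t offset,
			     unsigned long nr_segs)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
				    iov, offset, nr_segs, foo_get_block,
				    foo_end_io, NULL,
				    DIO_LOCKING | DIO_SYNC_WRITES);
}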

From: Jan Kara <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
[[email protected]: Minor style and changelog fixes]
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/direct-io.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/super.c | 2 +
include/linux/fs.h | 14 +++++-
3 files changed, 137 insertions(+), 5 deletions(-)


diff --git a/fs/direct-io.c b/fs/direct-io.c
index f86c720..b7391d4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
#include <linux/uio.h>
#include <linux/atomic.h>
#include <linux/prefetch.h>
+#include <asm/cmpxchg.h>

/*
* How many user pages to map in one call to get_user_pages(). This determines
@@ -112,6 +113,15 @@ struct dio_submit {
unsigned tail; /* last valid page + 1 */
};

+/* state needed for final sync and completion of O_SYNC AIO DIO */
+struct dio_sync_io_work {
+ struct kiocb *iocb;
+ loff_t offset;
+ ssize_t len;
+ int ret;
+ struct work_struct work;
+};
+
/* dio_state communicated between submission path and end_io */
struct dio {
int flags; /* doesn't change */
@@ -134,6 +144,7 @@ struct dio {
/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
ssize_t result; /* IO result */
+ struct dio_sync_io_work *sync_work; /* work used for O_SYNC AIO */

/*
* pages[] (and any fields placed after it) are not zeroed out at
@@ -217,6 +228,45 @@ static inline struct page *dio_get_page(struct dio *dio,
}

/**
+ * generic_dio_end_io() - generic dio ->end_io handler
+ * @iocb: iocb of finishing DIO
+ * @offset: the byte offset in the file of the completed operation
+ * @bytes: length of the completed operation
+ * @work: work to queue for O_SYNC AIO DIO, NULL otherwise
+ * @ret: error code if IO failed
+ * @is_async: is this AIO?
+ *
+ * This is generic callback to be called when direct IO is finished. It
+ * handles update of number of outstanding DIOs for an inode, completion
+ * of async iocb and queueing of work if we need to call fsync() because
+ * io was O_SYNC.
+ */
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+ struct dio_sync_io_work *work, int ret, bool is_async)
+{
+ struct inode *inode = iocb->ki_filp->f_dentry->d_inode;
+
+ if (!is_async) {
+ inode_dio_done(inode);
+ return;
+ }
+
+ /*
+ * If we need to sync file, we offload completion to workqueue
+ */
+ if (work) {
+ work->ret = ret;
+ work->offset = offset;
+ work->len = bytes;
+ queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+ } else {
+ aio_complete(iocb, ret, 0);
+ inode_dio_done(inode);
+ }
+}
+EXPORT_SYMBOL(generic_dio_end_io);
+
+/**
* dio_complete() - called when all DIO BIO I/O has been completed
* @offset: the byte offset in the file of the completed operation
*
@@ -258,12 +308,23 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
ret = transferred;

if (dio->end_io && dio->result) {
+ void *private;
+
+ if (dio->sync_work)
+ private = dio->sync_work;
+ else
+ private = dio->private;
+
dio->end_io(dio->iocb, offset, transferred,
- dio->private, ret, is_async);
+ private, ret, is_async);
} else {
- if (is_async)
- aio_complete(dio->iocb, ret, 0);
- inode_dio_done(dio->inode);
+ /* No IO submitted? Skip syncing... */
+ if (!dio->result && dio->sync_work) {
+ kfree(dio->sync_work);
+ dio->sync_work = NULL;
+ }
+ generic_dio_end_io(dio->iocb, offset, transferred,
+ dio->sync_work, ret, is_async);
}

return ret;
@@ -1020,6 +1081,41 @@ static inline int drop_refcount(struct dio *dio)
}

/*
+ * Work performed from workqueue when AIO DIO is finished.
+ */
+static void dio_aio_sync_work(struct work_struct *work)
+{
+ struct dio_sync_io_work *sync_work =
+ container_of(work, struct dio_sync_io_work, work);
+ struct kiocb *iocb = sync_work->iocb;
+ struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
+ int err, ret = sync_work->ret;
+
+ err = generic_write_sync(iocb->ki_filp, sync_work->offset,
+ sync_work->len);
+ if (err < 0 && ret > 0)
+ ret = err;
+ aio_complete(iocb, ret, 0);
+ inode_dio_done(inode);
+}
+
+static noinline int dio_create_flush_wq(struct super_block *sb)
+{
+ struct workqueue_struct *wq =
+ alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
+
+ if (!wq)
+ return -ENOMEM;
+ /*
+ * Atomically put workqueue in place. Release our one in case someone
+ * else won the race and attached workqueue to superblock.
+ */
+ if (cmpxchg(&sb->s_dio_flush_wq, NULL, wq))
+ destroy_workqueue(wq);
+ return 0;
+}
+
+/*
* This is a library function for use by filesystem drivers.
*
* The locking rules are governed by the flags parameter:
@@ -1112,6 +1208,26 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
memset(dio, 0, offsetof(struct dio, pages));

dio->flags = flags;
+ if (flags & DIO_SYNC_WRITES && rw & WRITE &&
+ ((iocb->ki_filp->f_flags & O_DSYNC) || IS_SYNC(inode))) {
+ /* The first O_SYNC AIO DIO for this FS? Create workqueue... */
+ if (!inode->i_sb->s_dio_flush_wq) {
+ retval = dio_create_flush_wq(inode->i_sb);
+ if (retval) {
+ kmem_cache_free(dio_cache, dio);
+ goto out;
+ }
+ }
+ dio->sync_work = kmalloc(sizeof(struct dio_sync_io_work),
+ GFP_KERNEL);
+ if (!dio->sync_work) {
+ retval = -ENOMEM;
+ kmem_cache_free(dio_cache, dio);
+ goto out;
+ }
+ INIT_WORK(&dio->sync_work->work, dio_aio_sync_work);
+ dio->sync_work->iocb = iocb;
+ }
if (dio->flags & DIO_LOCKING) {
if (rw == READ) {
struct address_space *mapping =
@@ -1124,6 +1240,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
end - 1);
if (retval) {
mutex_unlock(&inode->i_mutex);
+ kfree(dio->sync_work);
kmem_cache_free(dio_cache, dio);
goto out;
}
@@ -1271,6 +1388,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,

if (drop_refcount(dio) == 0) {
retval = dio_complete(dio, offset, retval, false);
+ kfree(dio->sync_work);
kmem_cache_free(dio_cache, dio);
} else
BUG_ON(retval != -EIOCBQUEUED);
diff --git a/fs/super.c b/fs/super.c
index 12f1237..7478202 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -244,6 +244,8 @@ static inline void destroy_super(struct super_block *s)
#ifdef CONFIG_SMP
free_percpu(s->s_files);
#endif
+ if (s->s_dio_flush_wq)
+ destroy_workqueue(s->s_dio_flush_wq);
destroy_sb_writers(s);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..40dd206 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -44,6 +44,7 @@ struct vm_area_struct;
struct vfsmount;
struct cred;
struct swap_info_struct;
+struct workqueue_struct;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1321,6 +1322,9 @@ struct super_block {

/* Being remounted read-only */
int s_readonly_remount;
+
+ /* Pending fsync calls for completed AIO DIO with O_SYNC */
+ struct workqueue_struct *s_dio_flush_wq;
};

/* superblock cache pruning functions */
@@ -2433,8 +2437,13 @@ enum {

/* filesystem does not support filling holes */
DIO_SKIP_HOLES = 0x02,
+
+ /* need generic handling of O_SYNC aio writes */
+ DIO_SYNC_WRITES = 0x04,
};

+struct dio_sync_io_work;
+
void dio_end_io(struct bio *bio, int error);

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -2448,12 +2457,15 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
{
return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
offset, nr_segs, get_block, NULL, NULL,
- DIO_LOCKING | DIO_SKIP_HOLES);
+ DIO_LOCKING | DIO_SKIP_HOLES |
+ DIO_SYNC_WRITES);
}
#endif

void inode_dio_wait(struct inode *inode);
void inode_dio_done(struct inode *inode);
+void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
+ struct dio_sync_io_work *work, int ret, bool is_async);

extern const struct file_operations generic_ro_fops;



2012-11-20 07:41:31

by Darrick J. Wong

Subject: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
flushed after the write completion. Instead, it's flushed *before* the
I/O is sent to the disk (in __generic_file_aio_write). This patch
attempts to fix that problem by marking an I/O as requiring a cache
flush in endio processing. I'll send a follow-on patch to the
generic write code to get rid of the bogus generic_write_sync call
when EIOCBQUEUED is returned.

From: Jeff Moyer <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
[[email protected]: Rework original patch to reflect a subsequent
ext4 reorganization]
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 9 +++++
fs/ext4/file.c | 2 +
fs/ext4/inode.c | 6 +++
fs/ext4/page-io.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--------
fs/ext4/super.c | 13 +++++++
5 files changed, 106 insertions(+), 16 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c20de1..f5a0b89 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -186,6 +186,7 @@ struct mpage_da_data {
#define EXT4_IO_END_ERROR 0x0002
#define EXT4_IO_END_QUEUED 0x0004
#define EXT4_IO_END_DIRECT 0x0008
+#define EXT4_IO_END_NEEDS_SYNC 0x0010

struct ext4_io_page {
struct page *p_page;
@@ -1279,6 +1280,9 @@ struct ext4_sb_info {
/* workqueue for dio unwritten */
struct workqueue_struct *dio_unwritten_wq;

+ /* workqueue for aio+dio+o_sync disk cache flushing */
+ struct workqueue_struct *aio_dio_flush_wq;
+
/* timer for periodic error stats printing */
struct timer_list s_err_report;

@@ -1335,6 +1339,11 @@ static inline void ext4_set_io_unwritten_flag(struct inode *inode,
}
}

+static inline int ext4_io_end_deferred(ext4_io_end_t *i)
+{
+ return i->flag & (EXT4_IO_END_UNWRITTEN | EXT4_IO_END_NEEDS_SYNC);
+}
+
static inline ext4_io_end_t *ext4_inode_aio(struct inode *inode)
{
return inode->i_private;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bf3966b..dd34b81 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -155,7 +155,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);

- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;

err = generic_write_sync(file, pos, ret);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..fc4f05e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2893,8 +2893,12 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,

iocb->private = NULL;

+ /* AIO+DIO+O_SYNC I/Os need a cache flush after completion */
+ if (is_async && (IS_SYNC(inode) || (iocb->ki_filp->f_flags & O_DSYNC)))
+ io_end->flag |= EXT4_IO_END_NEEDS_SYNC;
+
/* if not aio dio with unwritten extents, just free io and return */
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ if (!ext4_io_end_deferred(io_end)) {
ext4_free_io_end(io_end);
out:
if (is_async)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 68e896e..cda013a 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -84,6 +84,50 @@ void ext4_free_io_end(ext4_io_end_t *io)
kmem_cache_free(io_end_cachep, io);
}

+/*
+ * This function is called in the completion path for AIO O_SYNC|O_DIRECT
+ * writes, and also in the fsync path. The purpose is to ensure that the
+ * disk caches for the journal and data devices are flushed.
+ */
+static int ext4_end_io_do_flush(ext4_io_end_t *io)
+{
+ struct inode *inode = io->inode;
+ tid_t commit_tid;
+ bool needs_barrier = false;
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ int barriers_enabled = test_opt(inode->i_sb, BARRIER);
+ int ret = 0;
+
+ if (!barriers_enabled)
+ return 0;
+
+ /*
+ * If we are running in nojournal mode, just flush the disk
+ * cache and return.
+ */
+ if (!journal)
+ return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
+
+ if (ext4_should_journal_data(inode)) {
+ ret = ext4_force_commit(inode->i_sb);
+ goto out;
+ }
+
+ commit_tid = io->iocb->ki_filp->f_flags & __O_SYNC ?
+ EXT4_I(inode)->i_sync_tid : EXT4_I(inode)->i_datasync_tid;
+ if (!jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ needs_barrier = true;
+
+ jbd2_log_start_commit(journal, commit_tid);
+ ret = jbd2_log_wait_commit(journal, commit_tid);
+
+ if (!ret && needs_barrier)
+ ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
+
+out:
+ return ret;
+}
+
/* check a range of space and convert unwritten extents to written. */
static int ext4_end_io(ext4_io_end_t *io)
{
@@ -96,21 +140,37 @@ static int ext4_end_io(ext4_io_end_t *io)
"list->prev 0x%p\n",
io, inode->i_ino, io->list.next, io->list.prev);

- ret = ext4_convert_unwritten_extents(inode, offset, size);
- if (ret < 0) {
- ext4_msg(inode->i_sb, KERN_EMERG,
- "failed to convert unwritten extents to written "
- "extents -- potential data loss! "
- "(inode %lu, offset %llu, size %zd, error %d)",
- inode->i_ino, offset, size, ret);
+ if (io->flag & EXT4_IO_END_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(inode, offset, size);
+ if (ret < 0) {
+ ext4_msg(inode->i_sb, KERN_EMERG,
+ "failed to convert unwritten extents to "
+ "written extents -- potential data loss! "
+ "(inode %lu, offset %llu, size %zd, error %d)",
+ inode->i_ino, offset, size, ret);
+ goto endio;
+ }
}
+
+ /*
+ * This function has two callers. The first is the end_io_work
+ * routine just below, which is an asynchronous completion context.
+ * The second is in the fsync path. For the latter path, we can't
+ * return from here until the job is done. Hence,
+ * ext4_end_io_do_flush is blocking.
+ */
+ if (io->flag & EXT4_IO_END_NEEDS_SYNC)
+ ret = ext4_end_io_do_flush(io);
+
+endio:
if (io->iocb)
aio_complete(io->iocb, io->result, 0);

if (io->flag & EXT4_IO_END_DIRECT)
inode_dio_done(inode);
/* Wake up anyone waiting on unwritten extent conversion */
- if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
+ if (io->flag & EXT4_IO_END_UNWRITTEN &&
+ atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
wake_up_all(ext4_ioend_wq(io->inode));
return ret;
}
@@ -149,8 +209,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
struct workqueue_struct *wq;
unsigned long flags;

- BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
- wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+ BUG_ON(!ext4_io_end_deferred(io_end));
+ if (io_end->flag & EXT4_IO_END_UNWRITTEN)
+ wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
+ else
+ wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;

spin_lock_irqsave(&ei->i_completed_io_lock, flags);
if (list_empty(&ei->i_completed_io_list)) {
@@ -180,7 +243,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,

while (!list_empty(&unwritten)) {
io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
+ BUG_ON(!ext4_io_end_deferred(io));
list_del_init(&io->list);

err = ext4_end_io(io);
@@ -192,7 +255,8 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
spin_lock_irqsave(&ei->i_completed_io_lock, flags);
while (!list_empty(&complete)) {
io = list_entry(complete.next, ext4_io_end_t, list);
- io->flag &= ~EXT4_IO_END_UNWRITTEN;
+ io->flag &= ~(EXT4_IO_END_UNWRITTEN |
+ EXT4_IO_END_NEEDS_SYNC);
/* end_io context can not be destroyed now because it still
* used by queued worker. Worker thread will destroy it later */
if (io->flag & EXT4_IO_END_QUEUED)
@@ -204,7 +268,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
* flag, and destroy it's end_io if it was converted already */
if (work_io) {
work_io->flag &= ~EXT4_IO_END_QUEUED;
- if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
+ if (!ext4_io_end_deferred(work_io))
list_add_tail(&work_io->list, &to_free);
}
spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
@@ -319,7 +383,7 @@ static void ext4_end_bio(struct bio *bio, int error)
bi_sector >> (inode->i_blkbits - 9));
}

- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+ if (!ext4_io_end_deferred(io_end)) {
ext4_free_io_end(io_end);
return;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80928f7..c657c4d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -849,7 +849,9 @@ static void ext4_put_super(struct super_block *sb)
dquot_disable(sb, -1, DQUOT_USAGE_ENABLED | DQUOT_LIMITS_ENABLED);

flush_workqueue(sbi->dio_unwritten_wq);
+ flush_workqueue(sbi->aio_dio_flush_wq);
destroy_workqueue(sbi->dio_unwritten_wq);
+ destroy_workqueue(sbi->aio_dio_flush_wq);

if (sbi->s_journal) {
err = jbd2_journal_destroy(sbi->s_journal);
@@ -3913,6 +3915,14 @@ no_journal:
goto failed_mount_wq;
}

+ EXT4_SB(sb)->aio_dio_flush_wq =
+ alloc_workqueue("ext4-aio-dio-flush",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+ if (!EXT4_SB(sb)->aio_dio_flush_wq) {
+ pr_err("EXT4-fs: failed to create flush workqueue\n");
+ goto failed_flush_wq;
+ }
+
/*
* The jbd2_journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
@@ -4045,6 +4055,8 @@ failed_mount4a:
sb->s_root = NULL;
failed_mount4:
ext4_msg(sb, KERN_ERR, "mount failed");
+ destroy_workqueue(EXT4_SB(sb)->aio_dio_flush_wq);
+failed_flush_wq:
destroy_workqueue(EXT4_SB(sb)->dio_unwritten_wq);
failed_mount_wq:
if (sbi->s_journal) {
@@ -4494,6 +4506,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)

trace_ext4_sync_fs(sb, wait);
flush_workqueue(sbi->dio_unwritten_wq);
+ flush_workqueue(sbi->aio_dio_flush_wq);
/*
* Writeback quota in non-journalled quota case - journalled quota has
* no dirty dquots

2012-11-20 08:16:16

by Darrick J. Wong

Subject: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
flushed after the write completion for AIOs. This patch attempts to fix
that problem by marking an I/O as requiring a cache flush in endio
processing, and then issuing the cache flush after any unwritten extent
conversion is done.

From: Jeff Moyer <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
[[email protected]: Rework patch to use per-mount workqueues]
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/xfs/xfs_aops.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_aops.h | 1 +
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 8 ++++++++
4 files changed, 61 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e57e2da..9cebbb7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -173,6 +173,24 @@ xfs_setfilesize(
}

/*
+ * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
+ * the disk cache when the I/O is complete.
+ */
+STATIC bool
+xfs_ioend_needs_cache_flush(
+ struct xfs_ioend *ioend)
+{
+ struct xfs_inode *ip = XFS_I(ioend->io_inode);
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (!(mp->m_flags & XFS_MOUNT_BARRIER))
+ return false;
+
+ return IS_SYNC(ioend->io_inode) ||
+ (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
+}
+
+/*
* Schedule IO completion handling on the final put of an ioend.
*
* If there is no work to do we might as well call it a day and free the
@@ -189,11 +207,30 @@ xfs_finish_ioend(
queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
else if (ioend->io_append_trans)
queue_work(mp->m_data_workqueue, &ioend->io_work);
+ else if (ioend->io_needs_fsync)
+ queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
else
xfs_destroy_ioend(ioend);
}
}

+STATIC int
+xfs_ioend_force_cache_flush(
+ xfs_ioend_t *ioend)
+{
+ struct xfs_inode *ip = XFS_I(ioend->io_inode);
+ struct xfs_mount *mp = ip->i_mount;
+ int err = 0;
+ int datasync;
+
+ datasync = !IS_SYNC(ioend->io_inode) &&
+ !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
+ err = do_xfs_file_fsync(ip, mp, datasync);
+ xfs_destroy_ioend(ioend);
+ /* do_xfs_file_fsync returns -errno. our caller expects positive. */
+ return -err;
+}
+
/*
* IO write completion.
*/
@@ -250,12 +287,22 @@ xfs_end_io(
error = xfs_setfilesize(ioend);
if (error)
ioend->io_error = -error;
+ } else if (ioend->io_needs_fsync) {
+ error = xfs_ioend_force_cache_flush(ioend);
+ if (error && ioend->io_result > 0)
+ ioend->io_error = -error;
+ ioend->io_needs_fsync = 0;
} else {
ASSERT(!xfs_ioend_is_append(ioend));
}

done:
- xfs_destroy_ioend(ioend);
+ /* the honoring of O_SYNC has to be done last */
+ if (ioend->io_needs_fsync) {
+ atomic_inc(&ioend->io_remaining);
+ xfs_finish_ioend(ioend);
+ } else
+ xfs_destroy_ioend(ioend);
}

/*
@@ -292,6 +339,7 @@ xfs_alloc_ioend(
atomic_set(&ioend->io_remaining, 1);
ioend->io_isasync = 0;
ioend->io_isdirect = 0;
+ ioend->io_needs_fsync = 0;
ioend->io_error = 0;
ioend->io_list = NULL;
ioend->io_type = type;
@@ -1409,6 +1457,8 @@ xfs_end_io_direct_write(

if (is_async) {
ioend->io_isasync = 1;
+ if (xfs_ioend_needs_cache_flush(ioend))
+ ioend->io_needs_fsync = 1;
xfs_finish_ioend(ioend);
} else {
xfs_finish_ioend_sync(ioend);
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index c325abb..e48c7c2 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -47,6 +47,7 @@ typedef struct xfs_ioend {
atomic_t io_remaining; /* hold count */
unsigned int io_isasync : 1; /* needs aio_complete */
unsigned int io_isdirect : 1;/* direct I/O */
+ unsigned int io_needs_fsync : 1; /* aio+dio+o_sync */
struct inode *io_inode; /* file being written to */
struct buffer_head *io_buffer_head;/* buffer linked list head */
struct buffer_head *io_buffer_tail;/* buffer linked list tail */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index deee09e..ecd3d2e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -209,6 +209,7 @@ typedef struct xfs_mount {
struct workqueue_struct *m_data_workqueue;
struct workqueue_struct *m_unwritten_workqueue;
struct workqueue_struct *m_cil_workqueue;
+ struct workqueue_struct *m_aio_blkdev_flush_wq;
} xfs_mount_t;

/*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 26a09bd..b05b557 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -863,8 +863,15 @@ xfs_init_mount_workqueues(
WQ_MEM_RECLAIM, 0, mp->m_fsname);
if (!mp->m_cil_workqueue)
goto out_destroy_unwritten;
+
+ mp->m_aio_blkdev_flush_wq = alloc_workqueue("xfs-aio-blkdev-flush/%s",
+ WQ_MEM_RECLAIM, 0, mp->m_fsname);
+ if (!mp->m_aio_blkdev_flush_wq)
+ goto out_destroy_cil_queue;
return 0;

+out_destroy_cil_queue:
+ destroy_workqueue(mp->m_cil_workqueue);
out_destroy_unwritten:
destroy_workqueue(mp->m_unwritten_workqueue);
out_destroy_data_iodone_queue:
@@ -877,6 +884,7 @@ STATIC void
xfs_destroy_mount_workqueues(
struct xfs_mount *mp)
{
+ destroy_workqueue(mp->m_aio_blkdev_flush_wq);
destroy_workqueue(mp->m_cil_workqueue);
destroy_workqueue(mp->m_data_workqueue);
destroy_workqueue(mp->m_unwritten_workqueue);



2012-11-20 07:51:14

by Darrick J. Wong

Subject: [PATCH 6/9] gfs2: Use generic handlers of O_SYNC AIO DIO

Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file.

From: Jan Kara <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
---
fs/gfs2/aops.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 01c4975..8b1d0f7 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1022,7 +1022,7 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,

rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
offset, nr_segs, gfs2_get_block_direct,
- NULL, NULL, 0);
+ NULL, NULL, DIO_SYNC_WRITES);
out:
gfs2_glock_dq(&gh);
gfs2_holder_uninit(&gh);

2012-11-20 07:51:14

by Darrick J. Wong

Subject: [PATCH 5/9] btrfs: Use generic handlers of O_SYNC AIO DIO

Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file. Although we use our own bio->end_io function, we call dio_end_io() from
it; because we don't set any specific dio->end_io function, the generic code
ends up calling generic_dio_end_io(), which is all we need for proper O_SYNC
AIO DIO handling.

From: Jan Kara <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
[[email protected]: Don't issue flush if aio is queued]
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/btrfs/file.c | 2 +-
fs/btrfs/inode.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9ab1bed..37b5bb3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1495,7 +1495,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
* one running right now.
*/
BTRFS_I(inode)->last_trans = root->fs_info->generation + 1;
- if (num_written > 0 || num_written == -EIOCBQUEUED) {
+ if (num_written > 0) {
err = generic_write_sync(file, pos, num_written);
if (err < 0 && num_written > 0)
num_written = err;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95542a1..c8b6049 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6579,7 +6579,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
return __blockdev_direct_IO(rw, iocb, inode,
BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
- btrfs_submit_direct, 0);
+ btrfs_submit_direct, DIO_SYNC_WRITES);
}

static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,

2012-11-20 08:19:30

by Darrick J. Wong

Subject: [PATCH 7/9] ocfs2: Use generic handlers of O_SYNC AIO DIO

Use generic handlers to queue fsync() when AIO DIO is completed for an O_SYNC
file.

From: Jan Kara <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
---
fs/ocfs2/aops.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)


diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 6577432..60457cc 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
level = ocfs2_iocb_rw_locked_level(iocb);
ocfs2_rw_unlock(inode, level);

- if (is_async)
- aio_complete(iocb, ret, 0);
- inode_dio_done(inode);
+ generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
}

/*
@@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
iov, offset, nr_segs,
ocfs2_direct_IO_get_blocks,
- ocfs2_dio_end_io, NULL, 0);
+ ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
}

static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,



2012-11-20 07:51:15

by Darrick J. Wong

Subject: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
flag so that flushes are issued /after/ the write completes, not before.

Note, however, that for block devices, the DIO setup code ensures that a flush
wq is attached to the superblock of the bdevfs filesystem, not the filesystem
that the device node happens to reside in. This means that unlike regular
files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb. Therefore, adjust
Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/block_dev.c | 5 +++--
fs/direct-io.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)


diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..05ff33a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
struct inode *inode = file->f_mapping->host;

return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
- nr_segs, blkdev_get_blocks, NULL, NULL, 0);
+ nr_segs, blkdev_get_blocks, NULL, NULL,
+ DIO_SYNC_WRITES);
}

int __sync_blockdev(struct block_device *bdev, int wait)
@@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
percpu_down_read(&bdev->bd_block_size_semaphore);

ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;

err = generic_write_sync(file, pos, ret);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b7391d4..c626c43 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
work->ret = ret;
work->offset = offset;
work->len = bytes;
- queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
+ queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
+ &work->work);
} else {
aio_complete(iocb, ret, 0);
inode_dio_done(inode);

2012-11-20 08:19:37

by Darrick J. Wong

Subject: [PATCH 8/9] filemap: don't call generic_write_sync for -EIOCBQUEUED

Hi,

As it stands, generic_file_aio_write will call into generic_write_sync
when -EIOCBQUEUED is returned from __generic_file_aio_write. EIOCBQUEUED
indicates that an I/O was submitted but NOT completed. Thus, we will
flush the disk cache, potentially before the write(s) even make it to
the disk! Up until now, this has been the best we could do, as file
systems didn't bother to flush the disk cache after an O_SYNC AIO+DIO
write. After applying the prior two patches to xfs and ext4, at least
the major two file systems do the right thing. So, let's go ahead and
fix this backwards logic.

From: Jeff Moyer <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..8e14c10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2532,7 +2532,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);

- if (ret > 0 || ret == -EIOCBQUEUED) {
+ if (ret > 0) {
ssize_t err;

err = generic_write_sync(file, pos, ret);



2012-11-20 08:38:51

by Darrick J. Wong

Subject: Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

Apologies for the garbage emails. The corporate email server lost its mind,
and I'm well on my way to losing mine.... <grumble> <shakes fist>

They at least were so garbled that they're not attached to this thread.

--D

> Hi everybody,
>
> On March 29th, Jeff Moyer posted to lkml a patchset with this note:
>
> > Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> > are sync'd before the I/O is actually issued (everybody else). The following
> > patch series fixes this in two parts. First, for the file systems that use
> > the generic routines, Jan has provided some generic infrastructure to perform
> > the syncs after the I/O is completed. Second, for those file systems which
> > require some endio processing of their own for O_DIRECT writes (xfs and
> > ext4), [Jeff] implemented file system specific syncing. This passes the
> > updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> > in the aio group. [Jeff] tested ext3, ext4, xfs, and btrfs only.
>
> Since the original post a few months ago, this patchset doesn't seem to have
> made any progress. An internal testing team here discovered that the issue
> also affects O_SYNC+AIO+DIO writes to block devices. Worse yet, since the
> flushes were being issued (and waited upon) directly in the io_submit call
> graph, the io_submit calls themselves would take a very long time to complete.
> Therefore, I added another patch to move the flush to the io_end processing.
>
> The blockdev patch was written by me. The ext4 patch had to be updated to
> accommodate a rework of the ext4 endio code that landed since March. Everything
> else has been passed through from Jeff's March 30th resend, with few changes.
>
> This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
>
> Comments and questions are, as always, welcome.
>
> --D

2012-11-20 10:07:55

by Jan Kara

Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

On Mon 19-11-12 23:41:31, Darrick J. Wong wrote:
> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion. Instead, it's flushed *before* the
> I/O is sent to the disk (in __generic_file_aio_write). This patch
> attempts to fix that problem by marking an I/O as requiring a cache
> flush in endio processing. I'll send a follow-on patch to the
> generic write code to get rid of the bogus generic_write_sync call
> when EIOCBQUEUED is returned.
>
> From: Jeff Moyer <[email protected]>
> Signed-off-by: Jeff Moyer <[email protected]>
> [[email protected]: Rework original patch to reflect a subsequent
> ext4 reorganization]
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/ext4/ext4.h | 9 +++++
> fs/ext4/file.c | 2 +
> fs/ext4/inode.c | 6 +++
> fs/ext4/page-io.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--------
> fs/ext4/super.c | 13 +++++++
> 5 files changed, 106 insertions(+), 16 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3c20de1..f5a0b89 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -186,6 +186,7 @@ struct mpage_da_data {
> #define EXT4_IO_END_ERROR 0x0002
> #define EXT4_IO_END_QUEUED 0x0004
> #define EXT4_IO_END_DIRECT 0x0008
> +#define EXT4_IO_END_NEEDS_SYNC 0x0010
>
> struct ext4_io_page {
> struct page *p_page;
> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {
> /* workqueue for dio unwritten */
> struct workqueue_struct *dio_unwritten_wq;
>
> + /* workqueue for aio+dio+o_sync disk cache flushing */
> + struct workqueue_struct *aio_dio_flush_wq;
> +
Umm, I'm not completely decided whether we really need a separate
workqueue. But it doesn't cost too much so I guess it makes some sense -
fsync() is rather heavy so syncing won't starve extent conversion...

> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 68e896e..cda013a 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -84,6 +84,50 @@ void ext4_free_io_end(ext4_io_end_t *io)
> kmem_cache_free(io_end_cachep, io);
> }
>
> +/*
> + * This function is called in the completion path for AIO O_SYNC|O_DIRECT
> + * writes, and also in the fsync path. The purpose is to ensure that the
> + * disk caches for the journal and data devices are flushed.
> + */
> +static int ext4_end_io_do_flush(ext4_io_end_t *io)
> +{
> + struct inode *inode = io->inode;
> + tid_t commit_tid;
> + bool needs_barrier = false;
> + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + int barriers_enabled = test_opt(inode->i_sb, BARRIER);
> + int ret = 0;
> +
> + if (!barriers_enabled)
> + return 0;
This is definitely wrong. Even if barriers are disabled, we may need to
push out some buffers or commit a transaction.

> +
> + /*
> + * If we are running in nojournal mode, just flush the disk
> + * cache and return.
> + */
> + if (!journal)
> + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
And this is wrong as well - you need to do work similar to what
ext4_sync_file() does. Actually it would be *much* better if these two
sites used the same helper function. Which also poses an interesting
question about locking - do we need i_mutex or not? Forcing a transaction
commit is definitely OK without it, as is grabbing transaction ids
from the inode or the ext4_should_journal_data() test. The __sync_inode() call
seems to be OK without i_mutex as well, so I believe we can just get rid of it
(getting i_mutex from the workqueue is a locking nightmare we don't want to
return to).

> +
> + if (ext4_should_journal_data(inode)) {
> + ret = ext4_force_commit(inode->i_sb);
> + goto out;
> + }
> +
> + commit_tid = io->iocb->ki_filp->f_flags & __O_SYNC ?
> + EXT4_I(inode)->i_sync_tid : EXT4_I(inode)->i_datasync_tid;
> + if (!jbd2_trans_will_send_data_barrier(journal, commit_tid))
> + needs_barrier = true;
> +
> + jbd2_log_start_commit(journal, commit_tid);
> + ret = jbd2_log_wait_commit(journal, commit_tid);
> +
> + if (!ret && needs_barrier)
> + ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
> +
> +out:
> + return ret;
> +}
> +
> /* check a range of space and convert unwritten extents to written. */
> static int ext4_end_io(ext4_io_end_t *io)
> {
...
> @@ -149,8 +209,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
> struct workqueue_struct *wq;
> unsigned long flags;
>
> - BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> - wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> + BUG_ON(!ext4_io_end_deferred(io_end));
> + if (io_end->flag & EXT4_IO_END_UNWRITTEN)
> + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
> + else
> + wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;
Umm, I'd prefer if we used aio_dio_flush_wq when EXT4_IO_END_NEEDS_SYNC
is set. That way slow syncing work will always be offloaded to a separate
workqueue.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-20 10:16:02

by Jan Kara

Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:
> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
> flag so that flushes are issued /after/ the write completes, not before.
>
> Note, however, that for block devices, the DIO setup code ensures that a flush
> wq is attached to the superblock of the bdevfs filesystem, not the filesystem
> that the device node happens to reside in. This means that unlike regular
> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb. Therefore, adjust
> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/block_dev.c | 5 +++--
> fs/direct-io.c | 3 ++-
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..05ff33a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> struct inode *inode = file->f_mapping->host;
>
> return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
> - nr_segs, blkdev_get_blocks, NULL, NULL, 0);
> + nr_segs, blkdev_get_blocks, NULL, NULL,
> + DIO_SYNC_WRITES);
> }
>
> int __sync_blockdev(struct block_device *bdev, int wait)
> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
> percpu_down_read(&bdev->bd_block_size_semaphore);
>
> ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
> - if (ret > 0 || ret == -EIOCBQUEUED) {
> + if (ret > 0) {
> ssize_t err;
>
> err = generic_write_sync(file, pos, ret);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b7391d4..c626c43 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
> work->ret = ret;
> work->offset = offset;
> work->len = bytes;
> - queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
> + queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
> + &work->work);
This should be folded into the original patch introducing the
s_dio_flush_wq. And please add a comment before this line saying that block
devices need a dereference exactly like this... Otherwise the patch looks
good so you can add:
Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-20 10:24:23

by Jan Kara

Subject: Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

On Mon 19-11-12 23:51:14, Darrick J. Wong wrote:
> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion for AIOs. This patch attempts to fix
> that problem by marking an I/O as requiring a cache flush in endio
> processing, and then issuing the cache flush after any unwritten extent
> conversion is done.
>
> From: Jeff Moyer <[email protected]>
> Signed-off-by: Jeff Moyer <[email protected]>
> [[email protected]: Rework patch to use per-mount workqueues]
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/xfs/xfs_aops.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/xfs/xfs_aops.h | 1 +
> fs/xfs/xfs_mount.h | 1 +
> fs/xfs/xfs_super.c | 8 ++++++++
> 4 files changed, 61 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e57e2da..9cebbb7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -173,6 +173,24 @@ xfs_setfilesize(
> }
>
> /*
> + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
> + * the disk cache when the I/O is complete.
> + */
> +STATIC bool
> +xfs_ioend_needs_cache_flush(
> + struct xfs_ioend *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> +
> + if (!(mp->m_flags & XFS_MOUNT_BARRIER))
> + return false;
> +
> + return IS_SYNC(ioend->io_inode) ||
> + (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
> +}
I'm not really an XFS developer, but calling things "...cache_flush" when
you actually mean fsync is IMHO misleading.

> +
> +/*
> * Schedule IO completion handling on the final put of an ioend.
> *
> * If there is no work to do we might as well call it a day and free the
> @@ -189,11 +207,30 @@ xfs_finish_ioend(
> queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> else if (ioend->io_append_trans)
> queue_work(mp->m_data_workqueue, &ioend->io_work);
> + else if (ioend->io_needs_fsync)
> + queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
> else
> xfs_destroy_ioend(ioend);
> }
> }
>
> +STATIC int
> +xfs_ioend_force_cache_flush(
> + xfs_ioend_t *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int err = 0;
> + int datasync;
> +
> + datasync = !IS_SYNC(ioend->io_inode) &&
> + !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
> + err = do_xfs_file_fsync(ip, mp, datasync);
> + xfs_destroy_ioend(ioend);
> + /* do_xfs_file_fsync returns -errno. our caller expects positive. */
> + return -err;
> +}
> +
> /*
> * IO write completion.
> */
> @@ -250,12 +287,22 @@ xfs_end_io(
> error = xfs_setfilesize(ioend);
> if (error)
> ioend->io_error = -error;
> + } else if (ioend->io_needs_fsync) {
> + error = xfs_ioend_force_cache_flush(ioend);
> + if (error && ioend->io_result > 0)
> + ioend->io_error = -error;
> + ioend->io_needs_fsync = 0;
> } else {
> ASSERT(!xfs_ioend_is_append(ioend));
> }
>
> done:
> - xfs_destroy_ioend(ioend);
> + /* the honoring of O_SYNC has to be done last */
> + if (ioend->io_needs_fsync) {
> + atomic_inc(&ioend->io_remaining);
> + xfs_finish_ioend(ioend);
> + } else
> + xfs_destroy_ioend(ioend);
> }
Umm, I don't quite get why you do things the way you do in xfs_end_io().
Why not always handle the fsync there instead of offloading it to a
workqueue again in some cases? Is it so that it gets processed in the right
workqueue, or why?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-20 10:47:41

by Dave Chinner

Subject: Re: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

On Mon, Nov 19, 2012 at 11:41:38PM -0800, Darrick J. Wong wrote:
> Hi,
>
> Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
> function that can be used from the I/O post-processing path.
>
> From: Jeff Moyer <[email protected]>
> Signed-off-by: Jeff Moyer <[email protected]>
> ---
> fs/xfs/xfs_file.c | 44 +++++++++++++++++++++++++++++---------------
> fs/xfs/xfs_inode.h | 1 +
> 2 files changed, 30 insertions(+), 15 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index aa473fa..507f446 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -152,25 +152,18 @@ xfs_dir_fsync(
> return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL);
> }
>
> -STATIC int
> -xfs_file_fsync(
> - struct file *file,
> - loff_t start,
> - loff_t end,
> +/*
> + * Returns 0 on success, -errno on failure.
> + */
> +int
> +do_xfs_file_fsync(
> + struct xfs_inode *ip,
> + struct xfs_mount *mp,
> int datasync)

xfs_do_file_fsync()

And being an internal XFS function, it should return a positive
error number.

....

> +{
> + struct inode *inode = file->f_mapping->host;
> + struct xfs_inode *ip = XFS_I(inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int error = 0;
> +
> + trace_xfs_file_fsync(ip);
> +
> + error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + if (error)
> + return error;
> +
> + return do_xfs_file_fsync(ip, mp, datasync);

And be negated here.

> +}
> +
> STATIC ssize_t
> xfs_file_aio_read(
> struct kiocb *iocb,
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 94b32f9..d5bf105 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -546,6 +546,7 @@ do { \
> iput(VFS_I(ip)); \
> } while (0)
>
> +int do_xfs_file_fsync(struct xfs_inode *, struct xfs_mount *, int);
> #endif /* __KERNEL__ */

This should probably go in fs/xfs/xfs_vnodeops.h, like xfs_zero_eof, the
only other non-static function in fs/xfs/xfs_file.c.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-20 11:20:38

by Dave Chinner

Subject: Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

On Mon, Nov 19, 2012 at 11:51:14PM -0800, Darrick J. Wong wrote:
> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion for AIOs. This patch attempts to fix
> that problem by marking an I/O as requiring a cache flush in endio
> processing, and then issuing the cache flush after any unwritten extent
> conversion is done.
>
> From: Jeff Moyer <[email protected]>
> Signed-off-by: Jeff Moyer <[email protected]>
> [[email protected]: Rework patch to use per-mount workqueues]
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/xfs/xfs_aops.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/xfs/xfs_aops.h | 1 +
> fs/xfs/xfs_mount.h | 1 +
> fs/xfs/xfs_super.c | 8 ++++++++
> 4 files changed, 61 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e57e2da..9cebbb7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -173,6 +173,24 @@ xfs_setfilesize(
> }
>
> /*
> + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
> + * the disk cache when the I/O is complete.
> + */

That's not quite right - we need to fsync the metadata when the data
IO is complete. Block device/disk cache flushes are irrelevant at
this level as that is wholly encapsulated inside the metadata fsync
processing.

> +STATIC bool
> +xfs_ioend_needs_cache_flush(

xfs_ioend_need_fsync()

> + struct xfs_ioend *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> +
> + if (!(mp->m_flags & XFS_MOUNT_BARRIER))
> + return false;

And regardless of whether we have barriers enabled or not, we need
to flush the dirty metadata to the log for O_SYNC/O_DSYNC+AIO+DIO
writes here. So there should be no check of the mount flags here.

> + return IS_SYNC(ioend->io_inode) ||
> + (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
> +}
> +
> +/*
> * Schedule IO completion handling on the final put of an ioend.
> *
> * If there is no work to do we might as well call it a day and free the
> @@ -189,11 +207,30 @@ xfs_finish_ioend(
> queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> else if (ioend->io_append_trans)
> queue_work(mp->m_data_workqueue, &ioend->io_work);
> + else if (ioend->io_needs_fsync)
> + queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
> else
> xfs_destroy_ioend(ioend);
> }
> }
>
> +STATIC int
> +xfs_ioend_force_cache_flush(

STATIC void
xfs_ioend_fsync()

> + xfs_ioend_t *ioend)
> +{
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> + int err = 0;
> + int datasync;

trace_xfs_ioend_fsync(ip);

> + datasync = !IS_SYNC(ioend->io_inode) &&
> + !(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
> + err = do_xfs_file_fsync(ip, mp, datasync);
> + xfs_destroy_ioend(ioend);

I think this is wrong. The ioend is destroyed by the caller, so
putting it here turns all subsequent uses by the caller into
use-after-free memory corruption bugs.


> + /* do_xfs_file_fsync returns -errno. our caller expects positive. */
> + return -err;

This is why xfs_do_file_fsync() should be returning a positive
error - to avoid repeated juggling of error signs.

> +}
> +
> /*
> * IO write completion.
> */
> @@ -250,12 +287,22 @@ xfs_end_io(
> error = xfs_setfilesize(ioend);
> if (error)
> ioend->io_error = -error;
> + } else if (ioend->io_needs_fsync) {
> + error = xfs_ioend_force_cache_flush(ioend);
> + if (error && ioend->io_result > 0)
> + ioend->io_error = -error;
> + ioend->io_needs_fsync = 0;

So this is all use-after-free. Also, there's no need to clear
the io_needs_fsync flag as the ioend is about to be destroyed.

> } else {
> ASSERT(!xfs_ioend_is_append(ioend));
> }
>
> done:
> - xfs_destroy_ioend(ioend);
> + /* the honoring of O_SYNC has to be done last */
> + if (ioend->io_needs_fsync) {
> + atomic_inc(&ioend->io_remaining);
> + xfs_finish_ioend(ioend);
> + } else
> + xfs_destroy_ioend(ioend);

And requeuing work from one workqueue to the next is something that
we can avoid. We know at IO submission time (i.e.
xfs_vm_direct_IO()) whether an fsync completion is going to be needed
during IO completion. The ioend->io_needs_fsync flag can be set
then, and the first pass through xfs_finish_ioend() can queue it to
the correct workqueue. i.e. it only needs to be queued if it's not
already an unwritten or append ioend and it needs an fsync.

As it is, all the data completion workqueues run the same completion
function so all you need to do is handle the fsync case at the end
of the existing processing - it's not an else case. i.e the end of
xfs_end_io() becomes:

	if (ioend->io_needs_fsync) {
		error = xfs_ioend_fsync(ioend);
		if (error)
			ioend->io_error = -error;
		goto done;
	}
done:
	xfs_destroy_ioend(ioend);

As it is, this code is going to change before these changes go in -
there's a nasty regression in the DIO code that I found this
afternoon that requires reworking this IO completion logic to
avoid. The patch will appear on the list soon....

> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
> struct workqueue_struct *m_data_workqueue;
> struct workqueue_struct *m_unwritten_workqueue;
> struct workqueue_struct *m_cil_workqueue;
> + struct workqueue_struct *m_aio_blkdev_flush_wq;

struct workqueue_struct *m_aio_fsync_wq;

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-20 14:23:43

by Jeff Moyer

Subject: Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

"Darrick J. Wong" <[email protected]> writes:

> Hi everybody,
>
> On March 29th, Jeff Moyer posted to lkml a patchset with this note:
>
>> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
>> are sync'd before the I/O is actually issued (everybody else). The following
>> patch series fixes this in two parts. First, for the file systems that use
>> the generic routines, Jan has provided some generic infrastructure to perform
>> the syncs after the I/O is completed. Second, for those file systems which
>> require some endio processing of their own for O_DIRECT writes (xfs and
>> ext4), [Jeff] implemented file system specific syncing. This passes the
>> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
>> in the aio group. [Jeff] tested ext3, ext4, xfs, and btrfs only.
>
> Since the original post a few months ago, this patchset doesn't seem to have
> made any progress. An internal testing team here discovered that the issue
> also affects O_SYNC+AIO+DIO writes to block devices. Worse yet, since the
> flushes were being issued (and waited upon) directly in the io_submit call
> graph, the io_submit calls themselves would take a very long time to complete.
> Therefore, I added another patch to move the flush to the io_end processing.
>
> The blockdev patch was written by me. The ext4 patch had to be updated to
> accommodate a rework of the ext4 endio code that landed since March. Everything
> else has been passed through from Jeff's March 30th resend, with few changes.
>
> This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
>
> Comments and questions are, as always, welcome.

Hi, Darrick,

I just finished testing my version of this patch set on ext4 and xfs
(btrfs is next), but you beat me to the posting! Sorry, I should have
been more clear when I said I would see about refreshing the series.

How have you tested these patches? You just said "lightly", and I'm
afraid I don't know what that means. xfstests? You should at least run
them through test 113.

Would you like to push the set in, or should I post the patches I've
got? Doesn't matter to me, just let me know.

Cheers,
Jeff

2012-11-20 18:57:00

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

On Tue, Nov 20, 2012 at 09:23:18AM -0500, Jeff Moyer wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> > Hi everybody,
> >
> > On March 29th, Jeff Moyer posted to lkml a patchset with this note:
> >
> >> Currently, AIO+DIO+O_SYNC writes are not actually sync'd (for xfs), or they
> >> are sync'd before the I/O is actually issued (everybody else). The following
> >> patch series fixes this in two parts. First, for the file systems that use
> >> the generic routines, Jan has provided some generic infrastructure to perform
> >> the syncs after the I/O is completed. Second, for those file systems which
> >> require some endio processing of their own for O_DIRECT writes (xfs and
> >> ext4), [Jeff] implemented file system specific syncing. This passes the
> >> updated xfs-tests 113 test [Jeff] posted earlier, as well as all of the tests
> >> in the aio group. [Jeff] tested ext3, ext4, xfs, and btrfs only.
> >
> > Since the original post a few months ago, this patchset doesn't seem to have
> > made any progress. An internal testing team here discovered that the issue
> > also affects O_SYNC+AIO+DIO writes to block devices. Worse yet, since the
> > flushes were being issued (and waited upon) directly in the io_submit call
> > graph, the io_submit calls themselves would take a very long time to complete.
> > Therefore, I added another patch to move the flush to the io_end processing.
> >
> > The blockdev patch was written by me. The ext4 patch had to be updated to
> > accommodate a rework of the ext4 endio code that landed since March. Everything
> > else has been passed through from Jeff's March 30th resend, with few changes.
> >
> > This patchset has been tested (albeit lightly) against 3.7-rc6 on x64, with
> > ext4, xfs, btrfs, vfat, jfs, hfsplus, ext2, ext3, and raw block devices.
> >
> > Comments and questions are, as always, welcome.
>
> Hi, Darrick,
>
> I just finished testing my version of this patch set on ext4 and xfs
> (btrfs is next), but you beat me to the posting! Sorry, I should have
> been more clear when I said I would see about refreshing the series.
>
> How have you tested these patches? You just said "lightly", and I'm
> afraid I don't know what that means. xfstests? You should at least run
> them through test 113.

xfstest'd for ext4, but only sanity checked on the rest.

> Would you like to push the set in, or should I post the patches I've
> got? Doesn't matter to me, just let me know.

Since you're the original author of most of the patches in my set, would
you mind simply picking up my blockdev patch for your next submission? It
sounds like you're a little further along in the testing than I am anyway.

--D
>
> Cheers,
> Jeff

2012-11-20 19:05:54

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part

"Darrick J. Wong" <[email protected]> writes:

>> Would you like to push the set in, or should I post the patches I've
>> got? Doesn't matter to me, just let me know.
>
> Since you're the original author of most of the patches in my set anyway, would
> you mind simply picking up my blockdev patch for your next submission? It
> sounds like you're a little further along in the testing than I am anyway.

Not at all. I'll address the comments so far and get another version
out tomorrow.

Cheers,
Jeff

2012-11-20 19:43:06

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

Dave Chinner <[email protected]> writes:

> And requeuing work from one workqueue to the next is something that
> we can avoid. We know at IO submission time (i.e.
> xfs_vm_direct_IO()) whether an fsync completion is going to be needed
> during IO completion. The ioend->io_needs_fsync flag can be set
> then, and the first pass through xfs_finish_ioend() can queue it to
> the correct workqueue. i.e. it only needs to be queued if it's not
> already an unwritten or append ioend and it needs an fsync.
>
> As it is, all the data completion workqueues run the same completion
> function so all you need to do is handle the fsync case at the end
> of the existing processing - it's not an else case. i.e. the end of
> xfs_end_io() becomes:
>
>         if (ioend->io_needs_fsync) {
>                 error = xfs_ioend_fsync(ioend);
>                 if (error)
>                         ioend->io_error = -error;
>                 goto done;
>         }
> done:
>         xfs_destroy_ioend(ioend);

Works for me, that makes things simpler.

> As it is, this code is going to change before these changes go in -
> there's a nasty regression in the DIO code that I found this
> afternoon that requires reworking this IO completion logic to
> avoid. The patch will appear on the list soon....

I'm not on the xfs list, so if you haven't already sent it, mind Cc-ing
me?

>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
>> struct workqueue_struct *m_data_workqueue;
>> struct workqueue_struct *m_unwritten_workqueue;
>> struct workqueue_struct *m_cil_workqueue;
>> + struct workqueue_struct *m_aio_blkdev_flush_wq;
>
> struct workqueue_struct *m_aio_fsync_wq;

For the record, m_aio_blkdev_flush_wq is the name you chose previously.
;-)

Thanks for the review!

Cheers,
Jeff

2012-11-20 20:02:15

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

Jan Kara <[email protected]> writes:

>> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {
>> /* workqueue for dio unwritten */
>> struct workqueue_struct *dio_unwritten_wq;
>>
>> + /* workqueue for aio+dio+o_sync disk cache flushing */
>> + struct workqueue_struct *aio_dio_flush_wq;
>> +
> Umm, I'm not completely decided whether we really need a separate
> workqueue. But it doesn't cost too much so I guess it makes some sense -
> fsync() is rather heavy so syncing won't starve extent conversion...

I'm assuming you'd like me to convert the names from flush to fsync,
yes?

>> +
>> + /*
>> + * If we are running in nojournal mode, just flush the disk
>> + * cache and return.
>> + */
>> + if (!journal)
>> + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
> And this is wrong as well - you need to do work similar to what
> ext4_sync_file() does. Actually it would be *much* better if these two
> sites used the same helper function. Which also poses an interesting
> question about locking - do we need i_mutex or not? Forcing a transaction
> commit is definitely OK without it, and so is grabbing transaction ids
> from the inode or the ext4_should_journal_data() test. The __sync_inode() call seems
> to be OK without i_mutex as well so I believe we can just get rid of it
> (getting i_mutex from the workqueue is a locking nightmare we don't want to
> return to).
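
For reference, in 3.7 the nojournal branch of ext4_sync_file() does
roughly this:

        if (!journal) {
                ret = __sync_inode(inode, datasync);
                if (!ret && !hlist_empty(&inode->i_dentry))
                        ret = ext4_sync_parent(inode);
                goto out;
        }

so the common helper needs at least that much, not just a cache flush.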

Just to be clear, are you saying you would like me to remove the
mutex_lock/unlock pair from ext4_sync_file? (I had already factored out
the common code between this new code path and the fsync path in my tree.)

>> @@ -149,8 +209,11 @@ void ext4_add_complete_io(ext4_io_end_t *io_end)
>> struct workqueue_struct *wq;
>> unsigned long flags;
>>
>> - BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
>> - wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
>> + BUG_ON(!ext4_io_end_deferred(io_end));
>> + if (io_end->flag & EXT4_IO_END_UNWRITTEN)
>> + wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
>> + else
>> + wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;
> Umm, I'd prefer if we used aio_dio_flush_wq when EXT4_IO_END_NEEDS_SYNC
> is set. That way slow syncing works will be always offloaded to a separate
> workqueue.

OK.
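
So ext4_add_complete_io() would pick the workqueue off the sync flag
instead - something like this sketch, using the EXT4_IO_END_NEEDS_SYNC
flag from earlier in the series:

        if (io_end->flag & EXT4_IO_END_NEEDS_SYNC)
                wq = EXT4_SB(io_end->inode->i_sb)->aio_dio_flush_wq;
        else
                wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;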

Thanks!
Jeff

2012-11-20 20:08:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 4/9] xfs: honor the O_SYNC flag for asynchronous direct I/O requests

On Tue, Nov 20, 2012 at 02:42:48PM -0500, Jeff Moyer wrote:
> Dave Chinner <[email protected]> writes:
>
> > And requeuing work from one workqueue to the next is something that
> > we can avoid. We know at IO submission time (i.e.
> > xfs_vm_direct_IO()) whether an fsync completion is going to be needed
> > during IO completion. The ioend->io_needs_fsync flag can be set
> > then, and the first pass through xfs_finish_ioend() can queue it to
> > the correct workqueue. i.e. it only needs to be queued if it's not
> > already an unwritten or append ioend and it needs an fsync.
> >
> > As it is, all the data completion workqueues run the same completion
> > function so all you need to do is handle the fsync case at the end
> > of the existing processing - it's not an else case. i.e. the end of
> > xfs_end_io() becomes:
> >
> >         if (ioend->io_needs_fsync) {
> >                 error = xfs_ioend_fsync(ioend);
> >                 if (error)
> >                         ioend->io_error = -error;
> >                 goto done;
> >         }
> > done:
> >         xfs_destroy_ioend(ioend);
>
> Works for me, that makes things simpler.
>
> > As it is, this code is going to change before these changes go in -
> > there's a nasty regression in the DIO code that I found this
> > afternoon that requires reworking this IO completion logic to
> > avoid. The patch will appear on the list soon....
>
> I'm not on the xfs list, so if you haven't already sent it, mind Cc-ing
> me?

http://oss.sgi.com/archives/xfs/2012-11/msg00493.html

First cut is here, but it will change a bit as review goes on...

> >> --- a/fs/xfs/xfs_mount.h
> >> +++ b/fs/xfs/xfs_mount.h
> >> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
> >> struct workqueue_struct *m_data_workqueue;
> >> struct workqueue_struct *m_unwritten_workqueue;
> >> struct workqueue_struct *m_cil_workqueue;
> >> + struct workqueue_struct *m_aio_blkdev_flush_wq;
> >
> > struct workqueue_struct *m_aio_fsync_wq;
>
> For the record, m_aio_blkdev_flush_wq is the name you chose previously.
> ;-)

<cue Led Zeppelin>

It's been a long time since I read that patch....

:)

Cheers,

Dave....
--
Dave Chinner
[email protected]

2012-11-20 20:48:08

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

Jan Kara <[email protected]> writes:

> On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:
>> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
>> flag so that flushes are issued /after/ the write completes, not before.
>>
>> Note, however, that for block devices, the DIO setup code ensures that a flush
>> wq is attached to the superblock of the bdevfs filesystem, not the filesystem
>> that the device node happens to reside in. This means that unlike regular
>> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb. Therefore, adjust
>> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.
>>
>> Signed-off-by: Darrick J. Wong <[email protected]>
>> ---
>> fs/block_dev.c | 5 +++--
>> fs/direct-io.c | 3 ++-
>> 2 files changed, 5 insertions(+), 3 deletions(-)
>>
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 1a1e5e3..05ff33a 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>> struct inode *inode = file->f_mapping->host;
>>
>> return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
>> - nr_segs, blkdev_get_blocks, NULL, NULL, 0);
>> + nr_segs, blkdev_get_blocks, NULL, NULL,
>> + DIO_SYNC_WRITES);
>> }
>>
>> int __sync_blockdev(struct block_device *bdev, int wait)
>> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
>> percpu_down_read(&bdev->bd_block_size_semaphore);
>>
>> ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>> - if (ret > 0 || ret == -EIOCBQUEUED) {
>> + if (ret > 0) {
>> ssize_t err;
>>
>> err = generic_write_sync(file, pos, ret);
>> diff --git a/fs/direct-io.c b/fs/direct-io.c
>> index b7391d4..c626c43 100644
>> --- a/fs/direct-io.c
>> +++ b/fs/direct-io.c
>> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
>> work->ret = ret;
>> work->offset = offset;
>> work->len = bytes;
>> - queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
>> + queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
>> + &work->work);
> This should be folded into the original patch introducing the
> s_dio_flush_wq. And please add a comment before this line saying that block
> devices need a dereference exactly like this... Otherwise the patch looks
> good so you can add:
> Reviewed-by: Jan Kara <[email protected]>

When you say, "This," do you mean that one change, or the whole patch?

Cheers,
Jeff

2012-11-21 00:56:30

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

On Tue 20-11-12 15:02:15, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> >> @@ -1279,6 +1280,9 @@ struct ext4_sb_info {
> >> /* workqueue for dio unwritten */
> >> struct workqueue_struct *dio_unwritten_wq;
> >>
> >> + /* workqueue for aio+dio+o_sync disk cache flushing */
> >> + struct workqueue_struct *aio_dio_flush_wq;
> >> +
> > Umm, I'm not completely decided whether we really need a separate
> > workqueue. But it doesn't cost too much so I guess it makes some sense -
> > fsync() is rather heavy so syncing won't starve extent conversion...
>
> I'm assuming you'd like me to convert the names from flush to fsync,
> yes?
Would be nicer, yes.

> >> +
> >> + /*
> >> + * If we are running in nojournal mode, just flush the disk
> >> + * cache and return.
> >> + */
> >> + if (!journal)
> >> + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_NOIO, NULL);
> > And this is wrong as well - you need to do work similar to what
> > ext4_sync_file() does. Actually it would be *much* better if these two
> > sites used the same helper function. Which also poses an interesting
> > question about locking - do we need i_mutex or not? Forcing a transaction
> > commit is definitely OK without it, and so is grabbing transaction ids
> > from the inode or the ext4_should_journal_data() test. The __sync_inode() call seems
> > to be OK without i_mutex as well so I believe we can just get rid of it
> > (getting i_mutex from the workqueue is a locking nightmare we don't want to
> > return to).
>
> Just to be clear, are you saying you would like me to remove the
> mutex_lock/unlock pair from ext4_sync_file? (I had already factored out
> the common code between this new code path and the fsync path in my tree.)
Yes, after some thinking I came to that conclusion. We actually need to
keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the
rest doesn't need it. The change should definitely be a separate patch, just
in case there's something subtle I missed and we need to bisect in the
future... I've attached a patch for that so that blame for bugs goes my way
;) Compile tested only so far. I'll give it some more testing overnight.
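
The shape of the change is roughly this (a sketch - the attached patch
is the real thing):

        /* ext4_flush_unwritten_io() still needs i_mutex to avoid livelocks */
        mutex_lock(&inode->i_mutex);
        ret = ext4_flush_unwritten_io(inode);
        mutex_unlock(&inode->i_mutex);
        if (ret < 0)
                goto out;
        /*
         * Forcing the transaction commit, __sync_inode(), and the
         * cache flush can all run without i_mutex.
         */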

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-ext4-Reduce-i_mutex-usage-in-ext4_file_sync.patch (1.82 kB)

2012-11-21 00:57:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly

On Tue 20-11-12 15:47:54, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > On Mon 19-11-12 23:51:15, Darrick J. Wong wrote:
> >> When performing O_SYNC+AIO+DIO writes to block devices, use the DIO_SYNC_WRITES
> >> flag so that flushes are issued /after/ the write completes, not before.
> >>
> >> Note, however, that for block devices, the DIO setup code ensures that a flush
> >> wq is attached to the superblock of the bdevfs filesystem, not the filesystem
> >> that the device node happens to reside in. This means that unlike regular
> >> files, iocb->ki_filp->f_mapping->host->i_sb != inode->i_sb. Therefore, adjust
> >> Jeff's earlier patch to keep the pointer use consistent and avoid a NULL deref.
> >>
> >> Signed-off-by: Darrick J. Wong <[email protected]>
> >> ---
> >> fs/block_dev.c | 5 +++--
> >> fs/direct-io.c | 3 ++-
> >> 2 files changed, 5 insertions(+), 3 deletions(-)
> >>
> >>
> >> diff --git a/fs/block_dev.c b/fs/block_dev.c
> >> index 1a1e5e3..05ff33a 100644
> >> --- a/fs/block_dev.c
> >> +++ b/fs/block_dev.c
> >> @@ -235,7 +235,8 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> >> struct inode *inode = file->f_mapping->host;
> >>
> >> return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
> >> - nr_segs, blkdev_get_blocks, NULL, NULL, 0);
> >> + nr_segs, blkdev_get_blocks, NULL, NULL,
> >> + DIO_SYNC_WRITES);
> >> }
> >>
> >> int __sync_blockdev(struct block_device *bdev, int wait)
> >> @@ -1631,7 +1632,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
> >> percpu_down_read(&bdev->bd_block_size_semaphore);
> >>
> >> ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
> >> - if (ret > 0 || ret == -EIOCBQUEUED) {
> >> + if (ret > 0) {
> >> ssize_t err;
> >>
> >> err = generic_write_sync(file, pos, ret);
> >> diff --git a/fs/direct-io.c b/fs/direct-io.c
> >> index b7391d4..c626c43 100644
> >> --- a/fs/direct-io.c
> >> +++ b/fs/direct-io.c
> >> @@ -258,7 +258,8 @@ void generic_dio_end_io(struct kiocb *iocb, loff_t offset, ssize_t bytes,
> >> work->ret = ret;
> >> work->offset = offset;
> >> work->len = bytes;
> >> - queue_work(inode->i_sb->s_dio_flush_wq, &work->work);
> >> + queue_work(iocb->ki_filp->f_mapping->host->i_sb->s_dio_flush_wq,
> >> + &work->work);
> > This should be folded into the original patch introducing the
> > s_dio_flush_wq. And please add a comment before this line saying that block
> > devices need a dereference exactly like this... Otherwise the patch looks
> > good so you can add:
> > Reviewed-by: Jan Kara <[email protected]>
>
> When you say, "This," do you mean that one change, or the whole patch?
I meant the whole patch, after that hunk is folded in ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-21 10:08:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

On Mon, Nov 19, 2012 at 11:41:23PM -0800, Darrick J. Wong wrote:
> Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystems wanting to
> use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO. If the
> filesystem doesn't provide its own direct IO end_io handler, the generic code
> will take care of issuing the flush. Otherwise, the filesystem's custom end_io
> handler is passed struct dio_sync_io_work pointer as 'private' argument, and it
> must call generic_dio_end_io() to finish the AIO DIO. The generic code then
> takes care to call generic_write_sync() from a workqueue context when AIO DIO
> is complete.
>
> Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
> and the generic one suffices for them, make blockdev_direct_IO() pass the new
> DIO_SYNC_WRITES flag.

I'd like to use this as a vehicle to revisit how dio completions work.
Now that the generic code has a reason to defer aio completions to a
workqueue, can we maybe move the whole offload-to-a-workqueue code into
the direct-io code instead of reimplementing it in ext4 and xfs?

From a simplicity point of view I'd love to do it unconditionally, but I
also remember that this was causing performance regressions on important
workloads. So maybe we just need a flag in the dio structure, with a way
that the get_blocks callback can communicate that it's needed.
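
One possible shape for that flag (hypothetical names - a buffer_head
state bit that the get_blocks callback sets and the dio core latches):

        /* in the filesystem's get_blocks callback (hypothetical helper) */
        set_buffer_defer_completion(bh_result);

        /* in fs/direct-io.c, once get_blocks returns (sketch) */
        if (buffer_defer_completion(map_bh))
                dio->defer_completion = 1;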

For the specific case of O_(D)SYNC aio this would also allow calling
->fsync from generic code instead of the filesystems having to
reimplement this.

> + if (dio->sync_work)
> + private = dio->sync_work;
> + else
> + private = dio->private;
> +
> dio->end_io(dio->iocb, offset, transferred,
> - dio->private, ret, is_async);
> + private, ret, is_async);

Eww. I'd be much happier to add a new argument than having two
different members passed as the private argument.
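
i.e. something like (sketch, sync_work passed as its own argument):

        dio->end_io(dio->iocb, offset, transferred, dio->private,
                    dio->sync_work, ret, is_async);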

Maybe it's even time to bite the bullet and make struct dio public
and pass that to the end_io argument as well as generic_dio_end_io.

> + /* No IO submitted? Skip syncing... */
> + if (!dio->result && dio->sync_work) {
> + kfree(dio->sync_work);
> + dio->sync_work = NULL;
> + }
> + generic_dio_end_io(dio->iocb, offset, transferred,
> + dio->sync_work, ret, is_async);


Any reason the check above isn't done inside of generic_dio_end_io?

> +static noinline int dio_create_flush_wq(struct super_block *sb)
> +{
> + struct workqueue_struct *wq =
> + alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
> +
> + if (!wq)
> + return -ENOMEM;
> + /*
> + * Atomically put workqueue in place. Release our one in case someone
> + * else won the race and attached workqueue to superblock.
> + */
> + if (cmpxchg(&sb->s_dio_flush_wq, NULL, wq))
> + destroy_workqueue(wq);
> + return 0;

Eww. Workqueues are cheap, just create it on bootup instead of this
ugliness. Also I don't really see any reason to make it per-fs instead
of global.
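
i.e. roughly (a sketch, folding it into the existing dio_init() in
fs/direct-io.c):

        /* sketch only: one global workqueue, created at boot */
        static struct workqueue_struct *dio_sync_wq;

        static __init int dio_init(void)
        {
                dio_cache = KMEM_CACHE(dio, SLAB_PANIC);
                dio_sync_wq = alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
                if (!dio_sync_wq)
                        return -ENOMEM;
                return 0;
        }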

2012-11-21 10:09:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

On Mon, Nov 19, 2012 at 11:41:38PM -0800, Darrick J. Wong wrote:
> Hi,
>
> Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
> function that can be used from the I/O post-processing path.

Why would we need to skip the filemap_write_and_wait_range call here?
If we're doing direct I/O we should not have any pages in this region
anyway. You're also not skipping it in the generic implementation as
far as I can see, so I see no point in doing it just in XFS.

2012-11-21 14:09:58

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

Jan Kara <[email protected]> writes:

>> Just to be clear, are you saying you would like me to remove the
>> mutex_lock/unlock pair from ext4_sync_file? (I had already factored out
>> the common code between this new code path and the fsync path in my tree.)
> Yes, after some thinking I came to that conclusion. We actually need to
> keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the
> rest doesn't need it. The change should definitely be a separate patch, just
> in case there's something subtle I missed and we need to bisect in the
> future... I've attached a patch for that so that blame for bugs goes my way
> ;) Compile tested only so far. I'll give it some more testing overnight.

Great, thanks Jan! I'll include this in the next posting.

Cheers,
Jeff

2012-11-21 15:04:32

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync

Christoph Hellwig <[email protected]> writes:

> On Mon, Nov 19, 2012 at 11:41:38PM -0800, Darrick J. Wong wrote:
>> Hi,
>>
>> Fsyncing is tricky business, so factor out the bits of the xfs_file_fsync
>> function that can be used from the I/O post-processing path.
>
> Why would we need to skip the filemap_write_and_wait_range call here?
> If we're doing direct I/O we should not have any pages in this region
> anyway. You're also not skipping it in the generic implementation as
> far as I can see, so I see no point in doing it just in XFS.

OK, I'll fix that.

Thanks!
Jeff

2012-11-21 16:54:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] ext4: honor the O_SYNC flag for asynchronous direct I/O requests

On Wed 21-11-12 09:09:41, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> >> Just to be clear, are you saying you would like me to remove the
> >> mutex_lock/unlock pair from ext4_sync_file? (I had already factored out
> >> the common code between this new code path and the fsync path in my tree.)
> > Yes, after some thinking I came to that conclusion. We actually need to
> > keep i_mutex around ext4_flush_unwritten_io() to avoid livelocks but the
> > rest doesn't need it. The change should definitely be a separate patch, just
> > in case there's something subtle I missed and we need to bisect in the
> > future... I've attached a patch for that so that blame for bugs goes my way
> > ;) Compile tested only so far. I'll give it some more testing overnight.
>
> Great, thanks Jan! I'll include this in the next posting.
OK, patch passed xfstests and a test banging one file with random IO and
fsyncs from 8 processes (in data=ordered, data=journal, and nojournal
modes). So it seems I didn't miss anything substantial. So ship it! ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-21 16:58:05

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

Christoph Hellwig <[email protected]> writes:

> On Mon, Nov 19, 2012 at 11:41:23PM -0800, Darrick J. Wong wrote:
>> Provide VFS helpers for handling O_SYNC AIO DIO writes. Filesystems wanting to
>> use the helpers have to pass DIO_SYNC_WRITES to __blockdev_direct_IO. If the
>> filesystem doesn't provide its own direct IO end_io handler, the generic code
>> will take care of issuing the flush. Otherwise, the filesystem's custom end_io
>> handler is passed struct dio_sync_io_work pointer as 'private' argument, and it
>> must call generic_dio_end_io() to finish the AIO DIO. The generic code then
>> takes care to call generic_write_sync() from a workqueue context when AIO DIO
>> is complete.
>>
>> Since all filesystems using blockdev_direct_IO() need O_SYNC aio dio handling
>> and the generic one suffices for them, make blockdev_direct_IO() pass the new
>> DIO_SYNC_WRITES flag.
>
> I'd like to use this as a vehicle to revisit how dio completions work.

I don't like the sound of that. ;-) It sounds like this bugfix may get
further delayed by the desire for unrelated code cleanup.

> Now that the generic code has a reason to defer aio completions to a
> workqueue, can we maybe move the whole offload-to-a-workqueue code into
> the direct-io code instead of reimplementing it in ext4 and xfs?

On the surface, I don't see a problem with that.

> From a simplicity point of view I'd love to do it unconditionally, but I
> also remember that this was causing performance regressions on important
> workloads. So maybe we just need a flag in the dio structure, with a way
> that the get_blocks callback can communicate that it's needed.

Yeah, adding context switches to the normal io completion path is a
non-starter.

> For the specific case of O_(D)SYNC aio this would also allow calling
> ->fsync from generic code instead of the filesystems having to
> reimplement this.

This is the only reason I'd even consider such a cleanup for this
series. Alas, I don't find it compelling enough to do the work.

>> + if (dio->sync_work)
>> + private = dio->sync_work;
>> + else
>> + private = dio->private;
>> +
>> dio->end_io(dio->iocb, offset, transferred,
>> - dio->private, ret, is_async);
>> + private, ret, is_async);
>
> Eww. I'd be much happier to add a new argument than having two
> different members passed as the private argument.

OK.

> Maybe it's even time to bite the bullet and make struct dio public
> and pass that to the end_io argument as well as generic_dio_end_io.

But I don't agree with that. Really, nothing needs to know about the
struct dio outside of fs/direct-io.c.

>> + /* No IO submitted? Skip syncing... */
>> + if (!dio->result && dio->sync_work) {
>> + kfree(dio->sync_work);
>> + dio->sync_work = NULL;
>> + }
>> + generic_dio_end_io(dio->iocb, offset, transferred,
>> + dio->sync_work, ret, is_async);
>
>
> Any reason the check above isn't done inside of generic_dio_end_io?

Jan? It does seem as though it might make more sense to do the check in
generic_dio_end_io.

>> +static noinline int dio_create_flush_wq(struct super_block *sb)
>> +{
>> + struct workqueue_struct *wq =
>> + alloc_workqueue("dio-sync", WQ_UNBOUND, 1);
>> +
>> + if (!wq)
>> + return -ENOMEM;
>> + /*
>> + * Atomically put workqueue in place. Release our one in case someone
>> + * else won the race and attached workqueue to superblock.
>> + */
>> + if (cmpxchg(&sb->s_dio_flush_wq, NULL, wq))
>> + destroy_workqueue(wq);
>> + return 0;
>
> Eww. Workqueues are cheap, just create it on bootup instead of this
> ugliness. Also I don't really see any reason to make it per-fs instead
> of global.

I would prefer to keep it per-fs. Consider the possibility of sync
work for your database device getting backed up behind sync work for your
root file system. So, given my preference to keep it per-fs, would you
rather the workqueues get created at mount time?

Cheers,
Jeff

2012-11-21 18:29:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

On Wed, Nov 21, 2012 at 11:58:05AM -0500, Jeff Moyer wrote:
> > I'd like to use this as a vehicle to revisit how dio completions work.
>
> I don't like the sound of that. ;-) It sounds like this bugfix may get
> further delayed by the desire for unrelated code cleanup.

I've got a prototype that isn't much more invasive than the current
series. I'll finish it up and run it through QA and will post it
tomorrow.


2012-11-21 18:38:55

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

Christoph Hellwig <[email protected]> writes:

> On Wed, Nov 21, 2012 at 11:58:05AM -0500, Jeff Moyer wrote:
>> > I'd like to use this as a vehicle to revisit how dio completions work.
>>
>> I don't like the sound of that. ;-) It sounds like this bugfix may get
>> further delayed by the desire for unrelated code cleanup.
>
> I've got a prototype that isn't much more invasive than the current
> series. I'll finish it up and run it through QA and will post it
> tomorrow.

Works for me. Thanks!

-Jeff

2012-11-21 19:33:03

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 7/9] ocfs2: Use generic handlers of O_SYNC AIO DIO

On Mon, Nov 19, 2012 at 11:51:14PM -0800, Darrick J. Wong wrote:
> Use generic handlers to queue fsync() when AIO DIO is completed for O_SYNC
> file.
>
> From: Jan Kara <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> Signed-off-by: Jeff Moyer <[email protected]>
Acked-by: Joel Becker <[email protected]>

> ---
> fs/ocfs2/aops.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 6577432..60457cc 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -593,9 +593,7 @@ static void ocfs2_dio_end_io(struct kiocb *iocb,
> level = ocfs2_iocb_rw_locked_level(iocb);
> ocfs2_rw_unlock(inode, level);
>
> - if (is_async)
> - aio_complete(iocb, ret, 0);
> - inode_dio_done(inode);
> + generic_dio_end_io(iocb, offset, bytes, private, ret, is_async);
> }
>
> /*
> @@ -642,7 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
> return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
> iov, offset, nr_segs,
> ocfs2_direct_IO_get_blocks,
> - ocfs2_dio_end_io, NULL, 0);
> + ocfs2_dio_end_io, NULL, DIO_SYNC_WRITES);
> }
>
> static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
>
>

--

"Hell is oneself, hell is alone, the other figures in it, merely projections."
- T. S. Eliot

http://www.jlbec.org/
[email protected]

2012-11-21 21:37:32

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly

On Wed 21-11-12 13:29:01, Christoph Hellwig wrote:
> On Wed, Nov 21, 2012 at 11:58:05AM -0500, Jeff Moyer wrote:
> > > I'd like to use this as a vehicle to revisit how dio completions work.
> >
> > I don't like the sound of that. ;-) It sounds like this bugfix may get
> > further delayed by the desire for unrelated code cleanup.
>
> I've got a prototype that isn't much more invasive than the current
> series. I'll finish it up and run it through QA and will post it
> tomorrow.
Great, I wanted to have a look into it but you beat me to it ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

