2012-02-27 21:19:56

by Dave Kleikamp

Subject: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages

This patchset was begun by Zach Brown and was originally submitted for
review in October 2009. Feedback was positive, and I have picked up
where he left off, porting his patches to 3.3-rc4 and adding support
for ext4, btrfs, and nfs.

http://www.spinics.net/lists/linux-fsdevel/msg27514.html

This patch series adds a kernel interface to fs/aio.c so that kernel code can
issue concurrent asynchronous I/O to file systems. It adds an aio command and
file system methods which specify I/O memory with pages instead of userspace
addresses.

This series was written to reduce the overhead loop currently imposes by
performing synchronous buffered file system I/O from a kernel thread. These
patches turn loop into a lightweight layer that translates bios into iocbs.
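
For illustration, submission from a kernel caller ends up looking
roughly like this (a simplified sketch of the loop conversion later in
this series, with error handling trimmed):

	struct bio_vec *bvec = bio_iovec_idx(bio, bio->bi_idx);
	size_t nr_segs = bio_segments(bio);
	struct iov_iter iter;
	struct kiocb *iocb;

	iocb = aio_kernel_alloc(GFP_NOIO);
	/* describe the bio's pages directly; no userspace addresses */
	iov_iter_init_bvec(&iter, bvec, nr_segs,
			   bvec_length(bvec, nr_segs), 0);
	aio_kernel_init_iter(iocb, file, IOCB_CMD_WRITE_ITER, &iter, pos);
	aio_kernel_init_callback(iocb, lo_rw_aio_complete, (u64)bio);
	ret = aio_kernel_submit(iocb);	/* completion runs the callback */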

Thanks,
Shaggy

Dave Kleikamp (4):
fuse: convert fuse to use iov_iter_copy_[to|from]_user
ext4: add support for read_iter, write_iter, and direct_IO_bvec
btrfs: add support for read_iter, write_iter, and direct_IO_bvec
nfs: add support for read_iter, write_iter

Zach Brown (18):
iov_iter: move into its own file
iov_iter: add copy_to_user support
iov_iter: hide iovec details behind ops function pointers
iov_iter: add bvec support
iov_iter: add a shorten call
iov_iter: let callers extract iovecs and bio_vecs
dio: create a dio_aligned() helper function
dio: add dio_alloc_init() helper function
dio: add sdio_init() helper function
dio: add dio_lock_and_flush() helper
dio: add dio_post_submission() helper function
dio: add __blockdev_direct_IO_bdev()
fs: pull iov_iter use higher up the stack
aio: add aio_kernel_() interface
aio: add aio support for iov_iter arguments
bio: add bvec_length(), like iov_length()
ext3: add support for .read_iter and .write_iter
ocfs2: add support for read_iter, write_iter, and direct_IO_bvec

 drivers/block/loop.c    |  55 ++++-
 fs/aio.c                | 156 +++++++++++++++
 fs/btrfs/file.c         |   2 +
 fs/btrfs/inode.c        | 116 +++++++----
 fs/direct-io.c          | 435 ++++++++++++++++++++++++++--------------
 fs/ext3/file.c          |   2 +
 fs/ext3/inode.c         | 149 +++++++++-----
 fs/ext4/ext4.h          |   3 +
 fs/ext4/file.c          |   2 +
 fs/ext4/indirect.c      | 169 ++++++++++++----
 fs/ext4/inode.c         | 206 ++++++++++++-------
 fs/fuse/file.c          |  29 +--
 fs/nfs/direct.c         | 508 ++++++++++++++++++++++++++++++++++++++---------
 fs/nfs/file.c           |  80 ++++++++
 fs/ocfs2/aops.c         |  31 +++
 fs/ocfs2/file.c         |  82 +++++---
 fs/ocfs2/ocfs2_trace.h  |   6 +-
 include/linux/aio.h     |  14 ++
 include/linux/aio_abi.h |   2 +
 include/linux/bio.h     |   8 +
 include/linux/fs.h      | 139 ++++++++++++-
 include/linux/loop.h    |   1 +
 include/linux/nfs_fs.h  |   4 +
 mm/Makefile             |   2 +-
 mm/filemap.c            | 405 +++++++++++++++++--------------
 mm/iov-iter.c           | 377 +++++++++++++++++++++++++++++++++++
 26 files changed, 2262 insertions(+), 721 deletions(-)
 create mode 100644 mm/iov-iter.c

--
1.7.9.2


2012-02-27 21:20:22

by Dave Kleikamp

Subject: [RFC PATCH 02/22] iov_iter: add copy_to_user support

From: Zach Brown <[email protected]>

This adds iov_iter wrappers around copy_to_user() to match the existing
wrappers around copy_from_user().

This will be used by the generic file system buffered read path.
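
For context, the buffered read path converted later in this series
(patch 14) ends up calling the new wrapper from its read actor; the
core of that actor is roughly:

	copied = iov_iter_copy_to_user(page, iter, offset, size);
	if (copied < size)
		desc->error = -EFAULT;
	iov_iter_advance(iter, copied);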

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
 include/linux/fs.h |  4 +++
 mm/iov-iter.c      | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 386da09..c66aa4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -535,6 +535,10 @@ struct iov_iter {
size_t count;
};

+size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
size_t iov_iter_copy_from_user(struct page *page,
diff --git a/mm/iov-iter.c b/mm/iov-iter.c
index 596fcf0..eea21ea 100644
--- a/mm/iov-iter.c
+++ b/mm/iov-iter.c
@@ -6,6 +6,84 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>

+static size_t __iovec_copy_to_user_inatomic(char *vaddr,
+ const struct iovec *iov, size_t base, size_t bytes)
+{
+ size_t copied = 0, left = 0;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + base;
+ int copy = min(bytes, iov->iov_len - base);
+
+ base = 0;
+ left = __copy_to_user_inatomic(buf, vaddr, copy);
+ copied += copy;
+ bytes -= copy;
+ vaddr += copy;
+ iov++;
+
+ if (unlikely(left))
+ break;
+ }
+ return copied - left;
+}
+
+/*
+ * Copy as much as we can from the page to user space and return the number
+ * of bytes which were successfully copied. If a fault is encountered then
+ * return the number of bytes which were copied before the fault.
+ */
+size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ BUG_ON(!in_atomic());
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_to_user_inatomic(buf, kaddr + offset, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_to_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap_atomic(kaddr, KM_USER0);
+
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_to_user_atomic);
+
+/*
+ * This has the same side effects and return value as
+ * iov_iter_copy_to_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = copy_to_user(buf, kaddr + offset, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_to_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap(page);
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_to_user);
+
+
static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
--
1.7.9.2

2012-02-27 21:20:34

by Dave Kleikamp

Subject: [RFC PATCH 14/22] fs: pull iov_iter use higher up the stack

From: Zach Brown <[email protected]>

Right now only callers of generic_perform_write() pack their iovec
arguments into an iov_iter structure. All the callers higher up in the
stack work on raw iovec arguments.

This patch introduces the use of the iov_iter abstraction higher up the
stack. Private generic path functions are changed to operate on an
iov_iter instead of on raw iovecs. Exported interfaces that take iovecs
immediately pack their arguments into an iov_iter and call into the
shared functions.

File operation methods are added which take an iov_iter argument so
that callers of the generic file system functions can specify abstract
memory rather than only iovec arrays.
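
A filesystem that is content with the generic paths can then opt in by
wiring up the new methods alongside the old ones, as the ext3 and ext4
patches later in this series do; a representative fragment:

	const struct file_operations ext3_file_operations = {
		...
		.aio_read	= generic_file_aio_read,
		.aio_write	= generic_file_aio_write,
		.read_iter	= generic_file_read_iter,
		.write_iter	= generic_file_write_iter,
		...
	};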

Almost all of this patch only transforms arguments and shouldn't change
functionality. The buffered read path is the exception. We add a
read_actor function which uses the iov_iter helper functions instead of
operating on each individual iovec element. This may improve
performance as the iov_iter helper can copy multiple iovec elements from
one mapped page cache page.

As always, the direct IO path is special. Sadly, it may still be
cleanest to have it work on the underlying memory structures directly
instead of working through the iov_iter abstraction.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
 include/linux/fs.h |  12 +++
 mm/filemap.c       | 261 +++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 190 insertions(+), 83 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94f2d0a..92727a9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1686,7 +1686,9 @@ struct file_operations {
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -2448,13 +2450,23 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *,
+ loff_t);
extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
loff_t *);
+extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t *);
extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t);
extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
+extern ssize_t generic_file_direct_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t, loff_t *, size_t);
extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern ssize_t generic_file_buffered_write_iter(struct kiocb *,
+ struct iov_iter *, loff_t, loff_t *, ssize_t);
extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
extern int generic_segment_checks(const struct iovec *iov,
diff --git a/mm/filemap.c b/mm/filemap.c
index 0533a71..7ce5bff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1381,31 +1381,56 @@ int generic_segment_checks(const struct iovec *iov,
}
EXPORT_SYMBOL(generic_segment_checks);

+static ssize_t mapping_direct_IO(struct address_space *mapping, int rw,
+ struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ if (iov_iter_has_iovec(iter))
+ return mapping->a_ops->direct_IO(rw, iocb, iov_iter_iovec(iter),
+ pos, iter->nr_segs);
+ else if (iov_iter_has_bvec(iter))
+ return mapping->a_ops->direct_IO_bvec(rw, iocb,
+ iov_iter_bvec(iter), pos,
+ iter->nr_segs);
+ else
+ BUG();
+}
+
+static int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size)
+{
+ struct iov_iter *iter = desc->arg.data;
+ unsigned long copied = 0;
+
+ if (size > desc->count)
+ size = desc->count;
+
+ copied = iov_iter_copy_to_user(page, iter, offset, size);
+ if (copied < size)
+ desc->error = -EFAULT;
+
+ iov_iter_advance(iter, copied);
+ desc->count -= copied;
+ desc->written += copied;
+
+ return copied;
+}
+
/**
- * generic_file_aio_read - generic filesystem read routine
+ * generic_file_read_iter - generic filesystem read routine
* @iocb: kernel I/O control block
- * @iov: io vector request
- * @nr_segs: number of segments in the iovec
+ * @iter: memory vector
* @pos: current file position
- *
- * This is the "read()" routine for all filesystems
- * that can use the page cache directly.
*/
ssize_t
-generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct file *filp = iocb->ki_filp;
- ssize_t retval;
- unsigned long seg = 0;
- size_t count;
+ read_descriptor_t desc;
+ ssize_t retval = 0;
+ size_t count = iov_iter_count(iter);
loff_t *ppos = &iocb->ki_pos;

- count = 0;
- retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
- if (retval)
- return retval;
-
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
loff_t size;
@@ -1419,13 +1444,13 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
size = i_size_read(inode);
if (pos < size) {
retval = filemap_write_and_wait_range(mapping, pos,
- pos + iov_length(iov, nr_segs) - 1);
+ pos + count - 1);
if (!retval) {
struct blk_plug plug;

blk_start_plug(&plug);
- retval = mapping->a_ops->direct_IO(READ, iocb,
- iov, pos, nr_segs);
+ retval = mapping_direct_IO(mapping, READ,
+ iocb, iter, pos);
blk_finish_plug(&plug);
}
if (retval > 0) {
@@ -1448,42 +1473,47 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
}
}

- count = retval;
- for (seg = 0; seg < nr_segs; seg++) {
- read_descriptor_t desc;
- loff_t offset = 0;
-
- /*
- * If we did a short DIO read we need to skip the section of the
- * iov that we've already read data into.
- */
- if (count) {
- if (count > iov[seg].iov_len) {
- count -= iov[seg].iov_len;
- continue;
- }
- offset = count;
- count = 0;
- }
-
- desc.written = 0;
- desc.arg.buf = iov[seg].iov_base + offset;
- desc.count = iov[seg].iov_len - offset;
- if (desc.count == 0)
- continue;
- desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
- retval += desc.written;
- if (desc.error) {
- retval = retval ?: desc.error;
- break;
- }
- if (desc.count > 0)
- break;
- }
+ desc.written = 0;
+ desc.arg.data = iter;
+ desc.count = count;
+ desc.error = 0;
+ do_generic_file_read(filp, ppos, &desc, file_read_iter_actor);
+ if (desc.written)
+ retval = desc.written;
+ else
+ retval = desc.error;
out:
return retval;
}
+EXPORT_SYMBOL(generic_file_read_iter);
+
+/**
+ * generic_file_aio_read - generic filesystem read routine
+ * @iocb: kernel I/O control block
+ * @iov: io vector request
+ * @nr_segs: number of segments in the iovec
+ * @pos: current file position
+ *
+ * This is the "read()" routine for all filesystems
+ * that can use the page cache directly.
+ */
+ssize_t
+generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct iov_iter iter;
+ int ret;
+ size_t count;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ return generic_file_read_iter(iocb, &iter, pos);
+}
EXPORT_SYMBOL(generic_file_aio_read);

static ssize_t
@@ -2116,9 +2146,8 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
EXPORT_SYMBOL(pagecache_write_end);

ssize_t
-generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long *nr_segs, loff_t pos, loff_t *ppos,
- size_t count, size_t ocount)
+generic_file_direct_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, loff_t *ppos, size_t count)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -2127,10 +2156,13 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
size_t write_len;
pgoff_t end;

- if (count != ocount)
- *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
+ if (count != iov_iter_count(iter)) {
+ written = iov_iter_shorten(iter, count);
+ if (written)
+ goto out;
+ }

- write_len = iov_length(iov, *nr_segs);
+ write_len = count;
end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;

written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
@@ -2157,7 +2189,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
+ written = mapping_direct_IO(mapping, WRITE, iocb, iter, pos);

/*
* Finally, try again to invalidate clean pages which might have been
@@ -2183,6 +2215,23 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
out:
return written;
}
+EXPORT_SYMBOL(generic_file_direct_write_iter);
+
+ssize_t
+generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long *nr_segs, loff_t pos, loff_t *ppos,
+ size_t count, size_t ocount)
+{
+ struct iov_iter iter;
+ ssize_t ret;
+
+ iov_iter_init(&iter, iov, *nr_segs, ocount, 0);
+ ret = generic_file_direct_write_iter(iocb, &iter, pos, ppos, count);
+ /* generic_file_direct_write_iter() might have shortened the vec */
+ if (*nr_segs != iter.nr_segs)
+ *nr_segs = iter.nr_segs;
+ return ret;
+}
EXPORT_SYMBOL(generic_file_direct_write);

/*
@@ -2314,16 +2363,13 @@ again:
}

ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, loff_t *ppos,
- size_t count, ssize_t written)
+generic_file_buffered_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, loff_t *ppos, ssize_t written)
{
struct file *file = iocb->ki_filp;
ssize_t status;
- struct iov_iter i;

- iov_iter_init(&i, iov, nr_segs, count, written);
- status = generic_perform_write(file, &i, pos);
+ status = generic_perform_write(file, iter, pos);

if (likely(status >= 0)) {
written += status;
@@ -2332,13 +2378,24 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,

return written ? written : status;
}
+EXPORT_SYMBOL(generic_file_buffered_write_iter);
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos, loff_t *ppos,
+ size_t count, ssize_t written)
+{
+ struct iov_iter iter;
+ iov_iter_init(&iter, iov, nr_segs, count, written);
+ return generic_file_buffered_write_iter(iocb, &iter, pos, ppos,
+ written);
+}
EXPORT_SYMBOL(generic_file_buffered_write);

/**
* __generic_file_aio_write - write data to a file
* @iocb: IO state structure (file, offset, etc.)
- * @iov: vector with data to write
- * @nr_segs: number of segments in the vector
+ * @iter: iov_iter specifying memory to write
* @ppos: position where to write
*
* This function does all the work needed for actually writing data to a
@@ -2353,24 +2410,18 @@ EXPORT_SYMBOL(generic_file_buffered_write);
* A caller has to handle it. This is mainly due to the fact that we want to
* avoid syncing under i_mutex.
*/
-ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
+ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t *ppos)
{
struct file *file = iocb->ki_filp;
struct address_space * mapping = file->f_mapping;
- size_t ocount; /* original count */
size_t count; /* after file limit checks */
struct inode *inode = mapping->host;
loff_t pos;
ssize_t written;
ssize_t err;

- ocount = 0;
- err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
- if (err)
- return err;
-
- count = ocount;
+ count = iov_iter_count(iter);
pos = *ppos;

vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
@@ -2397,8 +2448,8 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
loff_t endbyte;
ssize_t written_buffered;

- written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
- ppos, count, ocount);
+ written = generic_file_direct_write_iter(iocb, iter, pos,
+ ppos, count);
if (written < 0 || written == count)
goto out;
/*
@@ -2407,9 +2458,9 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
*/
pos += written;
count -= written;
- written_buffered = generic_file_buffered_write(iocb, iov,
- nr_segs, pos, ppos, count,
- written);
+ iov_iter_advance(iter, written);
+ written_buffered = generic_file_buffered_write_iter(iocb, iter,
+ pos, ppos, written);
/*
* If generic_file_buffered_write() retuned a synchronous error
* then we want to return the number of bytes which were
@@ -2441,13 +2492,57 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
*/
}
} else {
- written = generic_file_buffered_write(iocb, iov, nr_segs,
- pos, ppos, count, written);
+ iter->count = count;
+ written = generic_file_buffered_write_iter(iocb, iter,
+ pos, ppos, written);
}
out:
current->backing_dev_info = NULL;
return written ? written : err;
}
+EXPORT_SYMBOL(__generic_file_write_iter);
+
+ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+
+ mutex_lock(&inode->i_mutex);
+ ret = __generic_file_write_iter(iocb, iter, &iocb->ki_pos);
+ mutex_unlock(&inode->i_mutex);
+
+ if (ret > 0 || ret == -EIOCBQUEUED) {
+ ssize_t err;
+
+ err = generic_write_sync(file, pos, ret);
+ if (err < 0 && ret > 0)
+ ret = err;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(generic_file_write_iter);
+
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct iov_iter iter;
+ size_t count;
+ int ret;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+ if (ret)
+ goto out;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ ret = __generic_file_write_iter(iocb, &iter, ppos);
+out:
+ return ret;
+}
EXPORT_SYMBOL(__generic_file_aio_write);

/**
--
1.7.9.2

2012-02-27 21:20:41

by Dave Kleikamp

Subject: [RFC PATCH 18/22] ext3: add support for .read_iter and .write_iter

From: Zach Brown <[email protected]>

ext3 uses the generic .read_iter and .write_iter functions. The loop
driver is also updated here to prefer the new methods, issuing aio
with the bio's pages when the backing file provides them.

ext3_direct_IO() is refactored into helpers which are called by the
.direct_IO{,_bvec}() methods, which in turn call
__blockdev_direct_IO{,_bvec}().
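
Roughly, both direct IO entry points now share the orphan bookkeeping
helpers and differ only in which block-device helper they invoke (a
sketch of the resulting structure, not code from the patch):

	ext3_direct_IO()      -> ext3_journal_orphan_add/_del() + blockdev_direct_IO()
	ext3_direct_IO_bvec() -> ext3_journal_orphan_add/_del() + blockdev_direct_IO_bvec()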

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
---
 drivers/block/loop.c |  55 ++++++++++++++++++-
 fs/ext3/file.c       |   2 +
 fs/ext3/inode.c      | 149 ++++++++++++++++++++++++++++++++++----------------
 include/linux/loop.h |   1 +
 4 files changed, 160 insertions(+), 47 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cd50435..cdc34e1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,7 @@
#include <linux/sysfs.h>
#include <linux/miscdevice.h>
#include <linux/falloc.h>
+#include <linux/aio.h>

#include <asm/uaccess.h>

@@ -213,6 +214,46 @@ lo_do_transfer(struct loop_device *lo, int cmd,
return lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock);
}

+void lo_rw_aio_complete(u64 data, long res)
+{
+ struct bio *bio = (struct bio *)data;
+
+ if (res > 0)
+ res = 0;
+ else if (res < 0)
+ res = -EIO;
+
+ bio_endio(bio, res);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct bio *bio)
+{
+ struct file *file = lo->lo_backing_file;
+ struct kiocb *iocb;
+ unsigned short op;
+ struct iov_iter iter;
+ struct bio_vec *bvec;
+ size_t nr_segs;
+ loff_t pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;
+
+ iocb = aio_kernel_alloc(GFP_NOIO);
+ if (!iocb)
+ return -ENOMEM;
+
+ if (bio_rw(bio) & WRITE)
+ op = IOCB_CMD_WRITE_ITER;
+ else
+ op = IOCB_CMD_READ_ITER;
+
+ bvec = bio_iovec_idx(bio, bio->bi_idx);
+ nr_segs = bio_segments(bio);
+ iov_iter_init_bvec(&iter, bvec, nr_segs, bvec_length(bvec, nr_segs), 0);
+ aio_kernel_init_iter(iocb, file, op, &iter, pos);
+ aio_kernel_init_callback(iocb, lo_rw_aio_complete, (u64)bio);
+
+ return aio_kernel_submit(iocb);
+}
+
/**
* __do_lo_send_write - helper for writing data to a loop device
*
@@ -512,7 +553,14 @@ static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
do_loop_switch(lo, bio->bi_private);
bio_put(bio);
} else {
- int ret = do_bio_filebacked(lo, bio);
+ int ret;
+ if (lo->lo_flags & LO_FLAGS_USE_AIO &&
+ lo->transfer == transfer_none) {
+ ret = lo_rw_aio(lo, bio);
+ if (ret == 0)
+ return;
+ } else
+ ret = do_bio_filebacked(lo, bio);
bio_endio(bio, ret);
}
}
@@ -854,6 +902,11 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
!file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;

+ if (file->f_op->write_iter && file->f_op->read_iter) {
+ file->f_flags |= O_DIRECT;
+ lo_flags |= LO_FLAGS_USE_AIO;
+ }
+
lo_blocksize = S_ISBLK(inode->i_mode) ?
inode->i_bdev->bd_block_size : PAGE_SIZE;

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 724df69..30447a5 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -58,6 +58,8 @@ const struct file_operations ext3_file_operations = {
.write = do_sync_write,
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext3_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2d0afec..ea414f0 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1853,6 +1853,70 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
return journal_try_to_free_buffers(journal, page, wait);
}

+static ssize_t ext3_journal_orphan_add(struct inode *inode)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ handle_t *handle;
+ ssize_t ret;
+
+ /* Credits for sb + inode write */
+ handle = ext3_journal_start(inode, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = ext3_orphan_add(handle, inode);
+ if (ret) {
+ ext3_journal_stop(handle);
+ goto out;
+ }
+ ei->i_disksize = inode->i_size;
+ ext3_journal_stop(handle);
+out:
+ return ret;
+}
+
+static ssize_t ext3_journal_orphan_del(struct inode *inode, ssize_t ret,
+ loff_t offset)
+{
+ struct ext3_inode_info *ei = EXT3_I(inode);
+ handle_t *handle;
+ int err;
+
+ /* Credits for sb + inode write */
+ handle = ext3_journal_start(inode, 2);
+ if (IS_ERR(handle)) {
+ /* This is really bad luck. We've written the data
+ * but cannot extend i_size. Truncate allocated blocks
+ * and pretend the write failed... */
+ ext3_truncate_failed_direct_write(inode);
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ if (inode->i_nlink)
+ ext3_orphan_del(handle, inode);
+ if (ret > 0) {
+ loff_t end = offset + ret;
+ if (end > inode->i_size) {
+ ei->i_disksize = end;
+ i_size_write(inode, end);
+ /*
+ * We're going to return a positive `ret'
+ * here due to non-zero-length I/O, so there's
+ * no way of reporting error returns from
+ * ext3_mark_inode_dirty() to userspace. So
+ * ignore it.
+ */
+ ext3_mark_inode_dirty(handle, inode);
+ }
+ }
+ err = ext3_journal_stop(handle);
+ if (ret == 0)
+ ret = err;
+out:
+ return ret;
+}
+
/*
* If the O_DIRECT write will extend the file then add this inode to the
* orphan list. So recovery will truncate it back to the original size
@@ -1868,8 +1932,6 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
- struct ext3_inode_info *ei = EXT3_I(inode);
- handle_t *handle;
ssize_t ret;
int orphan = 0;
size_t count = iov_length(iov, nr_segs);
@@ -1881,20 +1943,10 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
loff_t final_size = offset + count;

if (final_size > inode->i_size) {
- /* Credits for sb + inode write */
- handle = ext3_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
+ ret = ext3_journal_orphan_add(inode);
+ if (ret)
goto out;
- }
- ret = ext3_orphan_add(handle, inode);
- if (ret) {
- ext3_journal_stop(handle);
- goto out;
- }
orphan = 1;
- ei->i_disksize = inode->i_size;
- ext3_journal_stop(handle);
}
}

@@ -1915,43 +1967,46 @@ retry:
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;

- if (orphan) {
- int err;
+ if (orphan)
+ ret = ext3_journal_orphan_del(inode, ret, offset);
+out:
+ return ret;
+}

- /* Credits for sb + inode write */
- handle = ext3_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- /* This is really bad luck. We've written the data
- * but cannot extend i_size. Truncate allocated blocks
- * and pretend the write failed... */
- ext3_truncate_failed_direct_write(inode);
- ret = PTR_ERR(handle);
- goto out;
- }
- if (inode->i_nlink)
- ext3_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size) {
- ei->i_disksize = end;
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext3_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext3_mark_inode_dirty(handle, inode);
- }
+static ssize_t ext3_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ int orphan = 0;
+ size_t count = bvec_length(bvec, bvec_len);
+ int retries = 0;
+
+ if (rw == WRITE) {
+ loff_t final_size = offset + count;
+
+ if (final_size > inode->i_size) {
+ ret = ext3_journal_orphan_add(inode);
+ if (ret)
+ goto out;
+ orphan = 1;
}
- err = ext3_journal_stop(handle);
- if (ret == 0)
- ret = err;
}
+
+retry:
+ ret = blockdev_direct_IO_bvec(rw, iocb, inode, inode->i_sb->s_bdev,
+ bvec, offset, bvec_len, ext3_get_block,
+ NULL);
+ if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ if (orphan)
+ ret = ext3_journal_orphan_del(inode, ret, offset);
out:
trace_ext3_direct_IO_exit(inode, offset,
- iov_length(iov, nr_segs), rw, ret);
+ bvec_length(bvec, bvec_len), rw, ret);
return ret;
}

@@ -1984,6 +2039,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .direct_IO_bvec = ext3_direct_IO_bvec,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -1999,6 +2055,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.direct_IO = ext3_direct_IO,
+ .direct_IO_bvec = ext3_direct_IO_bvec,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
diff --git a/include/linux/loop.h b/include/linux/loop.h
index 11a41a8..5163fd3 100644
--- a/include/linux/loop.h
+++ b/include/linux/loop.h
@@ -75,6 +75,7 @@ enum {
LO_FLAGS_READ_ONLY = 1,
LO_FLAGS_AUTOCLEAR = 4,
LO_FLAGS_PARTSCAN = 8,
+ LO_FLAGS_USE_AIO = 16,
};

#include <asm/posix_types.h> /* for __kernel_old_dev_t */
--
1.7.9.2

2012-02-27 21:20:43

by Dave Kleikamp

Subject: [RFC PATCH 20/22] ext4: add support for read_iter, write_iter, and direct_IO_bvec

Some helpers were broken out of ext4_ind_direct_IO() and
ext4_ext_direct_IO() in order to avoid code duplication in the new
bio_vec-based functions.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
---
 fs/ext4/ext4.h     |   3 +
 fs/ext4/file.c     |   2 +
 fs/ext4/indirect.c | 169 +++++++++++++++++++++++++++++++-----------
 fs/ext4/inode.c    | 206 +++++++++++++++++++++++++++++++++++-----------------
 4 files changed, 268 insertions(+), 112 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 513004f..6426d43 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1905,6 +1905,9 @@ extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs);
+extern ssize_t ext4_ind_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len);
extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks, int chunk);
extern void ext4_ind_truncate(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index cb70f18..ce76745 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -234,6 +234,8 @@ const struct file_operations ext4_file_operations = {
.write = do_sync_write,
.aio_read = generic_file_aio_read,
.aio_write = ext4_file_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext4_compat_ioctl,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 830e1b2..e8ca3b9 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -760,6 +760,72 @@ out:
return err;
}

+static ssize_t ext4_journal_orphan_add(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ handle_t *handle;
+ ssize_t ret;
+
+ /* Credits for sb + inode write */
+ handle = ext4_journal_start(inode, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+ ei->i_disksize = inode->i_size;
+ ext4_journal_stop(handle);
+out:
+ return ret;
+}
+
+static ssize_t ext4_journal_orphan_del(struct inode *inode, ssize_t ret,
+ loff_t offset)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ handle_t *handle;
+ int err;
+
+ /* Credits for sb + inode write */
+ handle = ext4_journal_start(inode, 2);
+ if (IS_ERR(handle)) {
+ /* This is really bad luck. We've written the data
+ * but cannot extend i_size. Bail out and pretend
+ * the write failed... */
+ ret = PTR_ERR(handle);
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+
+ goto out;
+ }
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ if (ret > 0) {
+ loff_t end = offset + ret;
+ if (end > inode->i_size) {
+ ei->i_disksize = end;
+ i_size_write(inode, end);
+ /*
+ * We're going to return a positive `ret'
+ * here due to non-zero-length I/O, so there's
+ * no way of reporting error returns from
+ * ext4_mark_inode_dirty() to userspace. So
+ * ignore it.
+ */
+ ext4_mark_inode_dirty(handle, inode);
+ }
+ }
+ err = ext4_journal_stop(handle);
+ if (ret == 0)
+ ret = err;
+out:
+ return ret;
+}
+
/*
* O_DIRECT for ext3 (or indirect map) based files
*
@@ -778,7 +844,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
struct ext4_inode_info *ei = EXT4_I(inode);
- handle_t *handle;
ssize_t ret;
int orphan = 0;
size_t count = iov_length(iov, nr_segs);
@@ -788,20 +853,10 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
loff_t final_size = offset + count;

if (final_size > inode->i_size) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
+ ret = ext4_journal_orphan_add(inode);
+ if (ret)
goto out;
- }
orphan = 1;
- ei->i_disksize = inode->i_size;
- ext4_journal_stop(handle);
}
}

@@ -831,42 +886,68 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;

- if (orphan) {
- int err;
+ if (orphan)
+ ret = ext4_journal_orphan_del(inode, ret, offset);
+out:
+ return ret;
+}

- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, 2);
- if (IS_ERR(handle)) {
- /* This is really bad luck. We've written the data
- * but cannot extend i_size. Bail out and pretend
- * the write failed... */
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
+/*
+ * Like ext4_ind_direct_IO, but operates on bio_vec instead of iovec
+ */
+ssize_t ext4_ind_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ ssize_t ret;
+ int orphan = 0;
+ size_t count = bvec_length(bvec, bvec_len);
+ int retries = 0;
+
+ if (rw == WRITE) {
+ loff_t final_size = offset + count;

- goto out;
+ if (final_size > inode->i_size) {
+ ret = ext4_journal_orphan_add(inode);
+ if (ret)
+ goto out;
+ orphan = 1;
}
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size) {
- ei->i_disksize = end;
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
+ }
+
+retry:
+ if (rw == READ && ext4_should_dioread_nolock(inode)) {
+ if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+ mutex_lock(&inode->i_mutex);
+ ext4_flush_completed_IO(inode);
+ mutex_unlock(&inode->i_mutex);
+ }
+ ret = __blockdev_direct_IO_bvec(rw, iocb, inode,
+ inode->i_sb->s_bdev, bvec,
+ offset, bvec_len,
+ ext4_get_block, NULL, NULL, 0);
+ } else {
+ ret = blockdev_direct_IO_bvec(rw, iocb, inode,
+ inode->i_sb->s_bdev, bvec,
+ offset, bvec_len,
+ ext4_get_block, NULL);
+
+ if (unlikely((rw & WRITE) && ret < 0)) {
+ loff_t isize = i_size_read(inode);
+ loff_t end = offset + bvec_length(bvec, bvec_len);
+
+ if (end > isize)
+ ext4_truncate_failed_write(inode);
}
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
}
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ if (orphan)
+ ret = ext4_journal_orphan_del(inode, ret, offset);
out:
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index feaa82f..922b26f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2764,7 +2764,7 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,

ext_debug("ext4_end_io_dio(): io_end 0x%p "
"for inode %lu, iocb 0x%p, offset %llu, size %llu\n",
- iocb->private, io_end->inode->i_ino, iocb, offset,
+ iocb->private, io_end->inode->i_ino, iocb, offset,
size);

iocb->private = NULL;
@@ -2868,6 +2868,85 @@ retry:
return 0;
}

+static ssize_t ext4_ext_direct_IO_pre_write(struct kiocb *iocb,
+ struct inode *inode)
+{
+ /*
+ * We could direct write to holes and fallocate.
+ *
+ * Allocated blocks to fill the hole are marked as uninitialized
+ * to prevent parallel buffered read to expose the stale data
+ * before DIO complete the data IO.
+ *
+ * As to previously fallocated extents, ext4 get_block
+ * will just simply mark the buffer mapped but still
+ * keep the extents uninitialized.
+ *
+ * for non AIO case, we will convert those unwritten extents
+ * to written after return back from blockdev_direct_IO.
+ *
+ * for async DIO, the conversion needs to be deferred when
+ * the IO is completed. The ext4 end_io callback function
+ * will be called to take care of the conversion work.
+ * Here for async case, we allocate an io_end structure to
+ * hook to the iocb.
+ */
+ iocb->private = NULL;
+ EXT4_I(inode)->cur_aio_dio = NULL;
+ if (!is_sync_kiocb(iocb)) {
+ iocb->private = ext4_init_io_end(inode, GFP_NOFS);
+ if (!iocb->private)
+ return -ENOMEM;
+ /*
+ * we save the io structure for current async
+ * direct IO, so that later ext4_map_blocks()
+ * could flag the io structure whether there
+ * are unwritten extents that need to be converted
+ * when IO is completed.
+ */
+ EXT4_I(inode)->cur_aio_dio = iocb->private;
+ }
+ return 0;
+}
+
+static ssize_t ext4_ext_direct_IO_post_write(struct kiocb *iocb,
+ struct inode *inode,
+ loff_t offset, ssize_t ret)
+{
+ if (iocb->private)
+ EXT4_I(inode)->cur_aio_dio = NULL;
+ /*
+ * The io_end structure takes a reference to the inode,
+ * that structure needs to be destroyed and the
+ * reference to the inode need to be dropped, when IO is
+ * complete, even with 0 byte write, or failed.
+ *
+ * In the successful AIO DIO case, the io_end structure will be
+ * destroyed and the reference to the inode will be dropped
+ * after the end_io call back function is called.
+ *
+ * In the case there is 0 byte write, or error case, since
+ * VFS direct IO won't invoke the end_io call back function,
+ * we need to free the end_io structure here.
+ */
+ if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) {
+ ext4_free_io_end(iocb->private);
+ iocb->private = NULL;
+ } else if (ret > 0 &&
+ ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN)) {
+ int err;
+ /*
+ * for non AIO case, since the IO is already
+ * completed, we could do the conversion right here
+ */
+ err = ext4_convert_unwritten_extents(inode, offset, ret);
+ if (err < 0)
+ ret = err;
+ ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
+ }
+ return ret;
+}
+
/*
* For ext4 extent files, ext4 will do direct-io write to holes,
* preallocated extents, and those write extend the file, no need to
@@ -2898,41 +2977,9 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,

loff_t final_size = offset + count;
if (rw == WRITE && final_size <= inode->i_size) {
- /*
- * We could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as uninitialized
- * to prevent parallel buffered read to expose the stale data
- * before DIO complete the data IO.
- *
- * As to previously fallocated extents, ext4 get_block
- * will just simply mark the buffer mapped but still
- * keep the extents uninitialized.
- *
- * for non AIO case, we will convert those unwritten extents
- * to written after return back from blockdev_direct_IO.
- *
- * for async DIO, the conversion needs to be defered when
- * the IO is completed. The ext4 end_io callback function
- * will be called to take care of the conversion work.
- * Here for async case, we allocate an io_end structure to
- * hook to the iocb.
- */
- iocb->private = NULL;
- EXT4_I(inode)->cur_aio_dio = NULL;
- if (!is_sync_kiocb(iocb)) {
- iocb->private = ext4_init_io_end(inode, GFP_NOFS);
- if (!iocb->private)
- return -ENOMEM;
- /*
- * we save the io structure for current async
- * direct IO, so that later ext4_map_blocks()
- * could flag the io structure whether there
- * is a unwritten extents needs to be converted
- * when IO is completed.
- */
- EXT4_I(inode)->cur_aio_dio = iocb->private;
- }
+ ret = ext4_ext_direct_IO_pre_write(iocb, inode);
+ if (ret)
+ return ret;

ret = __blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
@@ -2941,38 +2988,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
ext4_end_io_dio,
NULL,
DIO_LOCKING | DIO_SKIP_HOLES);
- if (iocb->private)
- EXT4_I(inode)->cur_aio_dio = NULL;
- /*
- * The io_end structure takes a reference to the inode,
- * that structure needs to be destroyed and the
- * reference to the inode need to be dropped, when IO is
- * complete, even with 0 byte write, or failed.
- *
- * In the successful AIO DIO case, the io_end structure will be
- * desctroyed and the reference to the inode will be dropped
- * after the end_io call back function is called.
- *
- * In the case there is 0 byte write, or error case, since
- * VFS direct IO won't invoke the end_io call back function,
- * we need to free the end_io structure here.
- */
- if (ret != -EIOCBQUEUED && ret <= 0 && iocb->private) {
- ext4_free_io_end(iocb->private);
- iocb->private = NULL;
- } else if (ret > 0 && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
+ ret = ext4_ext_direct_IO_post_write(iocb, inode, offset, ret);
return ret;
}

@@ -2980,6 +2996,37 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
}

+/*
+ * Like ext4_ext_direct_IO, but operates on a bio_vec rather than iovec.
+ */
+static ssize_t ext4_ext_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ size_t count = bvec_length(bvec, bvec_len);
+
+ loff_t final_size = offset + count;
+ if (rw == WRITE && final_size <= inode->i_size) {
+ ret = ext4_ext_direct_IO_pre_write(iocb, inode);
+ if (ret)
+ return ret;
+
+ ret = blockdev_direct_IO_bvec(rw, iocb, inode,
+ inode->i_sb->s_bdev, bvec,
+ offset, bvec_len,
+ ext4_get_block_write,
+ ext4_end_io_dio);
+ ret = ext4_ext_direct_IO_post_write(iocb, inode, offset, ret);
+ return ret;
+ }
+
+ /* for the write to end of file case, we fall back to the old way */
+ return ext4_ind_direct_IO_bvec(rw, iocb, bvec, offset, bvec_len);
+}
+
static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
@@ -3004,6 +3051,25 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
return ret;
}

+static ssize_t ext4_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+
+ trace_ext4_direct_IO_enter(inode, offset, bvec_length(bvec, bvec_len),
+ rw);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ ret = ext4_ext_direct_IO_bvec(rw, iocb, bvec, offset, bvec_len);
+ else
+ ret = ext4_ind_direct_IO_bvec(rw, iocb, bvec, offset, bvec_len);
+ trace_ext4_direct_IO_exit(inode, offset, bvec_length(bvec, bvec_len),
+ rw, ret);
+ return ret;
+}
+
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3033,6 +3099,7 @@ static const struct address_space_operations ext4_ordered_aops = {
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
.direct_IO = ext4_direct_IO,
+ .direct_IO_bvec = ext4_direct_IO_bvec,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3048,6 +3115,7 @@ static const struct address_space_operations ext4_writeback_aops = {
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
.direct_IO = ext4_direct_IO,
+ .direct_IO_bvec = ext4_direct_IO_bvec,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3064,6 +3132,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
.direct_IO = ext4_direct_IO,
+ .direct_IO_bvec = ext4_direct_IO_bvec,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3079,6 +3148,7 @@ static const struct address_space_operations ext4_da_aops = {
.invalidatepage = ext4_da_invalidatepage,
.releasepage = ext4_releasepage,
.direct_IO = ext4_direct_IO,
+ .direct_IO_bvec = ext4_direct_IO_bvec,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
1.7.9.2

2012-02-27 21:20:49

by Dave Kleikamp

Subject: [RFC PATCH 19/22] ocfs2: add support for read_iter, write_iter, and direct_IO_bvec

From: Zach Brown <[email protected]>

ocfs2's .aio_read and .aio_write methods are changed to take an
iov_iter and pass it to the generic functions. Wrappers are made to
pack the iovecs into iters and call these new functions.

ocfs2_direct_IO() is trivial enough that a new function is made which
passes the bvec down to the generic direct path.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: [email protected]
---
 fs/ocfs2/aops.c        | 31 ++++++++++++++++++
 fs/ocfs2/file.c        | 82 ++++++++++++++++++++++++++++++++++--------------
 fs/ocfs2/ocfs2_trace.h |  6 +++-
 3 files changed, 94 insertions(+), 25 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 78b68af..80183df 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -645,6 +645,36 @@ static ssize_t ocfs2_direct_IO(int rw,
ocfs2_dio_end_io, NULL, 0);
}

+static ssize_t ocfs2_direct_IO_bvec(int rw,
+ struct kiocb *iocb,
+ struct bio_vec *bvec,
+ loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host;
+ int ret;
+
+ /*
+ * Fallback to buffered I/O if we see an inode without
+ * extents.
+ */
+ if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
+ return 0;
+
+ /* Fallback to buffered I/O if we are appending. */
+ if (i_size_read(inode) <= offset)
+ return 0;
+
+ ret = blockdev_direct_IO_bvec_no_locking(rw, iocb, inode,
+ inode->i_sb->s_bdev, bvec,
+ offset, bvec_len,
+ ocfs2_direct_IO_get_blocks,
+ ocfs2_dio_end_io);
+
+ return ret;
+}
+
static void ocfs2_figure_cluster_boundaries(struct ocfs2_super *osb,
u32 cpos,
unsigned int *start,
@@ -2091,6 +2121,7 @@ const struct address_space_operations ocfs2_aops = {
.write_end = ocfs2_write_end,
.bmap = ocfs2_bmap,
.direct_IO = ocfs2_direct_IO,
+ .direct_IO_bvec = ocfs2_direct_IO_bvec,
.invalidatepage = ocfs2_invalidatepage,
.releasepage = ocfs2_releasepage,
.migratepage = buffer_migrate_page,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 061591a..f636813 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2233,15 +2233,13 @@ out:
return ret;
}

-static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs,
- loff_t pos)
+static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
+ struct iov_iter *iter,
+ loff_t pos)
{
int ret, direct_io, appending, rw_level, have_alloc_sem = 0;
int can_do_direct, has_refcount = 0;
ssize_t written = 0;
- size_t ocount; /* original count */
size_t count; /* after file limit checks */
loff_t old_size, *ppos = &iocb->ki_pos;
u32 old_clusters;
@@ -2252,11 +2250,11 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
OCFS2_MOUNT_COHERENCY_BUFFERED);
int unaligned_dio = 0;

- trace_ocfs2_file_aio_write(inode, file, file->f_path.dentry,
+ trace_ocfs2_file_write_iter(inode, file, file->f_path.dentry,
(unsigned long long)OCFS2_I(inode)->ip_blkno,
file->f_path.dentry->d_name.len,
file->f_path.dentry->d_name.name,
- (unsigned int)nr_segs);
+ (unsigned long long)pos);

if (iocb->ki_left == 0)
return 0;
@@ -2358,28 +2356,24 @@ relock:
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_rw_locked(iocb, rw_level);

- ret = generic_segment_checks(iov, &nr_segs, &ocount,
- VERIFY_READ);
- if (ret)
- goto out_dio;

- count = ocount;
+ count = iov_iter_count(iter);
ret = generic_write_checks(file, ppos, &count,
S_ISBLK(inode->i_mode));
if (ret)
goto out_dio;

if (direct_io) {
- written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
- ppos, count, ocount);
+ written = generic_file_direct_write_iter(iocb, iter, *ppos,
+ ppos, count);
if (written < 0) {
ret = written;
goto out_dio;
}
} else {
current->backing_dev_info = file->f_mapping->backing_dev_info;
- written = generic_file_buffered_write(iocb, iov, nr_segs, *ppos,
- ppos, count, 0);
+ written = generic_file_buffered_write_iter(iocb, iter, *ppos,
+ ppos, 0);
current->backing_dev_info = NULL;
}

@@ -2440,6 +2434,25 @@ out_sems:
return ret;
}

+static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
+ const struct iovec *iov,
+ unsigned long nr_segs,
+ loff_t pos)
+{
+ struct iov_iter iter;
+ size_t count;
+ int ret;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ return ocfs2_file_write_iter(iocb, &iter, pos);
+}
+
static int ocfs2_splice_to_file(struct pipe_inode_info *pipe,
struct file *out,
struct splice_desc *sd)
@@ -2553,19 +2566,18 @@ bail:
return ret;
}

-static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs,
+static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
+ struct iov_iter *iter,
loff_t pos)
{
int ret = 0, rw_level = -1, have_alloc_sem = 0, lock_level = 0;
struct file *filp = iocb->ki_filp;
struct inode *inode = filp->f_path.dentry->d_inode;

- trace_ocfs2_file_aio_read(inode, filp, filp->f_path.dentry,
+ trace_ocfs2_file_read_iter(inode, filp, filp->f_path.dentry,
(unsigned long long)OCFS2_I(inode)->ip_blkno,
filp->f_path.dentry->d_name.len,
- filp->f_path.dentry->d_name.name, nr_segs);
+ filp->f_path.dentry->d_name.name, pos);


if (!inode) {
@@ -2601,7 +2613,7 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
*
* Take and drop the meta data lock to update inode fields
* like i_size. This allows the checks down below
- * generic_file_aio_read() a chance of actually working.
+ * generic_file_read_iter() a chance of actually working.
*/
ret = ocfs2_inode_lock_atime(inode, filp->f_vfsmnt, &lock_level);
if (ret < 0) {
@@ -2610,8 +2622,8 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
}
ocfs2_inode_unlock(inode, lock_level);

- ret = generic_file_aio_read(iocb, iov, nr_segs, iocb->ki_pos);
- trace_generic_file_aio_read_ret(ret);
+ ret = generic_file_read_iter(iocb, iter, iocb->ki_pos);
+ trace_generic_file_read_iter_ret(ret);

/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
@@ -2683,6 +2695,24 @@ out:
return offset;
}

+static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
+ const struct iovec *iov,
+ unsigned long nr_segs,
+ loff_t pos)
+{
+ struct iov_iter iter;
+ size_t count;
+ int ret;
+
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ return ocfs2_file_read_iter(iocb, &iter, pos);
+}
+
const struct inode_operations ocfs2_file_iops = {
.setattr = ocfs2_setattr,
.getattr = ocfs2_getattr,
@@ -2716,6 +2746,8 @@ const struct file_operations ocfs2_fops = {
.open = ocfs2_file_open,
.aio_read = ocfs2_file_aio_read,
.aio_write = ocfs2_file_aio_write,
+ .read_iter = ocfs2_file_read_iter,
+ .write_iter = ocfs2_file_write_iter,
.unlocked_ioctl = ocfs2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ocfs2_compat_ioctl,
@@ -2764,6 +2796,8 @@ const struct file_operations ocfs2_fops_no_plocks = {
.open = ocfs2_file_open,
.aio_read = ocfs2_file_aio_read,
.aio_write = ocfs2_file_aio_write,
+ .read_iter = ocfs2_file_read_iter,
+ .write_iter = ocfs2_file_write_iter,
.unlocked_ioctl = ocfs2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ocfs2_compat_ioctl,
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index 3b481f4..8409f00 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1312,12 +1312,16 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_sync_file);

DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);

+DEFINE_OCFS2_FILE_OPS(ocfs2_file_write_iter);
+
DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);

DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);

DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);

+DEFINE_OCFS2_FILE_OPS(ocfs2_file_read_iter);
+
DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);

DEFINE_OCFS2_ULL_ULL_EVENT(ocfs2_truncate_file_error);
@@ -1474,7 +1478,7 @@ TRACE_EVENT(ocfs2_prepare_inode_for_write,
__entry->direct_io, __entry->has_refcount)
);

-DEFINE_OCFS2_INT_EVENT(generic_file_aio_read_ret);
+DEFINE_OCFS2_INT_EVENT(generic_file_read_iter_ret);

/* End of trace events for fs/ocfs2/file.c. */

--
1.7.9.2

2012-02-27 21:20:32

by Dave Kleikamp

Subject: [RFC PATCH 12/22] dio: add dio_post_submission() helper function

From: Zach Brown <[email protected]>

This creates a function that contains all the code that is executed
after IO is submitted. It takes code from the end of
do_blockdev_direct_IO() and will be called by another entry point
added in an upcoming patch.
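
Both the existing do_blockdev_direct_IO() and the entry point added
later end their submission path the same way; the tail of the iovec
loop now reduces to a single call (as seen at the bottom of this
patch's diff):

	retval = dio_post_submission(rw, offset, dio, &sdio, &map_bh, retval);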

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
 fs/direct-io.c | 127 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 68 insertions(+), 59 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index e75b8d7..20bb84c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1187,6 +1187,73 @@ static int dio_lock_and_flush(struct dio *dio, loff_t offset, loff_t end)
return 0;
}

+static ssize_t dio_post_submission(int rw, loff_t offset, struct dio *dio,
+ struct dio_submit *sdio,
+ struct buffer_head *map_bh, ssize_t ret)
+{
+ if (ret == -ENOTBLK) {
+ /*
+ * The remaining part of the request will be
+ * handled by buffered I/O when we return
+ */
+ ret = 0;
+ }
+ /*
+ * There may be some unwritten disk at the end of a part-written
+ * fs-block-sized block. Go zero that now.
+ */
+ dio_zero_block(dio, sdio, 1, map_bh);
+
+ if (sdio->cur_page) {
+ ssize_t ret2;
+
+ ret2 = dio_send_cur_page(dio, sdio, map_bh);
+ if (ret == 0)
+ ret = ret2;
+ page_cache_release(sdio->cur_page);
+ sdio->cur_page = NULL;
+ }
+ if (sdio->bio)
+ dio_bio_submit(dio, sdio);
+
+ /*
+ * It is possible that, we return short IO due to end of file.
+ * In that case, we need to release all the pages we got hold on.
+ */
+ dio_cleanup(dio, sdio);
+
+ /*
+ * All block lookups have been performed. For READ requests
+ * we can let i_mutex go now that its achieved its purpose
+ * of protecting us from looking up uninitialized blocks.
+ */
+ if (rw == READ && (dio->flags & DIO_LOCKING))
+ mutex_unlock(&dio->inode->i_mutex);
+
+ /*
+ * The only time we want to leave bios in flight is when a successful
+ * partial aio read or full aio write have been setup. In that case
+ * bio completion will call aio_complete. The only time it's safe to
+ * call aio_complete is when we return -EIOCBQUEUED, so we key on that.
+ * This had *better* be the only place that raises -EIOCBQUEUED.
+ */
+ BUG_ON(ret == -EIOCBQUEUED);
+ if (dio->is_async && ret == 0 && dio->result &&
+ ((rw & READ) || (dio->result == sdio->size)))
+ ret = -EIOCBQUEUED;
+
+ if (ret != -EIOCBQUEUED)
+ dio_await_completion(dio);
+
+ if (drop_refcount(dio) == 0) {
+ ret = dio_complete(dio, offset, ret, false);
+ kmem_cache_free(dio_cache, dio);
+ } else
+ BUG_ON(ret != -EIOCBQUEUED);
+
+ return ret;
+}
+
/*
* This is a library function for use by filesystem drivers.
*
@@ -1302,65 +1369,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
}
} /* end iovec loop */

- if (retval == -ENOTBLK) {
- /*
- * The remaining part of the request will be
- * be handled by buffered I/O when we return
- */
- retval = 0;
- }
- /*
- * There may be some unwritten disk at the end of a part-written
- * fs-block-sized block. Go zero that now.
- */
- dio_zero_block(dio, &sdio, 1, &map_bh);
-
- if (sdio.cur_page) {
- ssize_t ret2;
-
- ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
- if (retval == 0)
- retval = ret2;
- page_cache_release(sdio.cur_page);
- sdio.cur_page = NULL;
- }
- if (sdio.bio)
- dio_bio_submit(dio, &sdio);
-
- /*
- * It is possible that, we return short IO due to end of file.
- * In that case, we need to release all the pages we got hold on.
- */
- dio_cleanup(dio, &sdio);
-
- /*
- * All block lookups have been performed. For READ requests
- * we can let i_mutex go now that its achieved its purpose
- * of protecting us from looking up uninitialized blocks.
- */
- if (rw == READ && (dio->flags & DIO_LOCKING))
- mutex_unlock(&dio->inode->i_mutex);
-
- /*
- * The only time we want to leave bios in flight is when a successful
- * partial aio read or full aio write have been setup. In that case
- * bio completion will call aio_complete. The only time it's safe to
- * call aio_complete is when we return -EIOCBQUEUED, so we key on that.
- * This had *better* be the only place that raises -EIOCBQUEUED.
- */
- BUG_ON(retval == -EIOCBQUEUED);
- if (dio->is_async && retval == 0 && dio->result &&
- ((rw & READ) || (dio->result == sdio.size)))
- retval = -EIOCBQUEUED;
-
- if (retval != -EIOCBQUEUED)
- dio_await_completion(dio);
-
- if (drop_refcount(dio) == 0) {
- retval = dio_complete(dio, offset, retval, false);
- kmem_cache_free(dio_cache, dio);
- } else
- BUG_ON(retval != -EIOCBQUEUED);
+ retval = dio_post_submission(rw, offset, dio, &sdio, &map_bh, retval);

out:
return retval;
--
1.7.9.2

2012-02-27 21:21:49

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 21/22] btrfs: add support for read_iter, write_iter, and direct_IO_bvec

Some helpers were broken out of btrfs_direct_IO() in order to avoid code
duplication in the new bio_vec-based function.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: [email protected]
---
fs/btrfs/file.c | 2 +
fs/btrfs/inode.c | 116 +++++++++++++++++++++++++++++++++++++-----------------
2 files changed, 82 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..7a2fbc0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1880,6 +1880,8 @@ const struct file_operations btrfs_file_operations = {
.aio_read = generic_file_aio_read,
.splice_read = generic_file_splice_read,
.aio_write = btrfs_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.mmap = btrfs_file_mmap,
.open = generic_file_open,
.release = btrfs_release_file,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..52199e7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6151,24 +6151,14 @@ static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *io
out:
return retval;
}
-static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+
+static ssize_t btrfs_pre_direct_IO(int writing, loff_t offset, size_t count,
+ struct inode *inode, int *write_bits)
{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
struct btrfs_ordered_extent *ordered;
struct extent_state *cached_state = NULL;
u64 lockstart, lockend;
ssize_t ret;
- int writing = rw & WRITE;
- int write_bits = 0;
- size_t count = iov_length(iov, nr_segs);
-
- if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iov,
- offset, nr_segs)) {
- return 0;
- }

lockstart = offset;
lockend = offset + count - 1;
@@ -6176,7 +6166,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
if (writing) {
ret = btrfs_delalloc_reserve_space(inode, count);
if (ret)
- goto out;
+ return ret;
}

while (1) {
@@ -6191,8 +6181,8 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
lockend - lockstart + 1);
if (!ordered)
break;
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend,
- &cached_state, GFP_NOFS);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
+ lockend, &cached_state, GFP_NOFS);
btrfs_start_ordered_extent(inode, ordered, 1);
btrfs_put_ordered_extent(ordered);
cond_resched();
@@ -6203,46 +6193,99 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
* the dirty or uptodate bits
*/
if (writing) {
- write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING;
- ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
- EXTENT_DELALLOC, 0, NULL, &cached_state,
- GFP_NOFS);
- if (ret) {
+ *write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING;
+ ret = set_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
+ lockend, EXTENT_DELALLOC, 0, NULL,
+ &cached_state, GFP_NOFS);
+ if (ret)
clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart,
- lockend, EXTENT_LOCKED | write_bits,
+ lockend, EXTENT_LOCKED | *write_bits,
1, 0, &cached_state, GFP_NOFS);
- goto out;
- }
}
-
free_extent_state(cached_state);
- cached_state = NULL;

- ret = __blockdev_direct_IO(rw, iocb, inode,
- BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
- iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
- btrfs_submit_direct, 0);
+ return ret;
+}
+
+static ssize_t btrfs_post_direct_IO(ssize_t ret, loff_t offset, size_t count,
+ struct inode *inode, int *write_bits)
+{
+ struct extent_state *cached_state = NULL;

if (ret < 0 && ret != -EIOCBQUEUED) {
clear_extent_bit(&BTRFS_I(inode)->io_tree, offset,
- offset + iov_length(iov, nr_segs) - 1,
- EXTENT_LOCKED | write_bits, 1, 0,
+ offset + count - 1,
+ EXTENT_LOCKED | *write_bits, 1, 0,
&cached_state, GFP_NOFS);
- } else if (ret >= 0 && ret < iov_length(iov, nr_segs)) {
+ } else if (ret >= 0 && ret < count) {
/*
* We're falling back to buffered, unlock the section we didn't
* do IO on.
*/
clear_extent_bit(&BTRFS_I(inode)->io_tree, offset + ret,
- offset + iov_length(iov, nr_segs) - 1,
- EXTENT_LOCKED | write_bits, 1, 0,
+ offset + count - 1,
+ EXTENT_LOCKED | *write_bits, 1, 0,
&cached_state, GFP_NOFS);
}
-out:
free_extent_state(cached_state);
return ret;
}

+static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
+ const struct iovec *iov, loff_t offset,
+ unsigned long nr_segs)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ int writing = rw & WRITE;
+ int write_bits = 0;
+ size_t count = iov_length(iov, nr_segs);
+
+ if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iov,
+ offset, nr_segs)) {
+ return 0;
+ }
+
+ ret = btrfs_pre_direct_IO(writing, offset, count, inode, &write_bits);
+ if (ret)
+ return ret;
+
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
+ iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
+ btrfs_submit_direct, 0);
+
+ ret = btrfs_post_direct_IO(ret, offset, iov_length(iov, nr_segs),
+ inode, &write_bits);
+ return ret;
+}
+
+static ssize_t btrfs_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ int writing = rw & WRITE;
+ int write_bits = 0;
+ size_t count = bvec_length(bvec, bvec_len);
+
+ ret = btrfs_pre_direct_IO(writing, offset, count, inode, &write_bits);
+ if (ret)
+ return ret;
+
+ ret = __blockdev_direct_IO_bvec(rw, iocb, inode,
+ BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
+ bvec, offset, bvec_len, btrfs_get_blocks_direct, NULL,
+ btrfs_submit_direct, 0);
+
+ ret = btrfs_post_direct_IO(ret, offset, bvec_length(bvec, bvec_len),
+ inode, &write_bits);
+ return ret;
+}
+
static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len)
{
@@ -7433,6 +7476,7 @@ static const struct address_space_operations btrfs_aops = {
.writepages = btrfs_writepages,
.readpages = btrfs_readpages,
.direct_IO = btrfs_direct_IO,
+ .direct_IO_bvec = btrfs_direct_IO_bvec,
.invalidatepage = btrfs_invalidatepage,
.releasepage = btrfs_releasepage,
.set_page_dirty = btrfs_set_page_dirty,
--
1.7.9.2

2012-02-27 21:21:47

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 03/22] fuse: convert fuse to use iov_iter_copy_[to|from]_user

A future patch hides the internals of struct iov_iter, so fuse should
be using the supported interface.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: [email protected]
---
fs/fuse/file.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4a199fd..877cee0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1582,30 +1582,17 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
while (iov_iter_count(&ii)) {
struct page *page = pages[page_idx++];
size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
- void *kaddr;
+ size_t left;

- kaddr = kmap(page);
-
- while (todo) {
- char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
- size_t iov_len = ii.iov->iov_len - ii.iov_offset;
- size_t copy = min(todo, iov_len);
- size_t left;
-
- if (!to_user)
- left = copy_from_user(kaddr, uaddr, copy);
- else
- left = copy_to_user(uaddr, kaddr, copy);
-
- if (unlikely(left))
- return -EFAULT;
+ if (!to_user)
+ left = iov_iter_copy_from_user(page, &ii, 0, todo);
+ else
+ left = iov_iter_copy_to_user(page, &ii, 0, todo);

- iov_iter_advance(&ii, copy);
- todo -= copy;
- kaddr += copy;
- }
+ if (unlikely(left))
+ return -EFAULT;

- kunmap(page);
+ iov_iter_advance(&ii, todo);
}

return 0;
--
1.7.9.2

2012-02-27 21:22:30

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 22/22] nfs: add support for read_iter, write_iter

This patch implements the read_iter and write_iter file operations, which
allow kernel code to initiate direct IO. This allows the loop device to
read and write directly to the server, bypassing the page cache.
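
For context, a kernel caller like loop would wrap its pages in a
bvec-backed iov_iter and call the new file operation. A minimal sketch,
assuming iov_iter_init_bvec() from earlier in this series takes the same
arguments as iov_iter_init() (error handling elided):

	struct iov_iter iter;
	size_t count = bvec_length(bvec, nr_segs);

	iov_iter_init_bvec(&iter, bvec, nr_segs, count, 0);
	ret = file->f_op->read_iter(iocb, &iter, pos);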

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: [email protected]
---
fs/nfs/direct.c | 508 +++++++++++++++++++++++++++++++++++++++---------
fs/nfs/file.c | 80 ++++++++
include/linux/nfs_fs.h | 4 +
3 files changed, 497 insertions(+), 95 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1940f1a..fc2c5c3 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -46,6 +46,7 @@
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/bio.h>

#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
@@ -87,6 +88,7 @@ struct nfs_direct_req {
int flags;
#define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */
#define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
+#define NFS_ODIRECT_MARK_DIRTY (4) /* mark read pages dirty */
struct nfs_writeverf verf; /* unstable write verifier */
};

@@ -253,9 +255,10 @@ static void nfs_direct_read_release(void *calldata)
} else {
dreq->count += data->res.count;
spin_unlock(&dreq->lock);
- nfs_direct_dirty_pages(data->pagevec,
- data->args.pgbase,
- data->res.count);
+ if (dreq->flags & NFS_ODIRECT_MARK_DIRTY)
+ nfs_direct_dirty_pages(data->pagevec,
+ data->args.pgbase,
+ data->res.count);
}
nfs_direct_release_pages(data->pagevec, data->npages);

@@ -273,21 +276,15 @@ static const struct rpc_call_ops nfs_read_direct_ops = {
};

/*
- * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
- * operation. If nfs_readdata_alloc() or get_user_pages() fails,
- * bail and stop sending more reads. Read length accounting is
- * handled automatically by nfs_direct_read_result(). Otherwise, if
- * no requests have been sent, just return an error.
+ * upon entry, data->pagevec contains pinned pages
*/
-static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
- const struct iovec *iov,
- loff_t pos)
+static ssize_t nfs_direct_read_schedule_helper(struct nfs_direct_req *dreq,
+ struct nfs_read_data *data,
+ size_t addr, size_t count,
+ loff_t pos)
{
struct nfs_open_context *ctx = dreq->ctx;
struct inode *inode = ctx->dentry->d_inode;
- unsigned long user_addr = (unsigned long)iov->iov_base;
- size_t count = iov->iov_len;
- size_t rsize = NFS_SERVER(inode)->rsize;
struct rpc_task *task;
struct rpc_message msg = {
.rpc_cred = ctx->cred,
@@ -299,6 +296,61 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
.workqueue = nfsiod_workqueue,
.flags = RPC_TASK_ASYNC,
};
+ unsigned int pgbase = addr & ~PAGE_MASK;
+
+ get_dreq(dreq);
+
+ data->req = (struct nfs_page *) dreq;
+ data->inode = inode;
+ data->cred = msg.rpc_cred;
+ data->args.fh = NFS_FH(inode);
+ data->args.context = ctx;
+ data->args.lock_context = dreq->l_ctx;
+ data->args.offset = pos;
+ data->args.pgbase = pgbase;
+ data->args.pages = data->pagevec;
+ data->args.count = count;
+ data->res.fattr = &data->fattr;
+ data->res.eof = 0;
+ data->res.count = count;
+ nfs_fattr_init(&data->fattr);
+ msg.rpc_argp = &data->args;
+ msg.rpc_resp = &data->res;
+
+ task_setup_data.task = &data->task;
+ task_setup_data.callback_data = data;
+ NFS_PROTO(inode)->read_setup(data, &msg);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+ rpc_put_task(task);
+
+ dprintk("NFS: %5u initiated direct read call "
+ "(req %s/%Ld, %zu bytes @ offset %Lu)\n",
+ data->task.tk_pid, inode->i_sb->s_id,
+ (long long)NFS_FILEID(inode), count,
+ (unsigned long long)data->args.offset);
+
+ return count;
+}
+
+/*
+ * For each rsize'd chunk of the user's buffer, dispatch an NFS READ
+ * operation. If nfs_readdata_alloc() or get_user_pages() fails,
+ * bail and stop sending more reads. Read length accounting is
+ * handled automatically by nfs_direct_read_result(). Otherwise, if
+ * no requests have been sent, just return an error.
+ */
+static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
+ const struct iovec *iov,
+ loff_t pos)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->dentry->d_inode;
+ unsigned long user_addr = (unsigned long)iov->iov_base;
+ size_t count = iov->iov_len;
+ size_t rsize = NFS_SERVER(inode)->rsize;
unsigned int pgbase;
int result;
ssize_t started = 0;
@@ -334,41 +386,10 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
data->npages = result;
}

- get_dreq(dreq);
-
- data->req = (struct nfs_page *) dreq;
- data->inode = inode;
- data->cred = msg.rpc_cred;
- data->args.fh = NFS_FH(inode);
- data->args.context = ctx;
- data->args.lock_context = dreq->l_ctx;
- data->args.offset = pos;
- data->args.pgbase = pgbase;
- data->args.pages = data->pagevec;
- data->args.count = bytes;
- data->res.fattr = &data->fattr;
- data->res.eof = 0;
- data->res.count = bytes;
- nfs_fattr_init(&data->fattr);
- msg.rpc_argp = &data->args;
- msg.rpc_resp = &data->res;
-
- task_setup_data.task = &data->task;
- task_setup_data.callback_data = data;
- NFS_PROTO(inode)->read_setup(data, &msg);
-
- task = rpc_run_task(&task_setup_data);
- if (IS_ERR(task))
+ bytes = nfs_direct_read_schedule_helper(dreq, data, user_addr,
+ bytes, pos);
+ if (bytes < 0)
break;
- rpc_put_task(task);
-
- dprintk("NFS: %5u initiated direct read call "
- "(req %s/%Ld, %zu bytes @ offset %Lu)\n",
- data->task.tk_pid,
- inode->i_sb->s_id,
- (long long)NFS_FILEID(inode),
- bytes,
- (unsigned long long)data->args.offset);

started += bytes;
user_addr += bytes;
@@ -440,6 +461,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
goto out_release;
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;
+ dreq->flags = NFS_ODIRECT_MARK_DIRTY;

result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos);
if (!result)
@@ -450,6 +472,90 @@ out:
return result;
}

+static ssize_t nfs_direct_read_schedule_bvec(struct nfs_direct_req *dreq,
+ struct bio_vec *bvec,
+ unsigned long nr_segs,
+ loff_t pos)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->dentry->d_inode;
+ size_t rsize = NFS_SERVER(inode)->rsize;
+ struct nfs_read_data *data;
+ ssize_t result = 0;
+ size_t requested_bytes = 0;
+ int seg;
+ size_t addr;
+ size_t count;
+
+ get_dreq(dreq);
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ data = nfs_readdata_alloc(1);
+ if (unlikely(!data)) {
+ result = -ENOMEM;
+ break;
+ }
+ page_cache_get(bvec[seg].bv_page);
+ data->pagevec[0] = bvec[seg].bv_page;
+ addr = bvec[seg].bv_offset;
+ count = bvec[seg].bv_len;
+ do {
+ size_t bytes = min(rsize, count);
+ result = nfs_direct_read_schedule_helper(dreq, data,
+ addr, bytes,
+ pos);
+ if (result < 0)
+ goto out;
+
+ requested_bytes += bytes;
+ addr += bytes;
+ pos += bytes;
+ count -= bytes;
+ } while (count);
+ }
+out:
+ /*
+ * If no bytes were started, return the error, and let the
+ * generic layer handle the completion.
+ */
+ if (requested_bytes == 0) {
+ nfs_direct_req_release(dreq);
+ return result < 0 ? result : -EIO;
+ }
+
+ if (put_dreq(dreq))
+ nfs_direct_complete(dreq);
+ return 0;
+}
+
+static ssize_t nfs_direct_read_bvec(struct kiocb *iocb, struct bio_vec *bvec,
+ unsigned long nr_segs, loff_t pos)
+{
+ ssize_t result = -ENOMEM;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ struct nfs_direct_req *dreq;
+
+ dreq = nfs_direct_req_alloc();
+ if (dreq == NULL)
+ goto out;
+
+ dreq->inode = inode;
+ dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+ dreq->l_ctx = nfs_get_lock_context(dreq->ctx);
+ if (dreq->l_ctx == NULL)
+ goto out_release;
+ if (!is_sync_kiocb(iocb))
+ dreq->iocb = iocb;
+
+ result = nfs_direct_read_schedule_bvec(dreq, bvec, nr_segs, pos);
+ if (!result)
+ result = nfs_direct_wait(dreq);
+out_release:
+ nfs_direct_req_release(dreq);
+out:
+ return result;
+}
+
static void nfs_direct_free_writedata(struct nfs_direct_req *dreq)
{
while (!list_empty(&dreq->rewrite_list)) {
@@ -704,20 +810,15 @@ static const struct rpc_call_ops nfs_write_direct_ops = {
};

/*
- * For each wsize'd chunk of the user's buffer, dispatch an NFS WRITE
- * operation. If nfs_writedata_alloc() or get_user_pages() fails,
- * bail and stop sending more writes. Write length accounting is
- * handled automatically by nfs_direct_write_result(). Otherwise, if
- * no requests have been sent, just return an error.
+ * upon entry, data->pagevec contains pinned pages
*/
-static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
- const struct iovec *iov,
- loff_t pos, int sync)
+static ssize_t nfs_direct_write_schedule_helper(struct nfs_direct_req *dreq,
+ struct nfs_write_data *data,
+ size_t addr, size_t count,
+ loff_t pos, int sync)
{
struct nfs_open_context *ctx = dreq->ctx;
struct inode *inode = ctx->dentry->d_inode;
- unsigned long user_addr = (unsigned long)iov->iov_base;
- size_t count = iov->iov_len;
struct rpc_task *task;
struct rpc_message msg = {
.rpc_cred = ctx->cred,
@@ -729,6 +830,63 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
.workqueue = nfsiod_workqueue,
.flags = RPC_TASK_ASYNC,
};
+ unsigned int pgbase = addr & ~PAGE_MASK;
+
+ get_dreq(dreq);
+
+ list_move_tail(&data->pages, &dreq->rewrite_list);
+
+ data->req = (struct nfs_page *) dreq;
+ data->inode = inode;
+ data->cred = msg.rpc_cred;
+ data->args.fh = NFS_FH(inode);
+ data->args.context = ctx;
+ data->args.lock_context = dreq->l_ctx;
+ data->args.offset = pos;
+ data->args.pgbase = pgbase;
+ data->args.pages = data->pagevec;
+ data->args.count = count;
+ data->args.stable = sync;
+ data->res.fattr = &data->fattr;
+ data->res.count = count;
+ data->res.verf = &data->verf;
+ nfs_fattr_init(&data->fattr);
+
+ task_setup_data.task = &data->task;
+ task_setup_data.callback_data = data;
+ msg.rpc_argp = &data->args;
+ msg.rpc_resp = &data->res;
+ NFS_PROTO(inode)->write_setup(data, &msg);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+ rpc_put_task(task);
+
+ dprintk("NFS: %5u initiated direct write call "
+ "(req %s/%Ld, %zu bytes @ offset %Lu)\n",
+ data->task.tk_pid, inode->i_sb->s_id,
+ (long long)NFS_FILEID(inode), count,
+ (unsigned long long)data->args.offset);
+
+ return count;
+}
+
+/*
+ * For each wsize'd chunk of the user's buffer, dispatch an NFS WRITE
+ * operation. If nfs_writedata_alloc() or get_user_pages() fails,
+ * bail and stop sending more writes. Write length accounting is
+ * handled automatically by nfs_direct_write_result(). Otherwise, if
+ * no requests have been sent, just return an error.
+ */
+static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
+ const struct iovec *iov,
+ loff_t pos, int sync)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->dentry->d_inode;
+ unsigned long user_addr = (unsigned long)iov->iov_base;
+ size_t count = iov->iov_len;
size_t wsize = NFS_SERVER(inode)->wsize;
unsigned int pgbase;
int result;
@@ -765,44 +923,10 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
data->npages = result;
}

- get_dreq(dreq);
-
- list_move_tail(&data->pages, &dreq->rewrite_list);
-
- data->req = (struct nfs_page *) dreq;
- data->inode = inode;
- data->cred = msg.rpc_cred;
- data->args.fh = NFS_FH(inode);
- data->args.context = ctx;
- data->args.lock_context = dreq->l_ctx;
- data->args.offset = pos;
- data->args.pgbase = pgbase;
- data->args.pages = data->pagevec;
- data->args.count = bytes;
- data->args.stable = sync;
- data->res.fattr = &data->fattr;
- data->res.count = bytes;
- data->res.verf = &data->verf;
- nfs_fattr_init(&data->fattr);
-
- task_setup_data.task = &data->task;
- task_setup_data.callback_data = data;
- msg.rpc_argp = &data->args;
- msg.rpc_resp = &data->res;
- NFS_PROTO(inode)->write_setup(data, &msg);
-
- task = rpc_run_task(&task_setup_data);
- if (IS_ERR(task))
+ result = nfs_direct_write_schedule_helper(dreq, data, user_addr,
+ bytes, pos, sync);
+ if (result < 0)
break;
- rpc_put_task(task);
-
- dprintk("NFS: %5u initiated direct write call "
- "(req %s/%Ld, %zu bytes @ offset %Lu)\n",
- data->task.tk_pid,
- inode->i_sb->s_id,
- (long long)NFS_FILEID(inode),
- bytes,
- (unsigned long long)data->args.offset);

started += bytes;
user_addr += bytes;
@@ -858,6 +982,98 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
return 0;
}

+static ssize_t nfs_direct_write_schedule_bvec(struct nfs_direct_req *dreq,
+ struct bio_vec *bvec,
+ size_t nr_segs, loff_t pos,
+ int sync)
+{
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->dentry->d_inode;
+ size_t wsize = NFS_SERVER(inode)->wsize;
+ struct nfs_write_data *data;
+ ssize_t result = 0;
+ size_t requested_bytes = 0;
+ unsigned long seg;
+ size_t addr;
+ size_t count;
+
+ get_dreq(dreq);
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ data = nfs_writedata_alloc(1);
+ if (unlikely(!data)) {
+ result = -ENOMEM;
+ break;
+ }
+
+ page_cache_get(bvec[seg].bv_page);
+ data->pagevec[0] = bvec[seg].bv_page;
+ addr = bvec[seg].bv_offset;
+ count = bvec[seg].bv_len;
+ do {
+ size_t bytes = min(wsize, count);
+ result = nfs_direct_write_schedule_helper(dreq, data,
+ addr, bytes,
+ pos, sync);
+ if (result < 0)
+ goto out;
+
+ requested_bytes += bytes;
+ addr += bytes;
+ pos += bytes;
+ count -= bytes;
+ } while (count);
+ }
+out:
+ /*
+ * If no bytes were started, return the error, and let the
+ * generic layer handle the completion.
+ */
+ if (requested_bytes == 0) {
+ nfs_direct_req_release(dreq);
+ return result < 0 ? result : -EIO;
+ }
+
+ if (put_dreq(dreq))
+ nfs_direct_write_complete(dreq, dreq->inode);
+ return 0;
+}
+
+static ssize_t nfs_direct_write_bvec(struct kiocb *iocb, struct bio_vec *bvec,
+ unsigned long nr_segs, loff_t pos,
+ size_t count)
+{
+ ssize_t result = -ENOMEM;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ struct nfs_direct_req *dreq;
+ size_t wsize = NFS_SERVER(inode)->wsize;
+ int sync = NFS_UNSTABLE;
+
+ dreq = nfs_direct_req_alloc();
+ if (!dreq)
+ goto out;
+ nfs_alloc_commit_data(dreq);
+
+ if (dreq->commit_data == NULL || count <= wsize)
+ sync = NFS_FILE_SYNC;
+
+ dreq->inode = inode;
+ dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+ dreq->l_ctx = nfs_get_lock_context(dreq->ctx);
+ if (dreq->l_ctx == NULL)
+ goto out_release;
+ if (!is_sync_kiocb(iocb))
+ dreq->iocb = iocb;
+
+ result = nfs_direct_write_schedule_bvec(dreq, bvec, nr_segs, pos, sync);
+ if (!result)
+ result = nfs_direct_wait(dreq);
+out_release:
+ nfs_direct_req_release(dreq);
+out:
+ return result;
+}
+
static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos,
size_t count)
@@ -948,6 +1164,53 @@ out:
return retval;
}

+ssize_t nfs_file_direct_read_bvec(struct kiocb *iocb, struct bio_vec *bvec,
+ unsigned long nr_segs, loff_t pos)
+{
+ ssize_t retval = -EINVAL;
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ size_t count;
+
+ count = bvec_length(bvec, nr_segs);
+ nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);
+
+ dfprintk(FILE, "NFS: direct read bvec(%s/%s, %zd@%Ld)\n",
+ file->f_path.dentry->d_parent->d_name.name,
+ file->f_path.dentry->d_name.name,
+ count, (long long) pos);
+
+ retval = 0;
+ if (!count)
+ goto out;
+
+ retval = nfs_sync_mapping(mapping);
+ if (retval)
+ goto out;
+
+ task_io_account_read(count);
+
+ retval = nfs_direct_read_bvec(iocb, bvec, nr_segs, pos);
+ if (retval > 0)
+ iocb->ki_pos = pos + retval;
+
+out:
+ return retval;
+}
+
+ssize_t nfs_file_direct_read_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ if (iov_iter_has_iovec(iter))
+ return nfs_file_direct_read(iocb, iov_iter_iovec(iter),
+ iter->nr_segs, pos);
+ else if (iov_iter_has_bvec(iter))
+ return nfs_file_direct_read_bvec(iocb, iov_iter_bvec(iter),
+ iter->nr_segs, pos);
+ else
+ BUG();
+}
+
/**
* nfs_file_direct_write - file direct write operation for NFS files
* @iocb: target I/O control block
@@ -1012,6 +1275,61 @@ out:
return retval;
}

+ssize_t nfs_file_direct_write_bvec(struct kiocb *iocb, struct bio_vec *bvec,
+ unsigned long nr_segs, loff_t pos)
+{
+ ssize_t retval = -EINVAL;
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ size_t count;
+
+ count = bvec_length(bvec, nr_segs);
+ nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);
+
+ dfprintk(FILE, "NFS: direct write(%s/%s, %zd@%Ld)\n",
+ file->f_path.dentry->d_parent->d_name.name,
+ file->f_path.dentry->d_name.name,
+ count, (long long) pos);
+
+ retval = generic_write_checks(file, &pos, &count, 0);
+ if (retval)
+ goto out;
+
+ retval = -EINVAL;
+ if ((ssize_t) count < 0)
+ goto out;
+ retval = 0;
+ if (!count)
+ goto out;
+
+ retval = nfs_sync_mapping(mapping);
+ if (retval)
+ goto out;
+
+ task_io_account_write(count);
+
+ retval = nfs_direct_write_bvec(iocb, bvec, nr_segs, pos, count);
+
+ if (retval > 0)
+ iocb->ki_pos = pos + retval;
+
+out:
+ return retval;
+}
+
+ssize_t nfs_file_direct_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ if (iov_iter_has_iovec(iter))
+ return nfs_file_direct_write(iocb, iov_iter_iovec(iter),
+ iter->nr_segs, pos);
+ else if (iov_iter_has_bvec(iter))
+ return nfs_file_direct_write_bvec(iocb, iov_iter_bvec(iter),
+ iter->nr_segs, pos);
+ else
+ BUG();
+}
+
/**
* nfs_init_directcache - create a slab cache for nfs_direct_req structures
*
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index c43a452..6fdb674 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -646,6 +646,82 @@ static ssize_t nfs_file_splice_write(struct pipe_inode_info *pipe,
return ret;
}

+ssize_t nfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ struct dentry *dentry = iocb->ki_filp->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ ssize_t result;
+ size_t count = iov_iter_count(iter);
+
+ if (iocb->ki_filp->f_flags & O_DIRECT)
+ return nfs_file_direct_read_iter(iocb, iter, pos);
+
+ dprintk("NFS: read_iter(%s/%s, %lu@%lu)\n",
+ dentry->d_parent->d_name.name, dentry->d_name.name,
+ (unsigned long) count, (unsigned long) pos);
+
+ result = nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
+ if (!result) {
+ result = generic_file_read_iter(iocb, iter, pos);
+ if (result > 0)
+ nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
+ }
+ return result;
+}
+
+ssize_t nfs_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ struct dentry *dentry = iocb->ki_filp->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ unsigned long written = 0;
+ ssize_t result;
+ size_t count = iov_iter_count(iter);
+
+ if (iocb->ki_filp->f_flags & O_DIRECT)
+ return nfs_file_direct_write_iter(iocb, iter, pos);
+
+ dprintk("NFS: write_iter(%s/%s, %lu@%Ld)\n",
+ dentry->d_parent->d_name.name, dentry->d_name.name,
+ (unsigned long) count, (long long) pos);
+
+ result = -EBUSY;
+ if (IS_SWAPFILE(inode))
+ goto out_swapfile;
+ /*
+ * O_APPEND implies that we must revalidate the file length.
+ */
+ if (iocb->ki_filp->f_flags & O_APPEND) {
+ result = nfs_revalidate_file_size(inode, iocb->ki_filp);
+ if (result)
+ goto out;
+ }
+
+ result = count;
+ if (!count)
+ goto out;
+
+ result = generic_file_write_iter(iocb, iter, pos);
+ if (result > 0)
+ written = result;
+
+ /* Return error values for O_DSYNC and IS_SYNC() */
+ if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
+ int err = vfs_fsync(iocb->ki_filp, 0);
+ if (err < 0)
+ result = err;
+ }
+ if (result > 0)
+ nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
+out:
+ return result;
+
+out_swapfile:
+ printk(KERN_INFO "NFS: attempt to write to active swap file!\n");
+ goto out;
+}
+
static int
do_getlk(struct file *filp, int cmd, struct file_lock *fl, int is_local)
{
@@ -853,6 +929,8 @@ const struct file_operations nfs_file_operations = {
.write = do_sync_write,
.aio_read = nfs_file_read,
.aio_write = nfs_file_write,
+ .read_iter = nfs_file_read_iter,
+ .write_iter = nfs_file_write_iter,
.mmap = nfs_file_mmap,
.open = nfs_file_open,
.flush = nfs_file_flush,
@@ -884,6 +962,8 @@ const struct file_operations nfs4_file_operations = {
.write = do_sync_write,
.aio_read = nfs_file_read,
.aio_write = nfs_file_write,
+ .read_iter = nfs_file_read_iter,
+ .write_iter = nfs_file_write_iter,
.mmap = nfs_file_mmap,
.open = nfs4_file_open,
.flush = nfs_file_flush,
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 8c29950..6bda672 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -459,6 +459,10 @@ extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
const struct iovec *iov, unsigned long nr_segs,
loff_t pos);
+extern ssize_t nfs_file_direct_read_iter(struct kiocb *iocb,
+ struct iov_iter *iter, loff_t pos);
+extern ssize_t nfs_file_direct_write_iter(struct kiocb *iocb,
+ struct iov_iter *iter, loff_t pos);

/*
* linux/fs/nfs/dir.c
--
1.7.9.2

2012-02-27 21:22:44

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 17/22] bio: add bvec_length(), like iov_length()

From: Zach Brown <[email protected]>
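
This mirrors iov_length() for bio_vec arrays; the bvec direct IO paths
use it to size a request, e.g. (sketch):

	size_t count = bvec_length(bvec, nr_segs);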

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/bio.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 129a9c0..913087d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -268,6 +268,14 @@ extern struct bio_vec *bvec_alloc_bs(gfp_t, int, unsigned long *, struct bio_set
extern void bvec_free_bs(struct bio_set *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);

+static inline ssize_t bvec_length(const struct bio_vec *bvec, unsigned long nr)
+{
+ ssize_t bytes = 0;
+ while (nr--)
+ bytes += (bvec++)->bv_len;
+ return bytes;
+}
+
/*
* bio_set is used to allow other portions of the IO system to
* allocate their own private memory pools for bio and iovec structures.
--
1.7.9.2

2012-02-27 21:20:28

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 11/22] dio: add dio_lock_and_flush() helper

From: Zach Brown <[email protected]>

This creates a helper function which performs locking based on
DIO_LOCKING and flushes dirty pages. This will be called by another
entry point like __blockdev_direct_IO() in an upcoming patch.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 56 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 1efe4f1..e75b8d7 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1158,6 +1158,35 @@ static void sdio_init(struct dio_submit *sdio, struct inode *inode,
sdio->pages_in_io = 2;
}

+static int dio_lock_and_flush(struct dio *dio, loff_t offset, loff_t end)
+{
+ struct inode *inode = dio->inode;
+ int ret;
+
+ if (dio->flags & DIO_LOCKING) {
+ /* watch out for a 0 len io from a tricksy fs */
+ if (dio->rw == READ && end > offset) {
+
+ /* will be released by do_blockdev_direct_IO */
+ mutex_lock(&inode->i_mutex);
+
+ ret = filemap_write_and_wait_range(inode->i_mapping,
+ offset, end - 1);
+ if (ret) {
+ mutex_unlock(&inode->i_mutex);
+ return ret;
+ }
+ }
+ }
+
+ /*
+ * Will be decremented at I/O completion time.
+ */
+ atomic_inc(&inode->i_dio_count);
+
+ return 0;
+}
+
/*
* This is a library function for use by filesystem drivers.
*
@@ -1225,31 +1254,12 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
if (!dio)
goto out;

- if (dio->flags & DIO_LOCKING) {
- if (rw == READ) {
- struct address_space *mapping =
- iocb->ki_filp->f_mapping;
-
- /* will be released by direct_io_worker */
- mutex_lock(&inode->i_mutex);
-
- retval = filemap_write_and_wait_range(mapping, offset,
- end - 1);
- if (retval) {
- mutex_unlock(&inode->i_mutex);
- kmem_cache_free(dio_cache, dio);
- goto out;
- }
- }
+ retval = dio_lock_and_flush(dio, offset, end);
+ if (retval) {
+ kmem_cache_free(dio_cache, dio);
+ goto out;
}

- /*
- * Will be decremented at I/O completion time.
- */
- atomic_inc(&inode->i_dio_count);
-
- retval = 0;
-
sdio_init(&sdio, inode, offset, blkbits, get_block, submit_io);

for (seg = 0; seg < nr_segs; seg++) {
--
1.7.9.2

2012-02-27 21:23:23

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 13/22] dio: add __blockdev_direct_IO_bdev()

From: Zach Brown <[email protected]>

Previous patches refactored __blockdev_direct_IO() to call helper
functions while iterating over the user's iovec. This adds
__blockdev_direct_IO_bvec(), which is the same except that it iterates
over the pages in a bio_vec instead of user addresses in an iovec.

The trick here is to initialize the dio state so that do_direct_IO()
consumes the pages we provide and never tries to map user pages. This
is done by making sure that final_block_in_request covers the page that
we set in the dio. do_direct_IO() will return before running out of
pages.

The caller is responsible for dirtying these pages, if needed. We add
an option to the dio struct that makes sure we only dirty pages when
we're operating on iovecs of user addresses.
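
A filesystem wires this up much like the iovec variant. A rough sketch,
with my_get_block standing in for the filesystem's get_block callback:

	static ssize_t my_direct_IO_bvec(int rw, struct kiocb *iocb,
					 struct bio_vec *bvec, loff_t offset,
					 unsigned long bvec_len)
	{
		struct inode *inode = iocb->ki_filp->f_mapping->host;

		return blockdev_direct_IO_bvec(rw, iocb, inode,
					       inode->i_sb->s_bdev, bvec,
					       offset, bvec_len,
					       my_get_block, NULL);
	}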

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/fs.h | 26 ++++++++++++++++
2 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 20bb84c..2fef85f 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -126,6 +126,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
int page_errors; /* errno from get_user_pages() */
int is_async; /* is IO async ? */
+ int should_dirty; /* should we mark read pages dirty? */
int io_error; /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
struct bio *bio_list; /* singly linked via bi_private */
@@ -420,7 +421,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);

- if (dio->is_async && dio->rw == READ)
+ if (dio->is_async && dio->rw == READ && dio->should_dirty)
bio_set_pages_dirty(bio);

if (sdio->submit_io)
@@ -491,13 +492,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
if (!uptodate)
dio->io_error = -EIO;

- if (dio->is_async && dio->rw == READ) {
+ if (dio->is_async && dio->rw == READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
for (page_no = 0; page_no < bio->bi_vcnt; page_no++) {
struct page *page = bvec[page_no].bv_page;

- if (dio->rw == READ && !PageCompound(page))
+ if (dio->rw == READ && !PageCompound(page) &&
+ dio->should_dirty)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1336,6 +1338,8 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
PAGE_SIZE - user_addr / PAGE_SIZE);
}

+ dio->should_dirty = 1;
+
for (seg = 0; seg < nr_segs; seg++) {
user_addr = (unsigned long)iov[seg].iov_base;
sdio.size += bytes = iov[seg].iov_len;
@@ -1400,6 +1404,84 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,

EXPORT_SYMBOL(__blockdev_direct_IO);

+ssize_t
+__blockdev_direct_IO_bvec(int rw, struct kiocb *iocb, struct inode *inode,
+ struct block_device *bdev, struct bio_vec *bvec, loff_t offset,
+ unsigned long bvec_len, get_block_t get_block,
+ dio_iodone_t end_io, dio_submit_t submit_io, int flags)
+{
+ unsigned blkbits = inode->i_blkbits;
+ ssize_t retval = -EINVAL;
+ loff_t end = offset;
+ struct dio *dio;
+ struct dio_submit sdio = { 0, };
+ unsigned long i;
+ struct buffer_head map_bh = { 0, };
+
+ if (rw & WRITE)
+ rw = WRITE_ODIRECT;
+
+ if (!dio_aligned(offset, &blkbits, bdev))
+ goto out;
+
+ /* Check the memory alignment. Blocks cannot straddle pages */
+ for (i = 0; i < bvec_len; i++) {
+ end += bvec[i].bv_len;
+ if (!dio_aligned(bvec[i].bv_len | bvec[i].bv_offset,
+ &blkbits, bdev))
+ goto out;
+ }
+
+ dio = dio_alloc_init(flags, rw, iocb, inode, end_io, end);
+ retval = -ENOMEM;
+ if (!dio)
+ goto out;
+
+ retval = dio_lock_and_flush(dio, offset, end);
+ if (retval) {
+ kmem_cache_free(dio_cache, dio);
+ goto out;
+ }
+
+ sdio_init(&sdio, inode, offset, blkbits, get_block, submit_io);
+
+ sdio.pages_in_io = bvec_len;
+
+ for (i = 0; i < bvec_len; i++) {
+ sdio.size += bvec[i].bv_len;
+
+ /* Index into the first page of the first block */
+ sdio.first_block_in_page = bvec[i].bv_offset >> blkbits;
+ sdio.final_block_in_request = sdio.block_in_file +
+ (bvec[i].bv_len >> blkbits);
+ /* Page fetching state */
+ sdio.curr_page = 0;
+ page_cache_get(bvec[i].bv_page);
+ dio->pages[0] = bvec[i].bv_page;
+ sdio.head = 0;
+ sdio.tail = 1;
+
+ sdio.total_pages = 1;
+ sdio.curr_user_address = 0;
+
+ retval = do_direct_IO(dio, &sdio, &map_bh);
+
+ dio->result += bvec[i].bv_len -
+ ((sdio.final_block_in_request - sdio.block_in_file) <<
+ blkbits);
+
+ if (retval) {
+ dio_cleanup(dio, &sdio);
+ break;
+ }
+ }
+
+ retval = dio_post_submission(rw, offset, dio, &sdio, &map_bh, retval);
+out:
+ return retval;
+}
+EXPORT_SYMBOL(__blockdev_direct_IO_bvec);
+
static __init int dio_init(void)
{
dio_cache = KMEM_CACHE(dio, SLAB_PANIC);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4750933..94f2d0a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -692,6 +692,8 @@ struct address_space_operations {
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
+ ssize_t (*direct_IO_bvec)(int, struct kiocb *, struct bio_vec *bvec,
+ loff_t offset, unsigned long bvec_len);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
void **, unsigned long *);
/*
@@ -2530,6 +2532,30 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
offset, nr_segs, get_block, NULL, NULL,
DIO_LOCKING | DIO_SKIP_HOLES);
}
+
+ssize_t __blockdev_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev, struct bio_vec *bvec,
+ loff_t offset, unsigned long bvec_len, get_block_t get_block,
+ dio_iodone_t end_io, dio_submit_t submit_io, int flags);
+
+static inline ssize_t blockdev_direct_IO_bvec(int rw, struct kiocb *iocb,
+ struct inode *inode, struct block_device *bdev, struct bio_vec *bvec,
+ loff_t offset, unsigned long bvec_len, get_block_t get_block,
+ dio_iodone_t end_io)
+{
+ return __blockdev_direct_IO_bvec(rw, iocb, inode, bdev, bvec, offset,
+ bvec_len, get_block, end_io, NULL,
+ DIO_LOCKING | DIO_SKIP_HOLES);
+}
+
+static inline ssize_t blockdev_direct_IO_bvec_no_locking(int rw,
+ struct kiocb *iocb, struct inode *inode, struct block_device *bdev,
+ struct bio_vec *bvec, loff_t offset, unsigned long bvec_len,
+ get_block_t get_block, dio_iodone_t end_io)
+{
+ return __blockdev_direct_IO_bvec(rw, iocb, inode, bdev, bvec, offset,
+ bvec_len, get_block, end_io, NULL, 0);
+}
#else
static inline void inode_dio_wait(struct inode *inode)
{
--
1.7.9.2

2012-02-27 21:23:27

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 06/22] iov_iter: add a shorten call

From: Zach Brown <[email protected]>

The generic direct write path wants to shorten its memory vector. It
does this when it finds that it has to perform a partial write due to
RLIMIT_FSIZE. .direct_IO() always performs IO on all of the referenced
memory because it doesn't have an argument to specify the length of the
IO.

We add an iov_iter operation for this so that the generic path can ask
to shorten the memory vector without having to know what kind it is.
We're happy to shorten the kernel copy of the iovec array, but we refuse
to shorten the bio_vec array and return an error in this case.
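
In the generic write path this boils down to something like the
following sketch (the surrounding code is hypothetical):

	/*
	 * generic_write_checks() may have reduced count; trim the
	 * memory vector to match before calling ->direct_IO().
	 */
	if (count != iov_iter_count(iter)) {
		ret = iov_iter_shorten(iter, count);
		if (ret)	/* -EINVAL for bio_vec-backed iters */
			goto out;
	}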

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/fs.h | 5 +++++
mm/iov-iter.c | 14 ++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3689d58..101afc1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -548,6 +548,7 @@ struct iov_iter_ops {
void (*ii_advance)(struct iov_iter *, size_t);
int (*ii_fault_in_readable)(struct iov_iter *, size_t);
size_t (*ii_single_seg_count)(struct iov_iter *);
+ int (*ii_shorten)(struct iov_iter *, size_t);
};

static inline size_t iov_iter_copy_to_user_atomic(struct page *page,
@@ -582,6 +583,10 @@ static inline size_t iov_iter_single_seg_count(struct iov_iter *i)
{
return i->ops->ii_single_seg_count(i);
}
+static inline int iov_iter_shorten(struct iov_iter *i, size_t count)
+{
+ return i->ops->ii_shorten(i, count);
+}

extern struct iov_iter_ops ii_bvec_ops;

diff --git a/mm/iov-iter.c b/mm/iov-iter.c
index 5b35f23..361e00f 100644
--- a/mm/iov-iter.c
+++ b/mm/iov-iter.c
@@ -197,6 +197,11 @@ size_t ii_bvec_single_seg_count(struct iov_iter *i)
return min(i->count, bvec->bv_len - i->iov_offset);
}

+static int ii_bvec_shorten(struct iov_iter *i, size_t count)
+{
+ return -EINVAL;
+}
+
struct iov_iter_ops ii_bvec_ops = {
.ii_copy_to_user_atomic = ii_bvec_copy_to_user_atomic,
.ii_copy_to_user = ii_bvec_copy_to_user,
@@ -205,6 +210,7 @@ struct iov_iter_ops ii_bvec_ops = {
.ii_advance = ii_bvec_advance,
.ii_fault_in_readable = ii_bvec_fault_in_readable,
.ii_single_seg_count = ii_bvec_single_seg_count,
+ .ii_shorten = ii_bvec_shorten,
};
EXPORT_SYMBOL(ii_bvec_ops);

@@ -351,6 +357,13 @@ size_t ii_iovec_single_seg_count(struct iov_iter *i)
return min(i->count, iov->iov_len - i->iov_offset);
}

+static int ii_iovec_shorten(struct iov_iter *i, size_t count)
+{
+ struct iovec *iov = (struct iovec *)i->data;
+ i->nr_segs = iov_shorten(iov, i->nr_segs, count);
+ return 0;
+}
+
struct iov_iter_ops ii_iovec_ops = {
.ii_copy_to_user_atomic = ii_iovec_copy_to_user_atomic,
.ii_copy_to_user = ii_iovec_copy_to_user,
@@ -359,5 +372,6 @@ struct iov_iter_ops ii_iovec_ops = {
.ii_advance = ii_iovec_advance,
.ii_fault_in_readable = ii_iovec_fault_in_readable,
.ii_single_seg_count = ii_iovec_single_seg_count,
+ .ii_shorten = ii_iovec_shorten,
};
EXPORT_SYMBOL(ii_iovec_ops);
--
1.7.9.2

2012-02-27 21:23:29

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 07/22] iov_iter: let callers extract iovecs and bio_vecs

From: Zach Brown <[email protected]>

direct IO treats memory from user iovecs and memory from arrays of
kernel pages very differently. User memory is pinned and worked with in
batches while kernel pages are always pinned and don't require
additional processing.

Rather than try to provide an abstraction that includes these
different behaviours we let direct IO extract the memory structs and
hand them to the existing code.
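
A direct IO path can then dispatch on the iter's backing store. A
sketch, with my_direct_IO()/my_direct_IO_bvec() as stand-ins for a
filesystem's existing entry points:

	if (iov_iter_has_iovec(iter))
		ret = my_direct_IO(rw, iocb, iov_iter_iovec(iter),
				   pos, iter->nr_segs);
	else if (iov_iter_has_bvec(iter))
		ret = my_direct_IO_bvec(rw, iocb, iov_iter_bvec(iter),
					pos, iter->nr_segs);
	else
		BUG();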

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/fs.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 101afc1..4750933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -604,6 +604,15 @@ static inline void iov_iter_init_bvec(struct iov_iter *i,

iov_iter_advance(i, written);
}
+static inline int iov_iter_has_bvec(struct iov_iter *i)
+{
+ return i->ops == &ii_bvec_ops;
+}
+static inline struct bio_vec *iov_iter_bvec(struct iov_iter *i)
+{
+ BUG_ON(!iov_iter_has_bvec(i));
+ return (struct bio_vec *)i->data;
+}

extern struct iov_iter_ops ii_iovec_ops;

@@ -619,6 +628,15 @@ static inline void iov_iter_init(struct iov_iter *i,

iov_iter_advance(i, written);
}
+static inline int iov_iter_has_iovec(struct iov_iter *i)
+{
+ return i->ops == &ii_iovec_ops;
+}
+static inline struct iovec *iov_iter_iovec(struct iov_iter *i)
+{
+ BUG_ON(!iov_iter_has_iovec(i));
+ return (struct iovec *)i->data;
+}

static inline size_t iov_iter_count(struct iov_iter *i)
{
--
1.7.9.2

2012-02-27 21:23:24

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 09/22] dio: add dio_alloc_init() helper function

From: Zach Brown <[email protected]>

This adds a helper function which allocates and initializes the dio
structure. We'll be calling this from another entry point like
__blockdev_direct_IO() in an upcoming patch.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 68 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 42 insertions(+), 26 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 21a1412..1fbb4ab 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1096,6 +1096,47 @@ static int dio_aligned(unsigned long offset, unsigned *blkbits,
return 1;
}

+static struct dio *dio_alloc_init(int flags, int rw, struct kiocb *iocb,
+ struct inode *inode, dio_iodone_t end_io,
+ loff_t end)
+{
+ struct dio *dio;
+
+ dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
+ if (!dio)
+ return NULL;
+
+ /*
+ * Believe it or not, zeroing out the page array caused a .5%
+ * performance regression in a database benchmark. So, we take
+ * care to only zero out what's needed.
+ */
+ memset(dio, 0, offsetof(struct dio, pages));
+
+ dio->flags = flags;
+ /*
+ * For file extending writes updating i_size before data
+ * writeouts complete can expose uninitialized blocks. So
+ * even for AIO, we need to wait for i/o to complete before
+ * returning in this case.
+ */
+ dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
+ (end > i_size_read(inode)));
+
+ dio->inode = inode;
+ dio->rw = rw;
+
+ dio->end_io = end_io;
+
+ dio->iocb = iocb;
+ dio->i_size = i_size_read(inode);
+
+ spin_lock_init(&dio->bio_lock);
+ dio->refcount = 1;
+
+ return dio;
+}
+
/*
* This is a library function for use by filesystem drivers.
*
@@ -1158,18 +1199,11 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
if (rw == READ && end == offset)
return 0;

- dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
+ dio = dio_alloc_init(flags, rw, iocb, inode, end_io, end);
retval = -ENOMEM;
if (!dio)
goto out;
- /*
- * Believe it or not, zeroing out the page array caused a .5%
- * performance regression in a database benchmark. So, we take
- * care to only zero out what's needed.
- */
- memset(dio, 0, offsetof(struct dio, pages));

- dio->flags = flags;
if (dio->flags & DIO_LOCKING) {
if (rw == READ) {
struct address_space *mapping =
@@ -1193,35 +1227,17 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
*/
atomic_inc(&inode->i_dio_count);

- /*
- * For file extending writes updating i_size before data
- * writeouts complete can expose uninitialized blocks. So
- * even for AIO, we need to wait for i/o to complete before
- * returning in this case.
- */
- dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
- (end > i_size_read(inode)));
-
retval = 0;

- dio->inode = inode;
- dio->rw = rw;
sdio.blkbits = blkbits;
sdio.blkfactor = inode->i_blkbits - blkbits;
sdio.block_in_file = offset >> blkbits;

sdio.get_block = get_block;
- dio->end_io = end_io;
sdio.submit_io = submit_io;
sdio.final_block_in_bio = -1;
sdio.next_block_for_io = -1;

- dio->iocb = iocb;
- dio->i_size = i_size_read(inode);
-
- spin_lock_init(&dio->bio_lock);
- dio->refcount = 1;
-
/*
* In case of non-aligned buffers, we may need 2 more
* pages since we need to zero out first and last block.
--
1.7.9.2

2012-02-27 21:23:21

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 08/22] dio: create a dio_aligned() helper function

From: Zach Brown <[email protected]>

__blockdev_direct_IO() had two instances of the same code to determine
if a given offset wasn't aligned first to the inode's blkbits and then
to the underlying device's blkbits. This was confusing enough but
we're about to add code that performs the same check on offsets in bvec
arrays. Rather than add yet more copies of this code let's have
everyone call a helper.
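
Concretely: with 4KB inode blocks over a 512-byte-sector device, an
offset of 512 fails the first check, blkbits drops to the device's 9,
and the retried check passes. Callers then look like:

	if (!dio_aligned(addr | size, &blkbits, bdev))
		goto out;	/* not aligned even to the device */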

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 59 +++++++++++++++++++++++++++++++++++---------------------
1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4a588db..21a1412 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1064,6 +1064,39 @@ static inline int drop_refcount(struct dio *dio)
}

/*
+ * Returns true if the given offset is aligned to either the IO size
+ * specified by the given blkbits or by the logical block size of the
+ * given block device.
+ *
+ * If the given offset isn't aligned to the blkbits arguments as this is
+ * called then blkbits is set to the block size of the specified block
+ * device. The call can then return either true or false.
+ *
+ * This bizarre calling convention matches the code paths that
+ * duplicated the functionality that this helper was built from. We
+ * reproduce the behaviour to avoid introducing subtle bugs.
+ */
+static int dio_aligned(unsigned long offset, unsigned *blkbits,
+ struct block_device *bdev)
+{
+ unsigned mask = (1 << *blkbits) - 1;
+
+ /*
+ * Avoid references to bdev if not absolutely needed to give
+ * the early prefetch in the caller enough time.
+ */
+
+ if (offset & mask) {
+ if (bdev)
+ *blkbits = blksize_bits(bdev_logical_block_size(bdev));
+ mask = (1 << *blkbits) - 1;
+ return !(offset & mask);
+ }
+
+ return 1;
+}
+
+/*
* This is a library function for use by filesystem drivers.
*
* The locking rules are governed by the flags parameter:
@@ -1098,7 +1131,6 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
size_t size;
unsigned long addr;
unsigned blkbits = inode->i_blkbits;
- unsigned blocksize_mask = (1 << blkbits) - 1;
ssize_t retval = -EINVAL;
loff_t end = offset;
struct dio *dio;
@@ -1110,33 +1142,16 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
if (rw & WRITE)
rw = WRITE_ODIRECT;

- /*
- * Avoid references to bdev if not absolutely needed to give
- * the early prefetch in the caller enough time.
- */
-
- if (offset & blocksize_mask) {
- if (bdev)
- blkbits = blksize_bits(bdev_logical_block_size(bdev));
- blocksize_mask = (1 << blkbits) - 1;
- if (offset & blocksize_mask)
- goto out;
- }
+ if (!dio_aligned(offset, &blkbits, bdev))
+ goto out;

/* Check the memory alignment. Blocks cannot straddle pages */
for (seg = 0; seg < nr_segs; seg++) {
addr = (unsigned long)iov[seg].iov_base;
size = iov[seg].iov_len;
end += size;
- if (unlikely((addr & blocksize_mask) ||
- (size & blocksize_mask))) {
- if (bdev)
- blkbits = blksize_bits(
- bdev_logical_block_size(bdev));
- blocksize_mask = (1 << blkbits) - 1;
- if ((addr & blocksize_mask) || (size & blocksize_mask))
- goto out;
- }
+ if (!dio_aligned(addr|size, &blkbits, bdev))
+ goto out;
}

/* watch out for a 0 len io from a tricksy fs */
--
1.7.9.2

2012-02-27 21:23:19

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 10/22] dio: add sdio_init() helper function

From: Zach Brown <[email protected]>

This adds a helper function which initializes the sdio structure.
We'll be calling this from another entry point in an upcoming patch.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 1fbb4ab..1efe4f1 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1137,6 +1137,27 @@ static struct dio *dio_alloc_init(int flags, int rw, struct kiocb *iocb,
return dio;
}

+static void sdio_init(struct dio_submit *sdio, struct inode *inode,
+ loff_t offset, unsigned blkbits, get_block_t get_block,
+ dio_submit_t *submit_io)
+{
+ sdio->blkbits = blkbits;
+ sdio->blkfactor = inode->i_blkbits - blkbits;
+ sdio->block_in_file = offset >> blkbits;
+
+ sdio->get_block = get_block;
+ sdio->submit_io = submit_io;
+ sdio->final_block_in_bio = -1;
+ sdio->next_block_for_io = -1;
+
+ /*
+ * In case of non-aligned buffers, we may need 2 more
+ * pages since we need to zero out first and last block.
+ */
+ if (unlikely(sdio->blkfactor))
+ sdio->pages_in_io = 2;
+}
+
/*
* This is a library function for use by filesystem drivers.
*
@@ -1229,21 +1250,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,

retval = 0;

- sdio.blkbits = blkbits;
- sdio.blkfactor = inode->i_blkbits - blkbits;
- sdio.block_in_file = offset >> blkbits;
-
- sdio.get_block = get_block;
- sdio.submit_io = submit_io;
- sdio.final_block_in_bio = -1;
- sdio.next_block_for_io = -1;
-
- /*
- * In case of non-aligned buffers, we may need 2 more
- * pages since we need to zero out first and last block.
- */
- if (unlikely(sdio.blkfactor))
- sdio.pages_in_io = 2;
+ sdio_init(&sdio, inode, offset, blkbits, get_block, submit_io);

for (seg = 0; seg < nr_segs; seg++) {
user_addr = (unsigned long)iov[seg].iov_base;
--
1.7.9.2

2012-02-27 21:23:17

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 15/22] aio: add aio_kernel_() interface

From: Zach Brown <[email protected]>

This adds an interface that lets kernel callers submit aio iocbs without
going through the user space syscalls. This lets kernel callers avoid
the management limits and overhead of the context. It will also let us
integrate aio operations with other kernel APIs that the user space
interface doesn't have access to.
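
A minimal sketch of a caller, assuming my_complete() is the caller's
completion callback and filp an already-opened struct file:

	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

	if (!iocb)
		return -ENOMEM;

	aio_kernel_init_rw(iocb, filp, IOCB_CMD_PWRITE, buf, len, pos);
	aio_kernel_init_callback(iocb, my_complete, cookie);

	/*
	 * On success my_complete() eventually runs with the result; on
	 * error the iocb has already been freed for us.
	 */
	ret = aio_kernel_submit(iocb);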

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 11 ++++++
2 files changed, 103 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 969beb0..2e2d358 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -997,6 +997,10 @@ int aio_complete(struct kiocb *iocb, long res, long res2)
iocb->ki_users = 0;
wake_up_process(iocb->ki_obj.tsk);
return 1;
+ } else if (is_kernel_kiocb(iocb)) {
+ iocb->ki_obj.complete(iocb->ki_user_data, res);
+ aio_kernel_free(iocb);
+ return 0;
}

info = &ctx->ring_info;
@@ -1594,6 +1598,94 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
return 0;
}

+/*
+ * This allocates an iocb that will be used to submit and track completion of
+ * an IO that is issued from kernel space.
+ *
+ * The caller is expected to call the appropriate aio_kernel_init_() functions
+ * and then call aio_kernel_submit(). From that point forward progress is
+ * guaranteed by the file system aio method. Eventually the caller's
+ * completion callback will be called.
+ *
+ * These iocbs are special. They don't have a context; we don't limit the
+ * number pending, they can't be canceled, and they can't be retried. In the
+ * short term, callers must take care to avoid operations which might retry,
+ * by calling only new ops which never add retry support. In the long term,
+ * retry-based AIO should be removed.
+ */
+struct kiocb *aio_kernel_alloc(gfp_t gfp)
+{
+ struct kiocb *iocb = kzalloc(sizeof(struct kiocb), gfp);
+ if (iocb)
+ iocb->ki_key = KIOCB_KERNEL_KEY;
+ return iocb;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_alloc);
+
+void aio_kernel_free(struct kiocb *iocb)
+{
+ kfree(iocb);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_free);
+
+/*
+ * ptr and count can be a buff and bytes or an iov and segs.
+ */
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off)
+{
+ iocb->ki_filp = filp;
+ iocb->ki_opcode = op;
+ iocb->ki_buf = (char __user *)(unsigned long)ptr;
+ iocb->ki_left = nr;
+ iocb->ki_nbytes = nr;
+ iocb->ki_pos = off;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
+
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data)
+{
+ iocb->ki_obj.complete = complete;
+ iocb->ki_user_data = user_data;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_callback);
+
+/*
+ * The iocb is our responsibility once this is called. The caller must not
+ * reference it. This comes from aio_setup_iocb() modifying the iocb.
+ *
+ * Callers must be prepared for their iocb completion callback to be called the
+ * moment they enter this function. The completion callback may be called from
+ * any context.
+ *
+ * Returns: 0: the iocb completion callback will be called with the op result
+ * negative errno: the operation was not submitted and the iocb was freed
+ */
+int aio_kernel_submit(struct kiocb *iocb)
+{
+ int ret;
+
+ BUG_ON(!is_kernel_kiocb(iocb));
+ BUG_ON(!iocb->ki_obj.complete);
+ BUG_ON(!iocb->ki_filp);
+
+ ret = aio_setup_iocb(iocb, 0);
+ if (ret) {
+ aio_kernel_free(iocb);
+ return ret;
+ }
+
+ ret = iocb->ki_retry(iocb);
+ BUG_ON(ret == -EIOCBRETRY);
+ if (ret != -EIOCBQUEUED)
+ aio_complete(iocb, ret, 0);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_submit);
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, struct kiocb_batch *batch,
bool compat)
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 2314ad8..96e8e69 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -24,6 +24,7 @@ struct kioctx;
#define KIOCB_C_COMPLETE 0x02

#define KIOCB_SYNC_KEY (~0U)
+#define KIOCB_KERNEL_KEY (~1U)

/* ki_flags bits */
/*
@@ -99,6 +100,7 @@ struct kiocb {
union {
void __user *user;
struct task_struct *tsk;
+ void (*complete)(u64 user_data, long res);
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
@@ -127,6 +129,7 @@ struct kiocb {
};

#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
+#define is_kernel_kiocb(iocb) ((iocb)->ki_key == KIOCB_KERNEL_KEY)
#define init_sync_kiocb(x, filp) \
do { \
struct task_struct *tsk = current; \
@@ -215,6 +218,14 @@ struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user *__user *iocbpp, bool compat);
+struct kiocb *aio_kernel_alloc(gfp_t gfp);
+void aio_kernel_free(struct kiocb *iocb);
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off);
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data);
+int aio_kernel_submit(struct kiocb *iocb);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline int aio_put_req(struct kiocb *iocb) { return 0; }
--
1.7.9.2

2012-02-27 21:23:16

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 16/22] aio: add aio support for iov_iter arguments

From: Zach Brown <[email protected]>

This adds iocb cmds which specify that memory is held in iov_iter
structures. This lets kernel callers specify memory that can be
expressed in an iov_iter, which includes pages in bio_vec arrays.

Only kernel callers can provide an iov_iter so it doesn't make a lot of
sense to expose the IOCB_CMD values for this as part of the user space
ABI.

But kernel callers should also be able to perform the usual aio
operations, which suggests using the existing operation namespace and
support code.
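
A rough sketch of how a kernel caller might drive the new commands
(iov_iter_init_bvec comes from the bvec-support patch in this series;
bvec, nr_segs, len, pos, my_done and cookie are placeholders):

	struct iov_iter iter;

	iov_iter_init_bvec(&iter, bvec, nr_segs, len, 0);
	aio_kernel_init_iter(iocb, filp, IOCB_CMD_READ_ITER, &iter, pos);
	aio_kernel_init_callback(iocb, my_done, cookie);
	err = aio_kernel_submit(iocb);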

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 3 +++
include/linux/aio_abi.h | 2 ++
3 files changed, 69 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 2e2d358..1f9282a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1502,6 +1502,26 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
return 0;
}

+static ssize_t aio_read_iter(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->read_iter)
+ ret = file->f_op->read_iter(iocb, iocb->ki_iter, iocb->ki_pos);
+ return ret;
+}
+
+static ssize_t aio_write_iter(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret = -EINVAL;
+
+ if (file->f_op->write_iter)
+ ret = file->f_op->write_iter(iocb, iocb->ki_iter, iocb->ki_pos);
+ return ret;
+}
+
/*
* aio_setup_iocb:
* Performs the initial checks and aio retry method
@@ -1577,6 +1597,34 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
if (file->f_op->aio_write)
kiocb->ki_retry = aio_rw_vect_retry;
break;
+ case IOCB_CMD_READ_ITER:
+ ret = -EINVAL;
+ if (unlikely(!is_kernel_kiocb(kiocb)))
+ break;
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ break;
+ ret = security_file_permission(file, MAY_READ);
+ if (unlikely(ret))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->read_iter)
+ kiocb->ki_retry = aio_read_iter;
+ break;
+ case IOCB_CMD_WRITE_ITER:
+ ret = -EINVAL;
+ if (unlikely(!is_kernel_kiocb(kiocb)))
+ break;
+ ret = -EBADF;
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ break;
+ ret = security_file_permission(file, MAY_WRITE);
+ if (unlikely(ret))
+ break;
+ ret = -EINVAL;
+ if (file->f_op->write_iter)
+ kiocb->ki_retry = aio_write_iter;
+ break;
case IOCB_CMD_FDSYNC:
ret = -EINVAL;
if (file->f_op->aio_fsync)
@@ -1643,6 +1691,22 @@ void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
}
EXPORT_SYMBOL_GPL(aio_kernel_init_rw);

+/*
+ * The iter count must be set before calling here. Some filesystems use
+ * iocb->ki_left as an indicator of the size of an IO.
+ */
+void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
+ unsigned short op, struct iov_iter *iter, loff_t off)
+{
+ iocb->ki_filp = filp;
+ iocb->ki_iter = iter;
+ iocb->ki_opcode = op;
+ iocb->ki_pos = off;
+ iocb->ki_nbytes = iov_iter_count(iter);
+ iocb->ki_left = iocb->ki_nbytes;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_iter);
+
void aio_kernel_init_callback(struct kiocb *iocb,
void (*complete)(u64 user_data, long res),
u64 user_data)
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 96e8e69..a32d57f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -126,6 +126,7 @@ struct kiocb {
* this is the underlying eventfd context to deliver events to.
*/
struct eventfd_ctx *ki_eventfd;
+ struct iov_iter *ki_iter;
};

#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
@@ -222,6 +223,8 @@ struct kiocb *aio_kernel_alloc(gfp_t gfp);
void aio_kernel_free(struct kiocb *iocb);
void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
unsigned short op, void *ptr, size_t nr, loff_t off);
+void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
+ unsigned short op, struct iov_iter *iter, loff_t off);
void aio_kernel_init_callback(struct kiocb *iocb,
void (*complete)(u64 user_data, long res),
u64 user_data);
diff --git a/include/linux/aio_abi.h b/include/linux/aio_abi.h
index 2c87316..2c97a2d 100644
--- a/include/linux/aio_abi.h
+++ b/include/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_READ_ITER = 9,
+ IOCB_CMD_WRITE_ITER = 10,
};

/*
--
1.7.9.2

2012-02-27 21:26:04

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 04/22] iov_iter: hide iovec details behind ops function pointers

From: Zach Brown <[email protected]>

This moves the current iov_iter functions behind an ops struct of
function pointers. The current iov_iter functions all work with memory
which is specified by iovec arrays of user space pointers.

This patch is part of a series that lets us specify memory with bio_vec
arrays of page pointers. By moving to an iov_iter operation struct we
can add that support in later patches in this series by adding another
set of function pointers.

I only came to this after having initially tried to teach the current
iov_iter functions about bio_vecs by introducing conditional branches
that dealt with bio_vecs in all the functions. It wasn't pretty. This
approach seems to be the lesser evil.
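
Callers are untouched by this change; only the dispatch moves. For
example:

	/* caller code looks the same before and after this patch: */
	copied = iov_iter_copy_from_user_atomic(page, &i, offset, bytes);
	/* ...but the inline wrapper now resolves it through the ops table: */
	copied = i.ops->ii_copy_from_user_atomic(page, &i, offset, bytes);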

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/fs.h | 65 ++++++++++++++++++++++++++++++++++++++++-----------
mm/iov-iter.c | 66 ++++++++++++++++++++++++++++++----------------------
2 files changed, 90 insertions(+), 41 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c66aa4b..1a64eda 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -529,29 +529,68 @@ struct address_space;
struct writeback_control;

struct iov_iter {
- const struct iovec *iov;
+ struct iov_iter_ops *ops;
+ unsigned long data;
unsigned long nr_segs;
size_t iov_offset;
size_t count;
};

-size_t iov_iter_copy_to_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_to_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_from_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_from_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
-size_t iov_iter_single_seg_count(struct iov_iter *i);
+struct iov_iter_ops {
+ size_t (*ii_copy_to_user_atomic)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ size_t (*ii_copy_to_user)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ size_t (*ii_copy_from_user_atomic)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ size_t (*ii_copy_from_user)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ void (*ii_advance)(struct iov_iter *, size_t);
+ int (*ii_fault_in_readable)(struct iov_iter *, size_t);
+ size_t (*ii_single_seg_count)(struct iov_iter *);
+};
+
+static inline size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_to_user_atomic(page, i, offset, bytes);
+}
+static inline size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_to_user(page, i, offset, bytes);
+}
+static inline size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_from_user_atomic(page, i, offset, bytes);
+}
+static inline size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_from_user(page, i, offset, bytes);
+}
+static inline void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+ return i->ops->ii_advance(i, bytes);
+}
+static inline int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ return i->ops->ii_fault_in_readable(i, bytes);
+}
+static inline size_t iov_iter_single_seg_count(struct iov_iter *i)
+{
+ return i->ops->ii_single_seg_count(i);
+}
+
+extern struct iov_iter_ops ii_iovec_ops;

static inline void iov_iter_init(struct iov_iter *i,
const struct iovec *iov, unsigned long nr_segs,
size_t count, size_t written)
{
- i->iov = iov;
+ i->ops = &ii_iovec_ops;
+ i->data = (unsigned long)iov;
i->nr_segs = nr_segs;
i->iov_offset = 0;
i->count = count + written;
diff --git a/mm/iov-iter.c b/mm/iov-iter.c
index eea21ea..83f0db7 100644
--- a/mm/iov-iter.c
+++ b/mm/iov-iter.c
@@ -33,9 +33,10 @@ static size_t __iovec_copy_to_user_inatomic(char *vaddr,
* were successfully copied. If a fault is encountered then return the number of
* bytes which were copied.
*/
-size_t iov_iter_copy_to_user_atomic(struct page *page,
+size_t ii_iovec_copy_to_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

@@ -43,45 +44,44 @@ size_t iov_iter_copy_to_user_atomic(struct page *page,
kaddr = kmap_atomic(page, KM_USER0);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_to_user_inatomic(buf, kaddr + offset, bytes);
copied = bytes - left;
} else {
copied = __iovec_copy_to_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ iov, i->iov_offset, bytes);
}
kunmap_atomic(kaddr, KM_USER0);

return copied;
}
-EXPORT_SYMBOL(iov_iter_copy_to_user_atomic);

/*
* This has the same sideeffects and return value as
- * iov_iter_copy_to_user_atomic().
+ * ii_iovec_copy_to_user_atomic().
* The difference is that it attempts to resolve faults.
* Page must not be locked.
*/
-size_t iov_iter_copy_to_user(struct page *page,
+size_t ii_iovec_copy_to_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

kaddr = kmap(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = copy_to_user(buf, kaddr + offset, bytes);
copied = bytes - left;
} else {
copied = __iovec_copy_to_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ iov, i->iov_offset, bytes);
}
kunmap(page);
return copied;
}
-EXPORT_SYMBOL(iov_iter_copy_to_user);


static size_t __iovec_copy_from_user_inatomic(char *vaddr,
@@ -111,9 +111,10 @@ static size_t __iovec_copy_from_user_inatomic(char *vaddr,
* were successfully copied. If a fault is encountered then return the number
* of bytes which were copied.
*/
-size_t iov_iter_copy_from_user_atomic(struct page *page,
+size_t ii_iovec_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

@@ -121,12 +122,12 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
kaddr = kmap_atomic(page, KM_USER0);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ iov, i->iov_offset, bytes);
}
kunmap_atomic(kaddr, KM_USER0);

@@ -136,32 +137,32 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);

/*
* This has the same sideeffects and return value as
- * iov_iter_copy_from_user_atomic().
+ * ii_iovec_copy_from_user_atomic().
* The difference is that it attempts to resolve faults.
* Page must not be locked.
*/
-size_t iov_iter_copy_from_user(struct page *page,
+size_t ii_iovec_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

kaddr = kmap(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_from_user(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ iov, i->iov_offset, bytes);
}
kunmap(page);
return copied;
}
-EXPORT_SYMBOL(iov_iter_copy_from_user);

-void iov_iter_advance(struct iov_iter *i, size_t bytes)
+void ii_iovec_advance(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);

@@ -169,7 +170,7 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
i->iov_offset += bytes;
i->count -= bytes;
} else {
- const struct iovec *iov = i->iov;
+ struct iovec *iov = (struct iovec *)i->data;
size_t base = i->iov_offset;
unsigned long nr_segs = i->nr_segs;

@@ -191,12 +192,11 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
base = 0;
}
}
- i->iov = iov;
+ i->data = (unsigned long)iov;
i->iov_offset = base;
i->nr_segs = nr_segs;
}
}
-EXPORT_SYMBOL(iov_iter_advance);

/*
* Fault in the first iovec of the given iov_iter, to a maximum length
@@ -207,23 +207,33 @@ EXPORT_SYMBOL(iov_iter_advance);
* would be possible (callers must not rely on the fact that _only_ the
* first iovec will be faulted with the current implementation).
*/
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+int ii_iovec_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
+ struct iovec *iov = (struct iovec *)i->data;
+ char __user *buf = iov->iov_base + i->iov_offset;
+ bytes = min(bytes, iov->iov_len - i->iov_offset);
return fault_in_pages_readable(buf, bytes);
}
-EXPORT_SYMBOL(iov_iter_fault_in_readable);

/*
* Return the count of just the current iov_iter segment.
*/
-size_t iov_iter_single_seg_count(struct iov_iter *i)
+size_t ii_iovec_single_seg_count(struct iov_iter *i)
{
- const struct iovec *iov = i->iov;
+ struct iovec *iov = (struct iovec *)i->data;
if (i->nr_segs == 1)
return i->count;
else
return min(i->count, iov->iov_len - i->iov_offset);
}
-EXPORT_SYMBOL(iov_iter_single_seg_count);
+
+struct iov_iter_ops ii_iovec_ops = {
+ .ii_copy_to_user_atomic = ii_iovec_copy_to_user_atomic,
+ .ii_copy_to_user = ii_iovec_copy_to_user,
+ .ii_copy_from_user_atomic = ii_iovec_copy_from_user_atomic,
+ .ii_copy_from_user = ii_iovec_copy_from_user,
+ .ii_advance = ii_iovec_advance,
+ .ii_fault_in_readable = ii_iovec_fault_in_readable,
+ .ii_single_seg_count = ii_iovec_single_seg_count,
+};
+EXPORT_SYMBOL(ii_iovec_ops);
--
1.7.9.2

2012-02-27 21:26:24

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 01/22] iov_iter: move into its own file

From: Zach Brown <[email protected]>

This moves the iov_iter functions into their own file. We're going to
be working on them in upcoming patches. They are sufficiently large,
and self-contained enough, to justify separating them from the rest of
the huge mm/filemap.c.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
mm/Makefile | 2 +-
mm/filemap.c | 144 ------------------------------------------------------
mm/iov-iter.c | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 152 insertions(+), 145 deletions(-)
create mode 100644 mm/iov-iter.c

diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..652f053 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o mmu_context.o percpu.o \
- $(mmu-y)
+ iov-iter.o $(mmu-y)
obj-y += init-mm.o

ifdef CONFIG_NO_BOOTMEM
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..0533a71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2011,150 +2011,6 @@ int file_remove_suid(struct file *file)
}
EXPORT_SYMBOL(file_remove_suid);

-static size_t __iovec_copy_from_user_inatomic(char *vaddr,
- const struct iovec *iov, size_t base, size_t bytes)
-{
- size_t copied = 0, left = 0;
-
- while (bytes) {
- char __user *buf = iov->iov_base + base;
- int copy = min(bytes, iov->iov_len - base);
-
- base = 0;
- left = __copy_from_user_inatomic(vaddr, buf, copy);
- copied += copy;
- bytes -= copy;
- vaddr += copy;
- iov++;
-
- if (unlikely(left))
- break;
- }
- return copied - left;
-}
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were successfully copied. If a fault is encountered then return the number of
- * bytes which were copied.
- */
-size_t iov_iter_copy_from_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- BUG_ON(!in_atomic());
- kaddr = kmap_atomic(page, KM_USER0);
- if (likely(i->nr_segs == 1)) {
- int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
- left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
- }
- kunmap_atomic(kaddr, KM_USER0);
-
- return copied;
-}
-EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
-
-/*
- * This has the same sideeffects and return value as
- * iov_iter_copy_from_user_atomic().
- * The difference is that it attempts to resolve faults.
- * Page must not be locked.
- */
-size_t iov_iter_copy_from_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- kaddr = kmap(page);
- if (likely(i->nr_segs == 1)) {
- int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
- left = __copy_from_user(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
- }
- kunmap(page);
- return copied;
-}
-EXPORT_SYMBOL(iov_iter_copy_from_user);
-
-void iov_iter_advance(struct iov_iter *i, size_t bytes)
-{
- BUG_ON(i->count < bytes);
-
- if (likely(i->nr_segs == 1)) {
- i->iov_offset += bytes;
- i->count -= bytes;
- } else {
- const struct iovec *iov = i->iov;
- size_t base = i->iov_offset;
- unsigned long nr_segs = i->nr_segs;
-
- /*
- * The !iov->iov_len check ensures we skip over unlikely
- * zero-length segments (without overruning the iovec).
- */
- while (bytes || unlikely(i->count && !iov->iov_len)) {
- int copy;
-
- copy = min(bytes, iov->iov_len - base);
- BUG_ON(!i->count || i->count < copy);
- i->count -= copy;
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- nr_segs--;
- base = 0;
- }
- }
- i->iov = iov;
- i->iov_offset = base;
- i->nr_segs = nr_segs;
- }
-}
-EXPORT_SYMBOL(iov_iter_advance);
-
-/*
- * Fault in the first iovec of the given iov_iter, to a maximum length
- * of bytes. Returns 0 on success, or non-zero if the memory could not be
- * accessed (ie. because it is an invalid address).
- *
- * writev-intensive code may want this to prefault several iovecs -- that
- * would be possible (callers must not rely on the fact that _only_ the
- * first iovec will be faulted with the current implementation).
- */
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
-{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
- return fault_in_pages_readable(buf, bytes);
-}
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
-
-/*
- * Return the count of just the current iov_iter segment.
- */
-size_t iov_iter_single_seg_count(struct iov_iter *i)
-{
- const struct iovec *iov = i->iov;
- if (i->nr_segs == 1)
- return i->count;
- else
- return min(i->count, iov->iov_len - i->iov_offset);
-}
-EXPORT_SYMBOL(iov_iter_single_seg_count);
-
/*
* Performs necessary checks before doing a write
*
diff --git a/mm/iov-iter.c b/mm/iov-iter.c
new file mode 100644
index 0000000..596fcf0
--- /dev/null
+++ b/mm/iov-iter.c
@@ -0,0 +1,151 @@
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/uio.h>
+#include <linux/hardirq.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
+ const struct iovec *iov, size_t base, size_t bytes)
+{
+ size_t copied = 0, left = 0;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + base;
+ int copy = min(bytes, iov->iov_len - base);
+
+ base = 0;
+ left = __copy_from_user_inatomic(vaddr, buf, copy);
+ copied += copy;
+ bytes -= copy;
+ vaddr += copy;
+ iov++;
+
+ if (unlikely(left))
+ break;
+ }
+ return copied - left;
+}
+
+/*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied. If a fault is encountered then return the number
+ * of bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ BUG_ON(!in_atomic());
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap_atomic(kaddr, KM_USER0);
+
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap(page);
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_from_user);
+
+void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+ BUG_ON(i->count < bytes);
+
+ if (likely(i->nr_segs == 1)) {
+ i->iov_offset += bytes;
+ i->count -= bytes;
+ } else {
+ const struct iovec *iov = i->iov;
+ size_t base = i->iov_offset;
+ unsigned long nr_segs = i->nr_segs;
+
+ /*
+ * The !iov->iov_len check ensures we skip over unlikely
+ * zero-length segments (without overruning the iovec).
+ */
+ while (bytes || unlikely(i->count && !iov->iov_len)) {
+ int copy;
+
+ copy = min(bytes, iov->iov_len - base);
+ BUG_ON(!i->count || i->count < copy);
+ i->count -= copy;
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ nr_segs--;
+ base = 0;
+ }
+ }
+ i->iov = iov;
+ i->iov_offset = base;
+ i->nr_segs = nr_segs;
+ }
+}
+EXPORT_SYMBOL(iov_iter_advance);
+
+/*
+ * Fault in the first iovec of the given iov_iter, to a maximum length
+ * of bytes. Returns 0 on success, or non-zero if the memory could not be
+ * accessed (ie. because it is an invalid address).
+ *
+ * writev-intensive code may want this to prefault several iovecs -- that
+ * would be possible (callers must not rely on the fact that _only_ the
+ * first iovec will be faulted with the current implementation).
+ */
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ bytes = min(bytes, i->iov->iov_len - i->iov_offset);
+ return fault_in_pages_readable(buf, bytes);
+}
+EXPORT_SYMBOL(iov_iter_fault_in_readable);
+
+/*
+ * Return the count of just the current iov_iter segment.
+ */
+size_t iov_iter_single_seg_count(struct iov_iter *i)
+{
+ const struct iovec *iov = i->iov;
+ if (i->nr_segs == 1)
+ return i->count;
+ else
+ return min(i->count, iov->iov_len - i->iov_offset);
+}
+EXPORT_SYMBOL(iov_iter_single_seg_count);
--
1.7.9.2

2012-02-27 21:26:42

by Dave Kleikamp

[permalink] [raw]
Subject: [RFC PATCH 05/22] iov_iter: add bvec support

From: Zach Brown <[email protected]>

This adds a set of iov_iter_ops calls which work with memory specified
by an array of bio_vec structs instead of an array of iovec structs.

The big difference is that the pages referenced by the bio_vec elements
are pinned. They don't need to be faulted in and we can always use
kmap_atomic() to map them one at a time.
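
For example, wrapping a single page (a sketch; the page is assumed to
already be pinned, as it would be when it comes from a bio):

	struct bio_vec bv = {
		.bv_page	= page,
		.bv_len		= PAGE_SIZE,
		.bv_offset	= 0,
	};
	struct iov_iter iter;

	iov_iter_init_bvec(&iter, &bv, 1, PAGE_SIZE, 0);
	/* iter can now go anywhere a user-backed iov_iter could */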

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/fs.h | 17 +++++++
mm/iov-iter.c | 124 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 141 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a64eda..3689d58 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -583,6 +583,23 @@ static inline size_t iov_iter_single_seg_count(struct iov_iter *i)
return i->ops->ii_single_seg_count(i);
}

+extern struct iov_iter_ops ii_bvec_ops;
+
+struct bio_vec;
+static inline void iov_iter_init_bvec(struct iov_iter *i,
+ struct bio_vec *bvec,
+ unsigned long nr_segs,
+ size_t count, size_t written)
+{
+ i->ops = &ii_bvec_ops;
+ i->data = (unsigned long)bvec;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count + written;
+
+ iov_iter_advance(i, written);
+}
+
extern struct iov_iter_ops ii_iovec_ops;

static inline void iov_iter_init(struct iov_iter *i,
diff --git a/mm/iov-iter.c b/mm/iov-iter.c
index 83f0db7..5b35f23 100644
--- a/mm/iov-iter.c
+++ b/mm/iov-iter.c
@@ -5,6 +5,7 @@
#include <linux/hardirq.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/bio.h>

static size_t __iovec_copy_to_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
@@ -83,6 +84,129 @@ size_t ii_iovec_copy_to_user(struct page *page,
return copied;
}

+/*
+ * As an easily verifiable first pass, we implement all the methods that
+ * copy data to and from bvec pages with one function. We implement it
+ * all with kmap_atomic().
+ */
+static size_t bvec_copy_tofrom_page(struct iov_iter *iter, struct page *page,
+ unsigned long page_offset, size_t bytes,
+ int topage)
+{
+ struct bio_vec *bvec = (struct bio_vec *)iter->data;
+ size_t bvec_offset = iter->iov_offset;
+ size_t remaining = bytes;
+ void *bvec_map;
+ void *page_map;
+ size_t copy;
+
+ page_map = kmap_atomic(page, KM_USER0);
+
+ BUG_ON(bytes > iter->count);
+ while (remaining) {
+ BUG_ON(bvec->bv_len == 0);
+ BUG_ON(bvec_offset >= bvec->bv_len);
+ copy = min(remaining, bvec->bv_len - bvec_offset);
+ bvec_map = kmap_atomic(bvec->bv_page, KM_USER1);
+ if (topage)
+ memcpy(page_map + page_offset,
+ bvec_map + bvec->bv_offset + bvec_offset,
+ copy);
+ else
+ memcpy(bvec_map + bvec->bv_offset + bvec_offset,
+ page_map + page_offset,
+ copy);
+ kunmap_atomic(bvec_map, KM_USER1);
+ remaining -= copy;
+ bvec_offset += copy;
+ page_offset += copy;
+ if (bvec_offset == bvec->bv_len) {
+ bvec_offset = 0;
+ bvec++;
+ }
+ }
+
+ kunmap_atomic(page_map, KM_USER0);
+
+ return bytes;
+}
+
+size_t ii_bvec_copy_to_user_atomic(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 0);
+}
+size_t ii_bvec_copy_to_user(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 0);
+}
+size_t ii_bvec_copy_from_user_atomic(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 1);
+}
+size_t ii_bvec_copy_from_user(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 1);
+}
+
+/*
+ * bio_vecs have a stricter structure than iovecs that might have
+ * come from userspace. There are no zero length bio_vec elements.
+ */
+void ii_bvec_advance(struct iov_iter *i, size_t bytes)
+{
+ struct bio_vec *bvec = (struct bio_vec *)i->data;
+ size_t offset = i->iov_offset;
+ size_t delta;
+
+ BUG_ON(i->count < bytes);
+ while (bytes) {
+ BUG_ON(bvec->bv_len == 0);
+ BUG_ON(bvec->bv_len <= offset);
+ delta = min(bytes, bvec->bv_len - offset);
+ offset += delta;
+ i->count -= delta;
+ bytes -= delta;
+ if (offset == bvec->bv_len) {
+ bvec++;
+ offset = 0;
+ }
+ }
+
+ i->data = (unsigned long)bvec;
+ i->iov_offset = offset;
+}
+
+/*
+ * pages pointed to by bio_vecs are always pinned.
+ */
+int ii_bvec_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ return 0;
+}
+
+size_t ii_bvec_single_seg_count(struct iov_iter *i)
+{
+ const struct bio_vec *bvec = (struct bio_vec *)i->data;
+ if (i->nr_segs == 1)
+ return i->count;
+ else
+ return min(i->count, bvec->bv_len - i->iov_offset);
+}
+
+struct iov_iter_ops ii_bvec_ops = {
+ .ii_copy_to_user_atomic = ii_bvec_copy_to_user_atomic,
+ .ii_copy_to_user = ii_bvec_copy_to_user,
+ .ii_copy_from_user_atomic = ii_bvec_copy_from_user_atomic,
+ .ii_copy_from_user = ii_bvec_copy_from_user,
+ .ii_advance = ii_bvec_advance,
+ .ii_fault_in_readable = ii_bvec_fault_in_readable,
+ .ii_single_seg_count = ii_bvec_single_seg_count,
+};
+EXPORT_SYMBOL(ii_bvec_ops);

static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
--
1.7.9.2

2012-02-27 22:08:54

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [RFC PATCH 22/22] nfs: add support for read_iter, write_iter

On Mon, 2012-02-27 at 15:19 -0600, Dave Kleikamp wrote:
> This patch implements the read_iter and write_iter file operations which
> allow kernel code to initiate directIO. This allows the loop device to
> read and write directly to the server, bypassing the page cache.
>
> Signed-off-by: Dave Kleikamp <[email protected]>
> Cc: Zach Brown <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: [email protected]

Performance is going to be absolutely terrible for O_DIRECT bvecs if you
send just one page per RPC call. We are working on merging the O_DIRECT
and page cache code in order to give O_DIRECT the ability to coalesce
requests and do pNFS, and I'm hoping that code will be available soon.

In the meantime, wouldn't it be possible to add basic coalescing to
nfs_direct_read_schedule_bvec/nfs_direct_write_schedule_bvec more or
less in the same way that we do for multi-page iovec segments?
i.e. if the next bvec is contiguous with the previous, and the resulting
RPC read length < rsize / write length < wsize, then add it to the same
RPC call.
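
Roughly, as a sketch (bv_contiguous(), rpc_len and nfs_send_read_rpc()
are invented names, not code from the patch):

	size_t rpc_len = 0;

	for (i = 0; i < nr_segs; i++) {
		struct bio_vec *bv = &bvec[i];

		if (rpc_len && bv_contiguous(bv - 1, bv) &&
		    rpc_len + bv->bv_len <= rsize) {
			rpc_len += bv->bv_len;	/* fold into current READ */
			continue;
		}
		if (rpc_len)
			nfs_send_read_rpc();	/* flush the current RPC */
		rpc_len = bv->bv_len;		/* start a new one */
	}
	if (rpc_len)
		nfs_send_read_rpc();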

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2012-02-27 22:19:34

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 16/22] aio: add aio support for iov_iter arguments


> Only kernel callers can provide an iov_iter so it doesn't make a lot of
> sense to expose the IOCB_CMD values for this as part of the user space
> ABI.
>
> But kernel callers should also be able to perform the usual aio
> operations, which suggests using the existing operation namespace and
> support code.

> --- a/include/linux/aio_abi.h
> +++ b/include/linux/aio_abi.h
> @@ -44,6 +44,8 @@ enum {
> IOCB_CMD_NOOP = 6,
> IOCB_CMD_PREADV = 7,
> IOCB_CMD_PWRITEV = 8,
> + IOCB_CMD_READ_ITER = 9,
> + IOCB_CMD_WRITE_ITER = 10,
> };

Bleh, yeah, I was never very satisfied with this. It still feels pretty
gross to be using _CMD_ definitions for these in-kernel iocbs. We'll
need to be verifying that these don't come from userspace iocbs forever
more.

I wonder if we can come up with something that feels less clumsy.

- z

2012-02-27 22:24:35

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 13/22] dio: add __blockdev_direct_IO_bdev()


> int is_async; /* is IO async ? */
> + int should_dirty; /* should we mark read pages dirty? */

This seems wasteful. Maybe it's time for some bit flags?
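
Something like this, say (invented names, just to illustrate):

	unsigned int dio_flags;		/* replaces the per-field ints */
#define DIO_IS_ASYNC		0x01
#define DIO_SHOULD_DIRTY	0x02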

- z

2012-02-27 22:27:57

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages

On 02/27/2012 04:19 PM, Dave Kleikamp wrote:
> This patchset was begun by Zach Brown and was originally submitted for
> review in October, 2009.

Man, it's been a while. I remembered almost none of the details of this
work when I read your introductory message. As I read the patches it
all came flooding back, though. Yikes :).

My biggest fear about this patch series is the sheer amount of very
fiddly code motion. I remember spending a lot of time verifying that
the patches didn't accidentally lose new changes as the patches were
ported to newer kernels.

Has someone gone through these most recent patches with an absurdly
fine-toothed comb? The patches that touch fs/direct-io.c and mm/filemap.c
are the most risky, by far.

> This series was written to reduce the current overhead loop imposes by
> performing synchronous buffered file system IO from a kernel thread. These
> patches turn loop into a lightweight layer that translates bios into iocbs.

It'd be worth including some simple benchmark results, I think. I
remember testing with concurrent O_DIRECT aio with fio on a loopback
device on top of files in each underlying file system.

- z

2012-02-27 22:34:25

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 18/22] ext3: add support for .read_iter and .write_iter


> drivers/block/loop.c | 55 ++++++++++++++++++-
> fs/ext3/file.c | 2 +
> fs/ext3/inode.c | 149 ++++++++++++++++++++++++++++++++++----------------
> include/linux/loop.h | 1 +
> 4 files changed, 160 insertions(+), 47 deletions(-)

It looks like the patch that teaches loop to use the kernel aio
interface got combined with the patch that adds the _bvec entry points
to ext3.

> + if (file->f_op->write_iter&& file->f_op->read_iter) {
> + file->f_flags |= O_DIRECT;
> + lo_flags |= LO_FLAGS_USE_AIO;
> + }

This manual setting of f_flags still looks very fishy to me. I remember
finding that pattern somewhere else but that's not very comforting :).

- z

2012-02-27 22:53:17

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages

On 02/27/2012 04:27 PM, Zach Brown wrote:
> On 02/27/2012 04:19 PM, Dave Kleikamp wrote:
>> This patchset was begun by Zach Brown and was originally submitted for
>> review in October, 2009.
>
> Man, it's been a while. I remembered almost none of the details of this
> work when I read your introductory message. As I read the patches it
> all came flooding back, though. Yikes :).
>
> My biggest fear about this patch series is the sheer amount of very
> fiddly code motion. I remember spending a lot of time verifying that
> the patches didn't accidentally lose new changes as the patches were
> ported to newer kernels.
>
> Has someone gone through these most recent patches with an absurdly fine
> toothed comb? The patches that touch fs/direct-io.c and mm/filemap.c
> are the most risky, by far.

I was pretty careful porting these patches, trying not to lose the
effects of any newer changes to the affected functions. It wouldn't hurt
to go through them again, but I am putting it out as an RFC since I
don't think these are ready to merge quite yet.

>> This series was written to reduce the current overhead loop imposes by
>> performing synchronous buffered file system IO from a kernel thread. These
>> patches turn loop into a lightweight layer that translates bios into iocbs.
>
> It'd be worth including some simple benchmark results, I think. I
> remember testing with concurrent O_DIRECT aio with fio on a loopback
> device on top of files in each underlying file system.

Actually, some preliminary tests I ran showed a significant slowdown. I
suspect it may have to do with the unconditional setting of O_DIRECT,
but I haven't verified that yet. I'll do further performance analysis,
but I wanted to get some eyes on this in the mean time.

Thanks,
Shaggy

2012-02-27 23:14:39

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC PATCH 18/22] ext3: add support for .read_iter and .write_iter

On 02/27/2012 04:34 PM, Zach Brown wrote:
>
>> drivers/block/loop.c | 55 ++++++++++++++++++-
>> fs/ext3/file.c | 2 +
>> fs/ext3/inode.c | 149
>> ++++++++++++++++++++++++++++++++++----------------
>> include/linux/loop.h | 1 +
>> 4 files changed, 160 insertions(+), 47 deletions(-)
>
> It looks like the patch that teaches loop to use the kernel aio
> interface got combined with the patch that adds the _bvec entry points
> to ext3.

Okay, looking back, your patchset had them separate. This was my error.
I'll separate them again.

>> + if (file->f_op->write_iter&& file->f_op->read_iter) {
>> + file->f_flags |= O_DIRECT;
>> + lo_flags |= LO_FLAGS_USE_AIO;
>> + }
>
> This manual setting of f_flags still looks very fishy to me. I remember
> finding that pattern somewhere else but that's not very comforting :).
>
> - z

2012-02-27 23:18:09

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [RFC PATCH 22/22] nfs: add support for read_iter, write_iter

On 02/27/2012 04:08 PM, Myklebust, Trond wrote:
> On Mon, 2012-02-27 at 15:19 -0600, Dave Kleikamp wrote:
>> This patch implements the read_iter and write_iter file operations which
>> allow kernel code to initiate directIO. This allows the loop device to
>> read and write directly to the server, bypassing the page cache.
>>
>> Signed-off-by: Dave Kleikamp <[email protected]>
>> Cc: Zach Brown <[email protected]>
>> Cc: Trond Myklebust <[email protected]>
>> Cc: [email protected]
>
> Performance is going to be absolutely terrible for O_DIRECT bvecs if you
> send just one page per RPC call. We are working on merging the O_DIRECT
> and page cache code in order to give O_DIRECT the ability to coalesce
> requests and do pNFS, and I'm hoping that code will be available soon.
>
> In the meantime, wouldn't it be possible to add basic coalescing to
> nfs_direct_read_schedule_bvec/nfs_direct_write_schedule_bvec more or
> less in the same way that we do for multi-page iovec segments?
> i.e. if the next bvec is contiguous with the previous, and the resulting
> RPC read length < rsize / write length < wsize, then add it to the same
> RPC call.

I basically followed the example of what the block layer was doing, but
coalescing makes more sense for nfs. I'll rework it to do that.

Thanks,
Shaggy

2012-02-28 09:09:43

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 03/22] fuse: convert fuse to use iov_iter_copy_[to|from]_user

Dave Kleikamp <[email protected]> writes:

> A future patch hides the internals of struct iov_iter, so fuse should
> be using the supported interface.
>
> Signed-off-by: Dave Kleikamp <[email protected]>

Acked-by: Miklos Szeredi <[email protected]>

Thanks,
Miklos

> Cc: Miklos Szeredi <[email protected]>
> Cc: [email protected]
> ---
> fs/fuse/file.c | 29 ++++++++---------------------
> 1 file changed, 8 insertions(+), 21 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 4a199fd..877cee0 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1582,30 +1582,17 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
> while (iov_iter_count(&ii)) {
> struct page *page = pages[page_idx++];
> size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
> - void *kaddr;
> + size_t left;
>
> - kaddr = kmap(page);
> -
> - while (todo) {
> - char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
> - size_t iov_len = ii.iov->iov_len - ii.iov_offset;
> - size_t copy = min(todo, iov_len);
> - size_t left;
> -
> - if (!to_user)
> - left = copy_from_user(kaddr, uaddr, copy);
> - else
> - left = copy_to_user(uaddr, kaddr, copy);
> -
> - if (unlikely(left))
> - return -EFAULT;
> + if (!to_user)
> + left = iov_iter_copy_from_user(page, &ii, 0, todo);
> + else
> + left = iov_iter_copy_to_user(page, &ii, 0, todo);
>
> - iov_iter_advance(&ii, copy);
> - todo -= copy;
> - kaddr += copy;
> - }
> + if (unlikely(left))
> + return -EFAULT;
>
> - kunmap(page);
> + iov_iter_advance(&ii, todo);
> }
>
> return 0;

2012-02-28 09:29:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages

I had a very brief look over the series, and in general I like it.

But there is one thing that needs a major revision, and that is the
filesystem interface.

For one you make ->aio_read/write trivial wrappers around ->read_iter
and ->write_iter. Instead of keeping this duplication around please
make sure to entirely kill ->aio_read/write and always use your new
methods. Without that we'll get into a complete mess like the old
->aio_read/write vs ->readv/writev again.

A similar thing applies to the ->direct_IO/direct_IO_bvec interface -
instead of duplicating it I'd rather change the ->direct_IO interface to:

ssize_t (*direct_IO)(int rw, struct kiocb *iocb,
		     struct iov_iter *iter, loff_t pos)

and let only fs/direct-io.c care about the difference when using user vs
kernel pages.

2012-02-28 15:14:38

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages


> For one you make ->aio_read/write trivial wrappers around ->read_iter
> and ->write_iter. Instead of keeping this duplication around please
> make sure to entirely kill ->aio_read/write and always use your new
> methods. Without that we'll get into a complete mess like the old
> ->aio_read/write vs ->readv/writev again.
>
> A similar thing applies to the ->direct_IO/direct_IO_bvec interface -

Yeah, that's reasonable. Perhaps obviously, we started with new entry
points to minimize the amount of churn we'd have to go through to test
the change in behaviour.

It's going to be messy to try and abstract away the pinning and dirtying
of the iter regions from direct IO through the iter interface, but maybe
not horribly so.

- z

2012-02-29 09:08:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 00/22] loop: Issue O_DIRECT aio with pages

On Tue, Feb 28, 2012 at 10:14:34AM -0500, Zach Brown wrote:
> Yeah, that's reasonable. Perhaps obviously, we started with new entry
> points to minimize the amount of churn we'd have to go through to test
> the change in behaviour.
>
> It's going to be messy to try and abstract away the pinning and dirtying
> of the iter regions from direct IO through the iter interface, but maybe
> not horribly so.

I don't really care too much about the implementation inside fs/direct-io.c
(at least for now - once I see it I might still scream "bloody
murder!").

The point is to pass the iov_iter all the way down to a common entry
point in fs/direct-io.c, so that the filesystems don't have to care
for that difference.
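
To illustrate, a filesystem method could then reduce to something like
this (purely a sketch: it assumes fs/direct-io.c grows a matching
iter-based entry point, here called __blockdev_direct_IO_iter, which
does not exist in this series):

	static ssize_t myfs_direct_IO(int rw, struct kiocb *iocb,
				      struct iov_iter *iter, loff_t pos)
	{
		struct inode *inode = iocb->ki_filp->f_mapping->host;

		/* fs/direct-io.c would use iter->ops to tell user iovecs
		 * from kernel bvecs; the filesystem no longer cares */
		return __blockdev_direct_IO_iter(rw, iocb, inode,
						 inode->i_sb->s_bdev, iter,
						 pos, myfs_get_block);
	}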