2013-07-25 18:11:17

by Dave Kleikamp

Subject: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

This patch series adds a kernel interface to fs/aio.c so that kernel code can
issue concurrent asynchronous IO to file systems. It adds an aio command and
file system methods that specify IO memory with pages instead of userspace
addresses.

This series was written to reduce the overhead that loop currently imposes by
performing synchronous buffered file system IO from a kernel thread. These
patches turn loop into a lightweight layer that translates bios into iocbs.

It introduces new file ops, read_iter() and write_iter(), that replace the
aio_read() and aio_write() operations. The iov_iter structure can now contain
either a user-space iovec or a kernel-space bio_vec. Since it would be
overly complicated to replace every instance of aio_read() and aio_write(),
the old operations are not removed, but file systems implementing the new
ones need not keep the old ones.
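
As a quick illustration (a sketch, not code from the series itself), a kernel
caller can describe a pinned page with a bvec-backed iov_iter and hand it to a
file's write_iter method, much as mm/page_io.c does for swap in patch 24.
Assume page, file and pos are in scope:

	struct bio_vec bvec = {
		.bv_page	= page,
		.bv_len		= PAGE_SIZE,
		.bv_offset	= 0,
	};
	struct iov_iter iter;
	struct kiocb kiocb;
	ssize_t ret;

	/* one segment of PAGE_SIZE bytes, nothing consumed yet */
	iov_iter_init_bvec(&iter, &bvec, 1, PAGE_SIZE, 0);
	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = pos;
	ret = file->f_op->write_iter(&kiocb, &iter, kiocb.ki_pos);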

Version 8 is little changed from Version 7, which I sent out in March; it has
just been updated to the latest kernel. These patches apply to 3.11-rc2 and
can also be found at:

git://github.com/kleikamp/linux-shaggy.git aio_loop

Asias He (1):
block_dev: add support for read_iter, write_iter

Dave Kleikamp (22):
iov_iter: iov_iter_copy_from_user() should use non-atomic copy
iov_iter: add __iov_iter_copy_to_user()
fuse: convert fuse to use iov_iter_copy_[to|from]_user
iov_iter: ii_iovec_copy_to_user should pre-fault user pages
dio: Convert direct_IO to use iov_iter
dio: add bio_vec support to __blockdev_direct_IO()
aio: add aio_kernel_() interface
aio: add aio support for iov_iter arguments
fs: create file_readable() and file_writable() functions
fs: use read_iter and write_iter rather than aio_read and aio_write
fs: add read_iter and write_iter to several file systems
ocfs2: add support for read_iter and write_iter
ext4: add support for read_iter and write_iter
nfs: add support for read_iter, write_iter
nfs: simplify swap
btrfs: add support for read_iter and write_iter
xfs: add support for read_iter and write_iter
gfs2: Convert aio_read/write ops to read/write_iter
udf: convert file ops from aio_read/write to read/write_iter
afs: add support for read_iter and write_iter
ecryptfs: Convert aio_read/write ops to read/write_iter
ubifs: convert file ops from aio_read/write to read/write_iter

Hugh Dickins (1):
tmpfs: add support for read_iter and write_iter

Zach Brown (9):
iov_iter: move into its own file
iov_iter: add copy_to_user support
iov_iter: hide iovec details behind ops function pointers
iov_iter: add bvec support
iov_iter: add a shorten call
iov_iter: let callers extract iovecs and bio_vecs
fs: pull iov_iter use higher up the stack
bio: add bvec_length(), like iov_length()
loop: use aio to perform io on the underlying file

Documentation/filesystems/Locking | 6 +-
Documentation/filesystems/vfs.txt | 12 +-
drivers/block/loop.c | 148 ++++++++----
drivers/char/raw.c | 4 +-
drivers/mtd/nand/nandsim.c | 4 +-
drivers/usb/gadget/storage_common.c | 4 +-
fs/9p/vfs_addr.c | 12 +-
fs/9p/vfs_file.c | 8 +-
fs/Makefile | 2 +-
fs/adfs/file.c | 4 +-
fs/affs/file.c | 4 +-
fs/afs/file.c | 4 +-
fs/afs/internal.h | 3 +-
fs/afs/write.c | 9 +-
fs/aio.c | 152 ++++++++++++-
fs/bad_inode.c | 14 ++
fs/bfs/file.c | 4 +-
fs/block_dev.c | 27 ++-
fs/btrfs/file.c | 42 ++--
fs/btrfs/inode.c | 63 +++---
fs/ceph/addr.c | 3 +-
fs/cifs/file.c | 4 +-
fs/direct-io.c | 223 +++++++++++++------
fs/ecryptfs/file.c | 15 +-
fs/exofs/file.c | 4 +-
fs/ext2/file.c | 4 +-
fs/ext2/inode.c | 8 +-
fs/ext3/file.c | 4 +-
fs/ext3/inode.c | 15 +-
fs/ext4/ext4.h | 3 +-
fs/ext4/file.c | 34 +--
fs/ext4/indirect.c | 16 +-
fs/ext4/inode.c | 23 +-
fs/f2fs/data.c | 4 +-
fs/f2fs/file.c | 4 +-
fs/fat/file.c | 4 +-
fs/fat/inode.c | 10 +-
fs/fuse/cuse.c | 10 +-
fs/fuse/file.c | 90 ++++----
fs/fuse/fuse_i.h | 5 +-
fs/gfs2/aops.c | 7 +-
fs/gfs2/file.c | 21 +-
fs/hfs/inode.c | 11 +-
fs/hfsplus/inode.c | 10 +-
fs/hostfs/hostfs_kern.c | 4 +-
fs/hpfs/file.c | 4 +-
fs/internal.h | 4 +
fs/iov-iter.c | 411 ++++++++++++++++++++++++++++++++++
fs/jffs2/file.c | 8 +-
fs/jfs/file.c | 4 +-
fs/jfs/inode.c | 7 +-
fs/logfs/file.c | 4 +-
fs/minix/file.c | 4 +-
fs/nfs/direct.c | 302 ++++++++++++++++---------
fs/nfs/file.c | 33 ++-
fs/nfs/internal.h | 4 +-
fs/nfs/nfs4file.c | 4 +-
fs/nilfs2/file.c | 4 +-
fs/nilfs2/inode.c | 8 +-
fs/ocfs2/aops.c | 8 +-
fs/ocfs2/aops.h | 2 +-
fs/ocfs2/file.c | 55 ++---
fs/ocfs2/ocfs2_trace.h | 6 +-
fs/omfs/file.c | 4 +-
fs/ramfs/file-mmu.c | 4 +-
fs/ramfs/file-nommu.c | 4 +-
fs/read_write.c | 78 +++++--
fs/reiserfs/file.c | 4 +-
fs/reiserfs/inode.c | 7 +-
fs/romfs/mmap-nommu.c | 2 +-
fs/sysv/file.c | 4 +-
fs/ubifs/file.c | 12 +-
fs/udf/file.c | 13 +-
fs/udf/inode.c | 10 +-
fs/ufs/file.c | 4 +-
fs/xfs/xfs_aops.c | 13 +-
fs/xfs/xfs_file.c | 51 ++---
include/linux/aio.h | 20 +-
include/linux/bio.h | 8 +
include/linux/blk_types.h | 2 -
include/linux/fs.h | 165 ++++++++++++--
include/linux/nfs_fs.h | 13 +-
include/uapi/linux/aio_abi.h | 2 +
include/uapi/linux/loop.h | 1 +
mm/filemap.c | 433 ++++++++++++++----------------------
mm/page_io.c | 15 +-
mm/shmem.c | 61 ++---
87 files changed, 1862 insertions(+), 1002 deletions(-)
create mode 100644 fs/iov-iter.c

--
1.8.3.4


2013-07-25 17:52:03

by Dave Kleikamp

Subject: [PATCH V8 18/33] fs: create file_readable() and file_writable() functions

Create helper functions that check whether a file's f_op contains either a
read or aio_read op, or likewise a write or aio_write op. We will be adding
read_iter and write_iter, and don't want to complicate these checks in
multiple places.

Signed-off-by: Dave Kleikamp <[email protected]>
---
drivers/mtd/nand/nandsim.c | 4 ++--
drivers/usb/gadget/storage_common.c | 4 ++--
fs/read_write.c | 14 +++++++-------
include/linux/fs.h | 10 ++++++++++
4 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
index cb38f3d..6a44ba6 100644
--- a/drivers/mtd/nand/nandsim.c
+++ b/drivers/mtd/nand/nandsim.c
@@ -576,12 +576,12 @@ static int alloc_device(struct nandsim *ns)
cfile = filp_open(cache_file, O_CREAT | O_RDWR | O_LARGEFILE, 0600);
if (IS_ERR(cfile))
return PTR_ERR(cfile);
- if (!cfile->f_op || (!cfile->f_op->read && !cfile->f_op->aio_read)) {
+ if (!file_readable(cfile)) {
NS_ERR("alloc_device: cache file not readable\n");
err = -EINVAL;
goto err_close;
}
- if (!cfile->f_op->write && !cfile->f_op->aio_write) {
+ if (!file_writable(cfile)) {
NS_ERR("alloc_device: cache file not writeable\n");
err = -EINVAL;
goto err_close;
diff --git a/drivers/usb/gadget/storage_common.c b/drivers/usb/gadget/storage_common.c
index dbce3a9..a801a00 100644
--- a/drivers/usb/gadget/storage_common.c
+++ b/drivers/usb/gadget/storage_common.c
@@ -450,11 +450,11 @@ static int fsg_lun_open(struct fsg_lun *curlun, const char *filename)
* If we can't read the file, it's no good.
* If we can't write the file, use it read-only.
*/
- if (!(filp->f_op->read || filp->f_op->aio_read)) {
+ if (!file_readable(filp)) {
LINFO(curlun, "file not readable: %s\n", filename);
goto out;
}
- if (!(filp->f_op->write || filp->f_op->aio_write))
+ if (!file_writable(filp))
ro = 1;

size = i_size_read(inode->i_mapping->host);
diff --git a/fs/read_write.c b/fs/read_write.c
index 122a384..022929c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -385,7 +385,7 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)

if (!(file->f_mode & FMODE_READ))
return -EBADF;
- if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
+ if (!file_readable(file))
return -EINVAL;
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
return -EFAULT;
@@ -435,7 +435,7 @@ ssize_t __kernel_write(struct file *file, const char *buf, size_t count, loff_t
const char __user *p;
ssize_t ret;

- if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+ if (!file_writable(file))
return -EINVAL;

old_fs = get_fs();
@@ -462,7 +462,7 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_

if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
- if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+ if (!file_writable(file))
return -EINVAL;
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
return -EFAULT;
@@ -781,7 +781,7 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
{
if (!(file->f_mode & FMODE_READ))
return -EBADF;
- if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
+ if (!file_readable(file))
return -EINVAL;

return do_readv_writev(READ, file, vec, vlen, pos);
@@ -794,7 +794,7 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
{
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
- if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
+ if (!file_writable(file))
return -EINVAL;

return do_readv_writev(WRITE, file, vec, vlen, pos);
@@ -968,7 +968,7 @@ static size_t compat_readv(struct file *file,
goto out;

ret = -EINVAL;
- if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
+ if (!file_readable(file))
goto out;

ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
@@ -1035,7 +1035,7 @@ static size_t compat_writev(struct file *file,
goto out;

ret = -EINVAL;
- if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
+ if (!file_writable(file))
goto out;

ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d716a29..fc1e2a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1652,6 +1652,16 @@ struct file_operations {
int (*show_fdinfo)(struct seq_file *m, struct file *f);
};

+static inline int file_readable(struct file *filp)
+{
+ return filp && filp->f_op && (filp->f_op->read || filp->f_op->aio_read);
+}
+
+static inline int file_writable(struct file *filp)
+{
+ return filp && filp->f_op && (filp->f_op->write || filp->f_op->aio_write);
+}
+
struct inode_operations {
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
void * (*follow_link) (struct dentry *, struct nameidata *);
--
1.8.3.4

2013-07-25 17:52:07

by Dave Kleikamp

Subject: [PATCH V8 04/33] iov_iter: add __iov_iter_copy_to_user()

This patch adds __iov_iter_copy_to_user(), which doesn't verify write access
to the user memory, for callers that have already performed that verification.
iov_iter_copy_to_user() now does the access check itself before calling the
new helper.
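
As a sketch of the intended split (these call sites are illustrative, not
taken from the patch): a path that has already validated the user buffer
calls the bare helper, while an unverified path uses the checking wrapper:

	/* buffer already verified, e.g. by generic_segment_checks() */
	copied = __iov_iter_copy_to_user(page, &iter, offset, bytes);

	/* unverified caller: the wrapper performs the check, then copies */
	copied = iov_iter_copy_to_user(page, &iter, offset, bytes);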

Signed-off-by: Dave Kleikamp <[email protected]>
---
fs/iov-iter.c | 14 ++++++++++++--
include/linux/fs.h | 4 +++-
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 0b2407e..6cecab4 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -19,7 +19,7 @@ static size_t __iovec_copy_to_user(char *vaddr, const struct iovec *iov,
if (atomic)
left = __copy_to_user_inatomic(buf, vaddr, copy);
else
- left = copy_to_user(buf, vaddr, copy);
+ left = __copy_to_user(buf, vaddr, copy);
copied += copy;
bytes -= copy;
vaddr += copy;
@@ -65,7 +65,7 @@ EXPORT_SYMBOL(iov_iter_copy_to_user_atomic);
* The difference is that it attempts to resolve faults.
* Page must not be locked.
*/
-size_t iov_iter_copy_to_user(struct page *page,
+size_t __iov_iter_copy_to_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
char *kaddr;
@@ -84,6 +84,16 @@ size_t iov_iter_copy_to_user(struct page *page,
kunmap(page);
return copied;
}
+EXPORT_SYMBOL(__iov_iter_copy_to_user);
+
+size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ might_sleep();
+ if (generic_segment_checks(i->iov, &i->nr_segs, &bytes, VERIFY_WRITE))
+ return 0;
+ return __iov_iter_copy_to_user(page, i, offset, bytes);
+}
EXPORT_SYMBOL(iov_iter_copy_to_user);

static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 80f71df..bfc6eb0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -296,7 +296,9 @@ struct iov_iter {
size_t count;
};

-size_t iov_iter_copy_to_user_atomic(struct page *page,
+size_t __iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t __iov_iter_copy_to_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
size_t iov_iter_copy_to_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
--
1.8.3.4

2013-07-25 17:52:20

by Dave Kleikamp

Subject: [PATCH V8 24/33] nfs: simplify swap

swap_writepage() can now call nfs's write_iter f_op, eliminating the need to
implement the special-case direct_IO a_op for swap. There is no longer a need
to pass the uio flag through the direct write path.

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: [email protected]
---
fs/nfs/direct.c | 94 ++++++++++++++++-------------------------------
fs/nfs/file.c | 4 +-
include/linux/blk_types.h | 2 -
include/linux/fs.h | 2 -
include/linux/nfs_fs.h | 4 +-
mm/page_io.c | 13 +++----
6 files changed, 42 insertions(+), 77 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 2b0ebcb..239c2fe 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -118,29 +118,18 @@ static inline int put_dreq(struct nfs_direct_req *dreq)
* @nr_segs: size of iovec array
*
* The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, for most direct IO, we
- * shunt off direct read and write requests before the VFS gets them,
- * so this method is only ever called for swap.
+ * the NFS client supports direct I/O. However, we shunt off direct
+ * read and write requests before the VFS gets them, so this method
+ * should never be called.
*/
ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
loff_t pos)
{
-#ifndef CONFIG_NFS_SWAP
dprintk("NFS: nfs_direct_IO (%s) off/no(%Ld/%lu) EINVAL\n",
iocb->ki_filp->f_path.dentry->d_name.name,
(long long) pos, iter->nr_segs);

return -EINVAL;
-#else
- VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
- VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);
-
- if (rw == READ || rw == KERNEL_READ)
- return nfs_file_direct_read(iocb, iter, pos,
- rw == READ ? true : false);
- return nfs_file_direct_write(iocb, iter, pos,
- rw == WRITE ? true : false);
-#endif /* CONFIG_NFS_SWAP */
}

static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
@@ -312,7 +301,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_read_completion_ops = {
*/
static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *desc,
const struct iovec *iov,
- loff_t pos, bool uio)
+ loff_t pos)
{
struct nfs_direct_req *dreq = desc->pg_dreq;
struct nfs_open_context *ctx = dreq->ctx;
@@ -340,20 +329,12 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
GFP_KERNEL);
if (!pagevec)
break;
- if (uio) {
- down_read(&current->mm->mmap_sem);
- result = get_user_pages(current, current->mm, user_addr,
+ down_read(&current->mm->mmap_sem);
+ result = get_user_pages(current, current->mm, user_addr,
npages, 1, 0, pagevec, NULL);
- up_read(&current->mm->mmap_sem);
- if (result < 0)
- break;
- } else {
- WARN_ON(npages != 1);
- result = get_kernel_page(user_addr, 1, pagevec);
- if (WARN_ON(result != 1))
- break;
- }
-
+ up_read(&current->mm->mmap_sem);
+ if (result < 0)
+ break;
if ((unsigned)result < npages) {
bytes = result * PAGE_SIZE;
if (bytes <= pgbase) {
@@ -403,7 +384,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de

static ssize_t nfs_direct_do_schedule_read_iovec(
struct nfs_pageio_descriptor *desc, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, bool uio)
+ unsigned long nr_segs, loff_t pos)
{
ssize_t result = -EINVAL;
size_t requested_bytes = 0;
@@ -411,7 +392,7 @@ static ssize_t nfs_direct_do_schedule_read_iovec(

for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
- result = nfs_direct_read_schedule_segment(desc, vec, pos, uio);
+ result = nfs_direct_read_schedule_segment(desc, vec, pos);
if (result < 0)
break;
requested_bytes += result;
@@ -468,8 +449,7 @@ static ssize_t nfs_direct_do_schedule_read_bvec(
#endif /* CONFIG_BLOCK */

static ssize_t nfs_direct_read_schedule(struct nfs_direct_req *dreq,
- struct iov_iter *iter, loff_t pos,
- bool uio)
+ struct iov_iter *iter, loff_t pos)
{
struct nfs_pageio_descriptor desc;
ssize_t result;
@@ -480,10 +460,8 @@ static ssize_t nfs_direct_read_schedule(struct nfs_direct_req *dreq,
desc.pg_dreq = dreq;

if (iov_iter_has_iovec(iter)) {
- if (uio)
- dreq->flags = NFS_ODIRECT_MARK_DIRTY;
result = nfs_direct_do_schedule_read_iovec(&desc,
- iov_iter_iovec(iter), iter->nr_segs, pos, uio);
+ iov_iter_iovec(iter), iter->nr_segs, pos);
#ifdef CONFIG_BLOCK
} else if (iov_iter_has_bvec(iter)) {
result = nfs_direct_do_schedule_read_bvec(&desc,
@@ -509,7 +487,7 @@ static ssize_t nfs_direct_read_schedule(struct nfs_direct_req *dreq,
}

static ssize_t nfs_direct_read(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio)
+ loff_t pos)
{
ssize_t result = -ENOMEM;
struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -533,7 +511,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, struct iov_iter *iter,
dreq->iocb = iocb;

NFS_I(inode)->read_io += iov_iter_count(iter);
- result = nfs_direct_read_schedule(dreq, iter, pos, uio);
+ result = nfs_direct_read_schedule(dreq, iter, pos);
if (!result)
result = nfs_direct_wait(dreq);
out_release:
@@ -698,7 +676,7 @@ static void nfs_direct_write_complete(struct nfs_direct_req *dreq, struct inode
*/
static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *desc,
const struct iovec *iov,
- loff_t pos, bool uio)
+ loff_t pos)
{
struct nfs_direct_req *dreq = desc->pg_dreq;
struct nfs_open_context *ctx = dreq->ctx;
@@ -726,19 +704,12 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *d
if (!pagevec)
break;

- if (uio) {
- down_read(&current->mm->mmap_sem);
- result = get_user_pages(current, current->mm, user_addr,
- npages, 0, 0, pagevec, NULL);
- up_read(&current->mm->mmap_sem);
- if (result < 0)
- break;
- } else {
- WARN_ON(npages != 1);
- result = get_kernel_page(user_addr, 0, pagevec);
- if (WARN_ON(result != 1))
- break;
- }
+ down_read(&current->mm->mmap_sem);
+ result = get_user_pages(current, current->mm, user_addr,
+ npages, 0, 0, pagevec, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (result < 0)
+ break;

if ((unsigned)result < npages) {
bytes = result * PAGE_SIZE;
@@ -869,7 +840,7 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {

static ssize_t nfs_direct_do_schedule_write_iovec(
struct nfs_pageio_descriptor *desc, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, bool uio)
+ unsigned long nr_segs, loff_t pos)
{
ssize_t result = -EINVAL;
size_t requested_bytes = 0;
@@ -878,7 +849,7 @@ static ssize_t nfs_direct_do_schedule_write_iovec(
for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
result = nfs_direct_write_schedule_segment(desc, vec,
- pos, uio);
+ pos);
if (result < 0)
break;
requested_bytes += result;
@@ -936,8 +907,7 @@ static ssize_t nfs_direct_do_schedule_write_bvec(
#endif /* CONFIG_BLOCK */

static ssize_t nfs_direct_write_schedule(struct nfs_direct_req *dreq,
- struct iov_iter *iter, loff_t pos,
- bool uio)
+ struct iov_iter *iter, loff_t pos)
{
struct nfs_pageio_descriptor desc;
struct inode *inode = dreq->inode;
@@ -953,7 +923,7 @@ static ssize_t nfs_direct_write_schedule(struct nfs_direct_req *dreq,

if (iov_iter_has_iovec(iter)) {
result = nfs_direct_do_schedule_write_iovec(&desc,
- iov_iter_iovec(iter), iter->nr_segs, pos, uio);
+ iov_iter_iovec(iter), iter->nr_segs, pos);
#ifdef CONFIG_BLOCK
} else if (iov_iter_has_bvec(iter)) {
result = nfs_direct_do_schedule_write_bvec(&desc,
@@ -980,7 +950,7 @@ static ssize_t nfs_direct_write_schedule(struct nfs_direct_req *dreq,
}

static ssize_t nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio)
+ loff_t pos)
{
ssize_t result = -ENOMEM;
struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -1003,7 +973,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;

- result = nfs_direct_write_schedule(dreq, iter, pos, uio);
+ result = nfs_direct_write_schedule(dreq, iter, pos);
if (!result)
result = nfs_direct_wait(dreq);
out_release:
@@ -1033,7 +1003,7 @@ out:
* cache.
*/
ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio)
+ loff_t pos)
{
ssize_t retval = -EINVAL;
struct file *file = iocb->ki_filp;
@@ -1058,7 +1028,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,

task_io_account_read(count);

- retval = nfs_direct_read(iocb, iter, pos, uio);
+ retval = nfs_direct_read(iocb, iter, pos);
if (retval > 0)
iocb->ki_pos = pos + retval;

@@ -1088,7 +1058,7 @@ out:
* is no atomic O_APPEND write facility in the NFS protocol.
*/
ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio)
+ loff_t pos)
{
ssize_t retval = -EINVAL;
struct file *file = iocb->ki_filp;
@@ -1120,7 +1090,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,

task_io_account_write(count);

- retval = nfs_direct_write(iocb, iter, pos, uio);
+ retval = nfs_direct_write(iocb, iter, pos);
if (retval > 0) {
struct inode *inode = mapping->host;

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index bbff2f9..3e210ca 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -179,7 +179,7 @@ nfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
ssize_t result;

if (iocb->ki_filp->f_flags & O_DIRECT)
- return nfs_file_direct_read(iocb, iter, pos, true);
+ return nfs_file_direct_read(iocb, iter, pos);

dprintk("NFS: read_iter(%s/%s, %lu@%lu)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
@@ -651,7 +651,7 @@ ssize_t nfs_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
size_t count = iov_iter_count(iter);

if (iocb->ki_filp->f_flags & O_DIRECT)
- return nfs_file_direct_write(iocb, iter, pos, true);
+ return nfs_file_direct_write(iocb, iter, pos);

dprintk("NFS: write_iter(%s/%s, %lu@%lld)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fa1abeb..1bea25f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -176,7 +176,6 @@ enum rq_flag_bits {
__REQ_FLUSH_SEQ, /* request for flush sequence */
__REQ_IO_STAT, /* account I/O stat */
__REQ_MIXED_MERGE, /* merge of different types, fail separately */
- __REQ_KERNEL, /* direct IO to kernel pages */
__REQ_PM, /* runtime pm request */
__REQ_NR_BITS, /* stops here */
};
@@ -227,7 +226,6 @@ enum rq_flag_bits {
#define REQ_IO_STAT (1 << __REQ_IO_STAT)
#define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE)
#define REQ_SECURE (1 << __REQ_SECURE)
-#define REQ_KERNEL (1 << __REQ_KERNEL)
#define REQ_PM (1 << __REQ_PM)

#endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26d9d8d4..06f2290 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -181,8 +181,6 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define READ 0
#define WRITE RW_MASK
#define READA RWA_MASK
-#define KERNEL_READ (READ|REQ_KERNEL)
-#define KERNEL_WRITE (WRITE|REQ_KERNEL)

#define READ_SYNC (READ | REQ_SYNC)
#define WRITE_SYNC (WRITE | REQ_SYNC | REQ_NOIDLE)
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index b2324be..1f6a332 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -459,9 +459,9 @@ extern int nfs3_removexattr (struct dentry *, const char *name);
*/
extern ssize_t nfs_direct_IO(int, struct kiocb *, struct iov_iter *, loff_t);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio);
+ loff_t pos);
extern ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
- loff_t pos, bool uio);
+ loff_t pos);

/*
* linux/fs/nfs/dir.c
diff --git a/mm/page_io.c b/mm/page_io.c
index 0c1db1a..21023df 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -258,14 +258,14 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
if (sis->flags & SWP_FILE) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
- struct address_space *mapping = swap_file->f_mapping;
- struct iovec iov = {
- .iov_base = kmap(page),
- .iov_len = PAGE_SIZE,
+ struct bio_vec bvec = {
+ .bv_page = kmap(page),
+ .bv_len = PAGE_SIZE,
+ .bv_offset = 0,
};
struct iov_iter iter;

- iov_iter_init(&iter, &iov, 1, PAGE_SIZE, 0);
+ iov_iter_init_bvec(&iter, &bvec, 1, PAGE_SIZE, 0);

init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
@@ -274,8 +274,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->direct_IO(KERNEL_WRITE, &kiocb, &iter,
- kiocb.ki_pos);
+ ret = swap_file->f_op->write_iter(&kiocb, &iter, kiocb.ki_pos);
kunmap(page);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
--
1.8.3.4

2013-07-25 17:52:22

by Dave Kleikamp

Subject: [PATCH V8 10/33] iov_iter: let callers extract iovecs and bio_vecs

From: Zach Brown <[email protected]>

Direct IO treats memory from user iovecs and memory from arrays of kernel
pages very differently. User memory is pinned and worked with in batches,
while kernel pages are always pinned and don't require additional processing.

Rather than try to provide an abstraction that covers these different
behaviours, we let direct IO extract the underlying memory structs and hand
them to the existing code.
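
The resulting pattern in a direct IO path looks roughly like this, where
schedule_io_iovec() and schedule_io_bvec() are hypothetical stand-ins for
the existing iovec- and page-based submission code:

	if (iov_iter_has_iovec(iter)) {
		/* user memory: pinned in batches via get_user_pages() */
		ret = schedule_io_iovec(iov_iter_iovec(iter),
					iter->nr_segs, pos);
	} else if (iov_iter_has_bvec(iter)) {
		/* kernel pages: already pinned, used directly */
		ret = schedule_io_bvec(iov_iter_bvec(iter),
				       iter->nr_segs, pos);
	}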

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/fs.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index a89bcb9..322d585 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -371,6 +371,17 @@ static inline void iov_iter_init_bvec(struct iov_iter *i,

iov_iter_advance(i, written);
}
+
+static inline int iov_iter_has_bvec(struct iov_iter *i)
+{
+ return i->ops == &ii_bvec_ops;
+}
+
+static inline struct bio_vec *iov_iter_bvec(struct iov_iter *i)
+{
+ BUG_ON(!iov_iter_has_bvec(i));
+ return (struct bio_vec *)i->data;
+}
#endif

extern struct iov_iter_ops ii_iovec_ops;
@@ -388,8 +399,14 @@ static inline void iov_iter_init(struct iov_iter *i,
iov_iter_advance(i, written);
}

+static inline int iov_iter_has_iovec(struct iov_iter *i)
+{
+ return i->ops == &ii_iovec_ops;
+}
+
static inline struct iovec *iov_iter_iovec(struct iov_iter *i)
{
+ BUG_ON(!iov_iter_has_iovec(i));
return (struct iovec *)i->data;
}

--
1.8.3.4

2013-07-25 17:52:25

by Dave Kleikamp

Subject: [PATCH V8 27/33] xfs: add support for read_iter and write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Ben Myers <[email protected]>
Cc: Alex Elder <[email protected]>
Cc: [email protected]
---
fs/xfs/xfs_file.c | 51 ++++++++++++++++++++-------------------------------
1 file changed, 20 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de3dc98..1716b6a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -226,10 +226,9 @@ xfs_file_fsync(
}

STATIC ssize_t
-xfs_file_aio_read(
+xfs_file_read_iter(
struct kiocb *iocb,
- const struct iovec *iovp,
- unsigned long nr_segs,
+ struct iov_iter *iter,
loff_t pos)
{
struct file *file = iocb->ki_filp;
@@ -250,9 +249,7 @@ xfs_file_aio_read(
if (file->f_mode & FMODE_NOCMTIME)
ioflags |= IO_INVIS;

- ret = generic_segment_checks(iovp, &nr_segs, &size, VERIFY_WRITE);
- if (ret < 0)
- return ret;
+ size = iov_iter_count(iter);

if (unlikely(ioflags & IO_ISDIRECT)) {
xfs_buftarg_t *target =
@@ -305,7 +302,7 @@ xfs_file_aio_read(

trace_xfs_file_read(ip, size, pos, ioflags);

- ret = generic_file_aio_read(iocb, iovp, nr_segs, pos);
+ ret = generic_file_read_iter(iocb, iter, pos);
if (ret > 0)
XFS_STATS_ADD(xs_read_bytes, ret);

@@ -621,10 +618,9 @@ restart:
STATIC ssize_t
xfs_file_dio_aio_write(
struct kiocb *iocb,
- const struct iovec *iovp,
- unsigned long nr_segs,
+ struct iov_iter *iter,
loff_t pos,
- size_t ocount)
+ size_t count)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -632,7 +628,6 @@ xfs_file_dio_aio_write(
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
ssize_t ret = 0;
- size_t count = ocount;
int unaligned_io = 0;
int iolock;
struct xfs_buftarg *target = XFS_IS_REALTIME_INODE(ip) ?
@@ -692,8 +687,8 @@ xfs_file_dio_aio_write(
}

trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
- ret = generic_file_direct_write(iocb, iovp,
- &nr_segs, pos, &iocb->ki_pos, count, ocount);
+ ret = generic_file_direct_write_iter(iocb, iter,
+ pos, &iocb->ki_pos, count);

out:
xfs_rw_iunlock(ip, iolock);
@@ -706,10 +701,9 @@ out:
STATIC ssize_t
xfs_file_buffered_aio_write(
struct kiocb *iocb,
- const struct iovec *iovp,
- unsigned long nr_segs,
+ struct iov_iter *iter,
loff_t pos,
- size_t ocount)
+ size_t count)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -718,7 +712,6 @@ xfs_file_buffered_aio_write(
ssize_t ret;
int enospc = 0;
int iolock = XFS_IOLOCK_EXCL;
- size_t count = ocount;

xfs_rw_ilock(ip, iolock);

@@ -731,7 +724,7 @@ xfs_file_buffered_aio_write(

write_retry:
trace_xfs_file_buffered_write(ip, count, iocb->ki_pos, 0);
- ret = generic_file_buffered_write(iocb, iovp, nr_segs,
+ ret = generic_file_buffered_write_iter(iocb, iter,
pos, &iocb->ki_pos, count, 0);

/*
@@ -752,10 +745,9 @@ out:
}

STATIC ssize_t
-xfs_file_aio_write(
+xfs_file_write_iter(
struct kiocb *iocb,
- const struct iovec *iovp,
- unsigned long nr_segs,
+ struct iov_iter *iter,
loff_t pos)
{
struct file *file = iocb->ki_filp;
@@ -763,17 +755,15 @@ xfs_file_aio_write(
struct inode *inode = mapping->host;
struct xfs_inode *ip = XFS_I(inode);
ssize_t ret;
- size_t ocount = 0;
+ size_t count = 0;

XFS_STATS_INC(xs_write_calls);

BUG_ON(iocb->ki_pos != pos);

- ret = generic_segment_checks(iovp, &nr_segs, &ocount, VERIFY_READ);
- if (ret)
- return ret;
+ count = iov_iter_count(iter);

- if (ocount == 0)
+ if (count == 0)
return 0;

if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
@@ -782,10 +772,9 @@ xfs_file_aio_write(
}

if (unlikely(file->f_flags & O_DIRECT))
- ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
+ ret = xfs_file_dio_aio_write(iocb, iter, pos, count);
else
- ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos,
- ocount);
+ ret = xfs_file_buffered_aio_write(iocb, iter, pos, count);

if (ret > 0) {
ssize_t err;
@@ -1410,8 +1399,8 @@ const struct file_operations xfs_file_operations = {
.llseek = xfs_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = xfs_file_aio_read,
- .aio_write = xfs_file_aio_write,
+ .read_iter = xfs_file_read_iter,
+ .write_iter = xfs_file_write_iter,
.splice_read = xfs_file_splice_read,
.splice_write = xfs_file_splice_write,
.unlocked_ioctl = xfs_file_ioctl,
--
1.8.3.4

2013-07-25 17:52:59

by Dave Kleikamp

Subject: [PATCH V8 32/33] ubifs: convert file ops from aio_read/write to read/write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: [email protected]
---
fs/ubifs/file.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b..22924e0 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -44,7 +44,7 @@
* 'ubifs_writepage()' we are only guaranteed that the page is locked.
*
* Similarly, @i_mutex is not always locked in 'ubifs_readpage()', e.g., the
- * read-ahead path does not lock it ("sys_read -> generic_file_aio_read ->
+ * read-ahead path does not lock it ("sys_read -> generic_file_read_iter ->
* ondemand_readahead -> readpage"). In case of readahead, @I_SYNC flag is not
* set as well. However, UBIFS disables readahead.
*/
@@ -1396,8 +1396,8 @@ static int update_mctime(struct ubifs_info *c, struct inode *inode)
return 0;
}

-static ssize_t ubifs_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+static ssize_t ubifs_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
int err;
struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -1407,7 +1407,7 @@ static ssize_t ubifs_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (err)
return err;

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ return generic_file_write_iter(iocb, iter, pos);
}

static int ubifs_set_page_dirty(struct page *page)
@@ -1583,8 +1583,8 @@ const struct file_operations ubifs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = ubifs_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = ubifs_write_iter,
.mmap = ubifs_file_mmap,
.fsync = ubifs_fsync,
.unlocked_ioctl = ubifs_ioctl,
--
1.8.3.4

2013-07-25 17:59:47

by Dave Kleikamp

Subject: [PATCH V8 31/33] ecryptfs: Convert aio_read/write ops to read/write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Tyler Hicks <[email protected]>
Cc: Dustin Kirkland <[email protected]>
Cc: [email protected]
---
fs/ecryptfs/file.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 992cf95..3ed6e5f 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -37,22 +37,21 @@
/**
* ecryptfs_read_update_atime
*
- * generic_file_read updates the atime of upper layer inode. But, it
+ * generic_file_read_iter updates the atime of upper layer inode. But, it
* doesn't give us a chance to update the atime of the lower layer
- * inode. This function is a wrapper to generic_file_read. It
- * updates the atime of the lower level inode if generic_file_read
+ * inode. This function is a wrapper to generic_file_read_iter. It
+ * updates the atime of the lower level inode if generic_file_read_iter
* returns without any errors. This is to be used only for file reads.
* The function to be used for directory reads is ecryptfs_read.
*/
static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ struct iov_iter *iter, loff_t pos)
{
ssize_t rc;
struct path *path;
struct file *file = iocb->ki_filp;

- rc = generic_file_aio_read(iocb, iov, nr_segs, pos);
+ rc = generic_file_read_iter(iocb, iter, pos);
/*
* Even though this is a async interface, we need to wait
* for IO to finish to update atime
@@ -357,9 +356,9 @@ const struct file_operations ecryptfs_dir_fops = {
const struct file_operations ecryptfs_main_fops = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = ecryptfs_read_update_atime,
+ .read_iter = ecryptfs_read_update_atime,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.iterate = ecryptfs_readdir,
.unlocked_ioctl = ecryptfs_unlocked_ioctl,
#ifdef CONFIG_COMPAT
--
1.8.3.4

2013-07-25 17:59:51

by Dave Kleikamp

Subject: [PATCH V8 33/33] tmpfs: add support for read_iter and write_iter

From: Hugh Dickins <[email protected]>

Convert tmpfs do_shmem_file_read() to shmem_file_read_iter().
Make file_read_iter_actor() global so tmpfs can use it too: delete
file_read_actor(), which was made global in 2.4.4 for use by tmpfs.
Replace tmpfs generic_file_aio_write() by generic_file_write_iter().

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Dave Kleikamp <[email protected]>
---
include/linux/fs.h | 5 +++--
mm/filemap.c | 42 ++-----------------------------------
mm/shmem.c | 61 +++++++++++++++++-------------------------------------
3 files changed, 24 insertions(+), 84 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 269ef07..48209c8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2499,8 +2499,9 @@ extern int sb_min_blocksize(struct super_block *, int);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
- unsigned long size, pgoff_t pgoff);
-extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
+ unsigned long size, pgoff_t pgoff);
+extern int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size);
int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *,
diff --git a/mm/filemap.c b/mm/filemap.c
index 41b9672..45cfcfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1313,44 +1313,6 @@ out:
file_accessed(filp);
}

-int file_read_actor(read_descriptor_t *desc, struct page *page,
- unsigned long offset, unsigned long size)
-{
- char *kaddr;
- unsigned long left, count = desc->count;
-
- if (size > count)
- size = count;
-
- /*
- * Faults on the destination of a read are common, so do it before
- * taking the kmap.
- */
- if (!fault_in_pages_writeable(desc->arg.buf, size)) {
- kaddr = kmap_atomic(page);
- left = __copy_to_user_inatomic(desc->arg.buf,
- kaddr + offset, size);
- kunmap_atomic(kaddr);
- if (left == 0)
- goto success;
- }
-
- /* Do it the slow way */
- kaddr = kmap(page);
- left = __copy_to_user(desc->arg.buf, kaddr + offset, size);
- kunmap(page);
-
- if (left) {
- size -= left;
- desc->error = -EFAULT;
- }
-success:
- desc->count = count - size;
- desc->written += size;
- desc->arg.buf += size;
- return size;
-}
-
/*
* Performs necessary checks before doing a write
* @iov: io vector request
@@ -1390,8 +1352,8 @@ int generic_segment_checks(const struct iovec *iov,
}
EXPORT_SYMBOL(generic_segment_checks);

-static int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
- unsigned long offset, unsigned long size)
+int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size)
{
struct iov_iter *iter = desc->arg.data;
unsigned long copied = 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index a87990c..67cd5e5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1464,14 +1464,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
return copied;
}

-static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor)
+static ssize_t shmem_file_read_iter(struct kiocb *iocb,
+ struct iov_iter *iter, loff_t pos)
{
+ read_descriptor_t desc;
+ loff_t *ppos = &iocb->ki_pos;
+ struct file *filp = iocb->ki_filp;
struct inode *inode = file_inode(filp);
struct address_space *mapping = inode->i_mapping;
pgoff_t index;
unsigned long offset;
enum sgp_type sgp = SGP_READ;

+ desc.written = 0;
+ desc.count = iov_iter_count(iter);
+ desc.arg.data = iter;
+ desc.error = 0;
+
/*
* Might this read be for a stacking filesystem? Then when reading
* holes of a sparse file, we actually need to allocate those pages,
@@ -1498,10 +1507,10 @@ static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_
break;
}

- desc->error = shmem_getpage(inode, index, &page, sgp, NULL);
- if (desc->error) {
- if (desc->error == -EINVAL)
- desc->error = 0;
+ desc.error = shmem_getpage(inode, index, &page, sgp, NULL);
+ if (desc.error) {
+ if (desc.error == -EINVAL)
+ desc.error = 0;
break;
}
if (page)
@@ -1552,13 +1561,13 @@ static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_iter_actor(&desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;

page_cache_release(page);
- if (ret != nr || !desc->count)
+ if (ret != nr || !desc.count)
break;

cond_resched();
@@ -1566,40 +1575,8 @@ static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_

*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
file_accessed(filp);
-}
-
-static ssize_t shmem_file_aio_read(struct kiocb *iocb,
- const struct iovec *iov, unsigned long nr_segs, loff_t pos)
-{
- struct file *filp = iocb->ki_filp;
- ssize_t retval;
- unsigned long seg;
- size_t count;
- loff_t *ppos = &iocb->ki_pos;

- retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
- if (retval)
- return retval;
-
- for (seg = 0; seg < nr_segs; seg++) {
- read_descriptor_t desc;
-
- desc.written = 0;
- desc.arg.buf = iov[seg].iov_base;
- desc.count = iov[seg].iov_len;
- if (desc.count == 0)
- continue;
- desc.error = 0;
- do_shmem_file_read(filp, ppos, &desc, file_read_actor);
- retval += desc.written;
- if (desc.error) {
- retval = retval ?: desc.error;
- break;
- }
- if (desc.count > 0)
- break;
- }
- return retval;
+ return desc.written ? desc.written : desc.error;
}

static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
@@ -2721,8 +2698,8 @@ static const struct file_operations shmem_file_operations = {
.llseek = shmem_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = shmem_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = shmem_file_read_iter,
+ .write_iter = generic_file_write_iter,
.fsync = noop_fsync,
.splice_read = shmem_file_splice_read,
.splice_write = generic_file_splice_write,
--
1.8.3.4

2013-07-25 17:59:59

by Dave Kleikamp

Subject: [PATCH V8 19/33] fs: use read_iter and write_iter rather than aio_read and aio_write

File systems implementing read_iter and write_iter should not be required
to keep aio_read and aio_write as well. The vfs should always call
read_iter/write_iter when they exist. This will make it easier to remove
the aio_read/write operations in the future.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: [email protected]
---
fs/aio.c | 4 ++--
fs/bad_inode.c | 14 ++++++++++++
fs/internal.h | 4 ++++
fs/read_write.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++------
include/linux/fs.h | 6 +++--
5 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 0da82c0..9fa03a1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1051,14 +1051,14 @@ static ssize_t aio_run_iocb(struct kiocb *req, bool compat)
case IOCB_CMD_PREADV:
mode = FMODE_READ;
rw = READ;
- rw_op = file->f_op->aio_read;
+ rw_op = do_aio_read;
goto rw_common;

case IOCB_CMD_PWRITE:
case IOCB_CMD_PWRITEV:
mode = FMODE_WRITE;
rw = WRITE;
- rw_op = file->f_op->aio_write;
+ rw_op = do_aio_write;
goto rw_common;
rw_common:
if (unlikely(!(file->f_mode & mode)))
diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 7c93953..38651e5 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -39,12 +39,24 @@ static ssize_t bad_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
return -EIO;
}

+static ssize_t bad_file_read_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ return -EIO;
+}
+
static ssize_t bad_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
return -EIO;
}

+static ssize_t bad_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ return -EIO;
+}
+
static int bad_file_readdir(struct file *file, struct dir_context *ctx)
{
return -EIO;
@@ -151,7 +163,9 @@ static const struct file_operations bad_file_ops =
.read = bad_file_read,
.write = bad_file_write,
.aio_read = bad_file_aio_read,
+ .read_iter = bad_file_read_iter,
.aio_write = bad_file_aio_write,
+ .write_iter = bad_file_write_iter,
.iterate = bad_file_readdir,
.poll = bad_file_poll,
.unlocked_ioctl = bad_file_unlocked_ioctl,
diff --git a/fs/internal.h b/fs/internal.h
index 7c5f01c..143a903 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -132,6 +132,10 @@ extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
*/
extern ssize_t __kernel_write(struct file *, const char *, size_t, loff_t *);
extern int rw_verify_area(int, struct file *, const loff_t *, size_t);
+extern ssize_t do_aio_read(struct kiocb *kiocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos);
+extern ssize_t do_aio_write(struct kiocb *kiocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos);

/*
* splice.c
diff --git a/fs/read_write.c b/fs/read_write.c
index 022929c..c3579a9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -29,7 +29,7 @@ typedef ssize_t (*iov_fn_t)(struct kiocb *, const struct iovec *,
const struct file_operations generic_ro_fops = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.mmap = generic_file_readonly_mmap,
.splice_read = generic_file_splice_read,
};
@@ -359,6 +359,29 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
}

+ssize_t do_aio_read(struct kiocb *kiocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct file *file = kiocb->ki_filp;
+
+ if (file->f_op->read_iter) {
+ size_t count;
+ struct iov_iter iter;
+ int ret;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count,
+ VERIFY_WRITE);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+ return file->f_op->read_iter(kiocb, &iter, pos);
+ }
+
+ return file->f_op->aio_read(kiocb, iov, nr_segs, pos);
+}
+
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = buf, .iov_len = len };
@@ -370,7 +393,7 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp
kiocb.ki_left = len;
kiocb.ki_nbytes = len;

- ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
+ ret = do_aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -409,6 +432,29 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)

EXPORT_SYMBOL(vfs_read);

+ssize_t do_aio_write(struct kiocb *kiocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct file *file = kiocb->ki_filp;
+
+ if (file->f_op->write_iter) {
+ size_t count;
+ struct iov_iter iter;
+ int ret;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count,
+ VERIFY_READ);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+ return file->f_op->write_iter(kiocb, &iter, pos);
+ }
+
+ return file->f_op->aio_write(kiocb, iov, nr_segs, pos);
+}
+
ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
{
struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
@@ -420,7 +466,7 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof
kiocb.ki_left = len;
kiocb.ki_nbytes = len;

- ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
+ ret = do_aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
if (-EIOCBQUEUED == ret)
ret = wait_on_sync_kiocb(&kiocb);
*ppos = kiocb.ki_pos;
@@ -748,10 +794,12 @@ static ssize_t do_readv_writev(int type, struct file *file,
fnv = NULL;
if (type == READ) {
fn = file->f_op->read;
- fnv = file->f_op->aio_read;
+ if (file->f_op->aio_read || file->f_op->read_iter)
+ fnv = do_aio_read;
} else {
fn = (io_fn_t)file->f_op->write;
- fnv = file->f_op->aio_write;
+ if (file->f_op->aio_write || file->f_op->write_iter)
+ fnv = do_aio_write;
file_start_write(file);
}

@@ -930,10 +978,12 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
fnv = NULL;
if (type == READ) {
fn = file->f_op->read;
- fnv = file->f_op->aio_read;
+ if (file->f_op->aio_read || file->f_op->read_iter)
+ fnv = do_aio_read;
} else {
fn = (io_fn_t)file->f_op->write;
- fnv = file->f_op->aio_write;
+ if (file->f_op->aio_write || file->f_op->write_iter)
+ fnv = do_aio_write;
file_start_write(file);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fc1e2a8..26d9d8d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1654,12 +1654,14 @@ struct file_operations {

static inline int file_readable(struct file *filp)
{
- return filp && filp->f_op && (filp->f_op->read || filp->f_op->aio_read);
+ return filp && filp->f_op && (filp->f_op->read || filp->f_op->aio_read ||
+ filp->f_op->read_iter);
}

static inline int file_writable(struct file *filp)
{
- return filp && filp->f_op && (filp->f_op->write || filp->f_op->aio_write);
+ return filp && filp->f_op && (filp->f_op->write || filp->f_op->aio_write ||
+ filp->f_op->write_iter);
}

struct inode_operations {
--
1.8.3.4

2013-07-25 17:59:56

by Dave Kleikamp

Subject: [PATCH V8 08/33] iov_iter: add bvec support

From: Zach Brown <[email protected]>

This adds a set of iov_iter_ops calls that work with memory specified by an
array of bio_vec structs instead of an array of iovec structs.

The big difference is that the pages referenced by the bio_vec elements are
already pinned: they don't need to be faulted in, and we can always use
kmap_atomic() to map them one at a time.
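
Generic code keeps calling the same iov_iter_*() wrappers, which dispatch
through the ops vector, so one path handles both iovec- and bvec-backed
iterators; for instance, sketched from the wrapper pattern visible in fs.h:

static inline size_t iov_iter_copy_from_user_atomic(struct page *page,
		struct iov_iter *i, unsigned long offset, size_t bytes)
{
	/* resolves to ii_bvec_copy_from_user_atomic() for a bvec iter */
	return i->ops->ii_copy_from_user_atomic(page, i, offset, bytes);
}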

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/iov-iter.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 19 ++++++++
2 files changed, 148 insertions(+)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 59f9556..5624e36 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -5,6 +5,7 @@
#include <linux/hardirq.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/bio.h>

static size_t __iovec_copy_to_user(char *vaddr, const struct iovec *iov,
size_t base, size_t bytes, int atomic)
@@ -109,6 +110,134 @@ success:
return copied;
}

+#ifdef CONFIG_BLOCK
+/*
+ * As an easily verifiable first pass, we implement all the methods that
+ * copy data to and from bvec pages with one function. We implement it
+ * all with kmap_atomic().
+ */
+static size_t bvec_copy_tofrom_page(struct iov_iter *iter, struct page *page,
+ unsigned long page_offset, size_t bytes,
+ int topage)
+{
+ struct bio_vec *bvec = (struct bio_vec *)iter->data;
+ size_t bvec_offset = iter->iov_offset;
+ size_t remaining = bytes;
+ void *bvec_map;
+ void *page_map;
+ size_t copy;
+
+ page_map = kmap_atomic(page);
+
+ BUG_ON(bytes > iter->count);
+ while (remaining) {
+ BUG_ON(bvec->bv_len == 0);
+ BUG_ON(bvec_offset >= bvec->bv_len);
+ copy = min(remaining, bvec->bv_len - bvec_offset);
+ bvec_map = kmap_atomic(bvec->bv_page);
+ if (topage)
+ memcpy(page_map + page_offset,
+ bvec_map + bvec->bv_offset + bvec_offset,
+ copy);
+ else
+ memcpy(bvec_map + bvec->bv_offset + bvec_offset,
+ page_map + page_offset,
+ copy);
+ kunmap_atomic(bvec_map);
+ remaining -= copy;
+ bvec_offset += copy;
+ page_offset += copy;
+ if (bvec_offset == bvec->bv_len) {
+ bvec_offset = 0;
+ bvec++;
+ }
+ }
+
+ kunmap_atomic(page_map);
+
+ return bytes;
+}
+
+static size_t ii_bvec_copy_to_user_atomic(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 0);
+}
+static size_t ii_bvec_copy_to_user(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes,
+ int check_access)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 0);
+}
+static size_t ii_bvec_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 1);
+}
+static size_t ii_bvec_copy_from_user(struct page *page, struct iov_iter *i,
+ unsigned long offset, size_t bytes)
+{
+ return bvec_copy_tofrom_page(i, page, offset, bytes, 1);
+}
+
+/*
+ * bio_vecs have a stricter structure than iovecs that might have
+ * come from userspace. There are no zero length bio_vec elements.
+ */
+static void ii_bvec_advance(struct iov_iter *i, size_t bytes)
+{
+ struct bio_vec *bvec = (struct bio_vec *)i->data;
+ size_t offset = i->iov_offset;
+ size_t delta;
+
+ BUG_ON(i->count < bytes);
+ while (bytes) {
+ BUG_ON(bvec->bv_len == 0);
+ BUG_ON(bvec->bv_len <= offset);
+ delta = min(bytes, bvec->bv_len - offset);
+ offset += delta;
+ i->count -= delta;
+ bytes -= delta;
+ if (offset == bvec->bv_len) {
+ bvec++;
+ offset = 0;
+ }
+ }
+
+ i->data = (unsigned long)bvec;
+ i->iov_offset = offset;
+}
+
+/*
+ * pages pointed to by bio_vecs are always pinned.
+ */
+static int ii_bvec_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ return 0;
+}
+
+static size_t ii_bvec_single_seg_count(const struct iov_iter *i)
+{
+ const struct bio_vec *bvec = (struct bio_vec *)i->data;
+ if (i->nr_segs == 1)
+ return i->count;
+ else
+ return min(i->count, bvec->bv_len - i->iov_offset);
+}
+
+struct iov_iter_ops ii_bvec_ops = {
+ .ii_copy_to_user_atomic = ii_bvec_copy_to_user_atomic,
+ .ii_copy_to_user = ii_bvec_copy_to_user,
+ .ii_copy_from_user_atomic = ii_bvec_copy_from_user_atomic,
+ .ii_copy_from_user = ii_bvec_copy_from_user,
+ .ii_advance = ii_bvec_advance,
+ .ii_fault_in_readable = ii_bvec_fault_in_readable,
+ .ii_single_seg_count = ii_bvec_single_seg_count,
+};
+EXPORT_SYMBOL(ii_bvec_ops);
+#endif /* CONFIG_BLOCK */
+
static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
size_t base, size_t bytes, int atomic)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 96120d5..c2cd17c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -349,6 +349,25 @@ static inline size_t iov_iter_single_seg_count(const struct iov_iter *i)
return i->ops->ii_single_seg_count(i);
}

+#ifdef CONFIG_BLOCK
+extern struct iov_iter_ops ii_bvec_ops;
+
+struct bio_vec;
+static inline void iov_iter_init_bvec(struct iov_iter *i,
+ struct bio_vec *bvec,
+ unsigned long nr_segs,
+ size_t count, size_t written)
+{
+ i->ops = &ii_bvec_ops;
+ i->data = (unsigned long)bvec;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count + written;
+
+ iov_iter_advance(i, written);
+}
+#endif
+
extern struct iov_iter_ops ii_iovec_ops;

static inline void iov_iter_init(struct iov_iter *i,
--
1.8.3.4

2013-07-25 17:59:53

by Dave Kleikamp

Subject: [PATCH V8 22/33] ext4: add support for read_iter and write_iter

Use generic_file_read_iter() for reads, and convert ext4_file_write() into
ext4_file_write_iter(), which takes an iov_iter rather than an iovec array.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: [email protected]
---
fs/ext4/file.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f4cc56..c25d48a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,12 +74,11 @@ void ext4_unwritten_wait(struct inode *inode)
* or one thread will zero the other's data, causing corruption.
*/
static int
-ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ext4_unaligned_aio(struct inode *inode, struct iov_iter *iter, loff_t pos)
{
struct super_block *sb = inode->i_sb;
int blockmask = sb->s_blocksize - 1;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);
loff_t final_size = pos + count;

if (pos >= inode->i_size)
@@ -92,8 +91,8 @@ ext4_unaligned_aio(struct inode *inode, const struct iovec *iov,
}

static ssize_t
-ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ext4_file_dio_write(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -101,11 +100,11 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
int unaligned_aio = 0;
ssize_t ret;
int overwrite = 0;
- size_t length = iov_length(iov, nr_segs);
+ size_t length = iov_iter_count(iter);

if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
!is_sync_kiocb(iocb))
- unaligned_aio = ext4_unaligned_aio(inode, iov, nr_segs, pos);
+ unaligned_aio = ext4_unaligned_aio(inode, iter, pos);

/* Unaligned direct AIO must be serialized; see comment above */
if (unaligned_aio) {
@@ -146,7 +145,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
overwrite = 1;
}

- ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+ ret = __generic_file_write_iter(iocb, iter, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);

if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -165,8 +164,7 @@ ext4_file_dio_write(struct kiocb *iocb, const struct iovec *iov,
}

static ssize_t
-ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
@@ -178,22 +176,24 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,

if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- size_t length = iov_length(iov, nr_segs);
+ size_t length = iov_iter_count(iter);

if ((pos > sbi->s_bitmap_maxbytes ||
(pos == sbi->s_bitmap_maxbytes && length > 0)))
return -EFBIG;

if (pos + length > sbi->s_bitmap_maxbytes) {
- nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
- sbi->s_bitmap_maxbytes - pos);
+ ret = iov_iter_shorten(iter,
+ sbi->s_bitmap_maxbytes - pos);
+ if (ret)
+ return ret;
}
}

if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
- ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
+ ret = ext4_file_dio_write(iocb, iter, pos);
else
- ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
+ ret = generic_file_write_iter(iocb, iter, pos);

return ret;
}
@@ -607,8 +607,8 @@ const struct file_operations ext4_file_operations = {
.llseek = ext4_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = ext4_file_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = ext4_file_write_iter,
.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext4_compat_ioctl,
--
1.8.3.4

2013-07-25 18:00:59

by Dave Kleikamp

Subject: [PATCH V8 29/33] udf: convert file ops from aio_read/write to read/write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/udf/file.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/udf/file.c b/fs/udf/file.c
index 339df8b..e392d60 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -133,8 +133,8 @@ const struct address_space_operations udf_adinicb_aops = {
.direct_IO = udf_adinicb_direct_IO,
};

-static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t ppos)
+static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t ppos)
{
ssize_t retval;
struct file *file = iocb->ki_filp;
@@ -168,7 +168,7 @@ static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
} else
up_write(&iinfo->i_data_sem);

- retval = generic_file_aio_write(iocb, iov, nr_segs, ppos);
+ retval = generic_file_write_iter(iocb, iter, ppos);
if (retval > 0)
mark_inode_dirty(inode);

@@ -242,12 +242,12 @@ static int udf_release_file(struct inode *inode, struct file *filp)

const struct file_operations udf_file_operations = {
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.unlocked_ioctl = udf_ioctl,
.open = generic_file_open,
.mmap = generic_file_mmap,
.write = do_sync_write,
- .aio_write = udf_file_aio_write,
+ .write_iter = udf_file_write_iter,
.release = udf_release_file,
.fsync = generic_file_fsync,
.splice_read = generic_file_splice_read,
--
1.8.3.4

2013-07-25 18:01:16

by Dave Kleikamp

Subject: [PATCH V8 30/33] afs: add support for read_iter and write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: David Howells <[email protected]>
Cc: [email protected]
---
fs/afs/file.c | 4 ++--
fs/afs/internal.h | 3 +--
fs/afs/write.c | 9 ++++-----
3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 66d50fe..3b71622 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -33,8 +33,8 @@ const struct file_operations afs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = afs_file_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = afs_file_write,
.mmap = generic_file_readonly_mmap,
.splice_read = generic_file_splice_read,
.fsync = afs_fsync,
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index a306bb6..9c048ff 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -747,8 +747,7 @@ extern int afs_write_end(struct file *file, struct address_space *mapping,
extern int afs_writepage(struct page *, struct writeback_control *);
extern int afs_writepages(struct address_space *, struct writeback_control *);
extern void afs_pages_written_back(struct afs_vnode *, struct afs_call *);
-extern ssize_t afs_file_write(struct kiocb *, const struct iovec *,
- unsigned long, loff_t);
+extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *, loff_t);
extern int afs_writeback_all(struct afs_vnode *);
extern int afs_fsync(struct file *, loff_t, loff_t, int);

diff --git a/fs/afs/write.c b/fs/afs/write.c
index a890db4..9fa2f59 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -625,15 +625,14 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
/*
* write to an AFS file
*/
-ssize_t afs_file_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct afs_vnode *vnode = AFS_FS_I(file_inode(iocb->ki_filp));
ssize_t result;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);

_enter("{%x.%u},{%zu},%lu,",
- vnode->fid.vid, vnode->fid.vnode, count, nr_segs);
+ vnode->fid.vid, vnode->fid.vnode, count, iter->nr_segs);

if (IS_SWAPFILE(&vnode->vfs_inode)) {
printk(KERN_INFO
@@ -644,7 +643,7 @@ ssize_t afs_file_write(struct kiocb *iocb, const struct iovec *iov,
if (!count)
return 0;

- result = generic_file_aio_write(iocb, iov, nr_segs, pos);
+ result = generic_file_write_iter(iocb, iter, pos);
if (IS_ERR_VALUE(result)) {
_leave(" = %zd", result);
return result;
--
1.8.3.4

2013-07-25 17:59:43

by Dave Kleikamp

Subject: [PATCH V8 28/33] gfs2: Convert aio_read/write ops to read/write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: [email protected]
---
fs/gfs2/file.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 72c3866..23cbdd4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -679,10 +679,9 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
}

/**
- * gfs2_file_aio_write - Perform a write to a file
+ * gfs2_file_write_iter - Perform a write to a file
* @iocb: The io context
- * @iov: The data to write
- * @nr_segs: Number of @iov segments
+ * @iter: The data to write
* @pos: The file position
*
* We have to do a lock/unlock here to refresh the inode size for
@@ -692,11 +691,11 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
*
*/

-static ssize_t gfs2_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
struct file *file = iocb->ki_filp;
- size_t writesize = iov_length(iov, nr_segs);
+ size_t writesize = iov_iter_count(iter);
struct gfs2_inode *ip = GFS2_I(file_inode(file));
int ret;

@@ -715,7 +714,7 @@ static ssize_t gfs2_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
gfs2_glock_dq_uninit(&gh);
}

- return generic_file_aio_write(iocb, iov, nr_segs, pos);
+ return generic_file_write_iter(iocb, iter, pos);
}

static int fallocate_chunk(struct inode *inode, loff_t offset, loff_t len,
@@ -1047,9 +1046,9 @@ static int gfs2_flock(struct file *file, int cmd, struct file_lock *fl)
const struct file_operations gfs2_file_fops = {
.llseek = gfs2_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = gfs2_file_aio_write,
+ .write_iter = gfs2_file_write_iter,
.unlocked_ioctl = gfs2_ioctl,
.mmap = gfs2_mmap,
.open = gfs2_open,
@@ -1079,9 +1078,9 @@ const struct file_operations gfs2_dir_fops = {
const struct file_operations gfs2_file_fops_nolock = {
.llseek = gfs2_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = gfs2_file_aio_write,
+ .write_iter = gfs2_file_write_iter,
.unlocked_ioctl = gfs2_ioctl,
.mmap = gfs2_mmap,
.open = gfs2_open,
--
1.8.3.4

2013-07-25 18:04:31

by Dave Kleikamp

Subject: [PATCH V8 05/33] fuse: convert fuse to use iov_iter_copy_[to|from]_user

A future patch hides the internals of struct iov_iter, so fuse should
be using the supported interface.
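
Condensed, the caller-side pattern the patch switches to is the
following (mirroring the diff below; as used here, a nonzero 'left'
means part of the user memory was inaccessible):

        if (!to_user)
                left = iov_iter_copy_from_user(page, &ii, 0, todo);
        else
                left = iov_iter_copy_to_user(page, &ii, 0, todo);
        if (unlikely(left))
                return -EFAULT;
        iov_iter_advance(&ii, todo);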

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Cc: [email protected]
---
fs/fuse/file.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5c121fe..633766c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1861,30 +1861,17 @@ static int fuse_ioctl_copy_user(struct page **pages, struct iovec *iov,
while (iov_iter_count(&ii)) {
struct page *page = pages[page_idx++];
size_t todo = min_t(size_t, PAGE_SIZE, iov_iter_count(&ii));
- void *kaddr;
+ size_t left;

- kaddr = kmap(page);
-
- while (todo) {
- char __user *uaddr = ii.iov->iov_base + ii.iov_offset;
- size_t iov_len = ii.iov->iov_len - ii.iov_offset;
- size_t copy = min(todo, iov_len);
- size_t left;
-
- if (!to_user)
- left = copy_from_user(kaddr, uaddr, copy);
- else
- left = copy_to_user(uaddr, kaddr, copy);
-
- if (unlikely(left))
- return -EFAULT;
+ if (!to_user)
+ left = iov_iter_copy_from_user(page, &ii, 0, todo);
+ else
+ left = iov_iter_copy_to_user(page, &ii, 0, todo);

- iov_iter_advance(&ii, copy);
- todo -= copy;
- kaddr += copy;
- }
+ if (unlikely(left))
+ return -EFAULT;

- kunmap(page);
+ iov_iter_advance(&ii, todo);
}

return 0;
--
1.8.3.4

2013-07-25 18:04:44

by Dave Kleikamp

Subject: [PATCH V8 17/33] loop: use aio to perform io on the underlying file

From: Zach Brown <[email protected]>

This uses the new kernel aio interface to process loopback IO by
submitting concurrent direct aio requests. Previously, loop's IO was
serialized by synchronous processing in a thread.

The aio operations specify the memory for the IO with the bio_vec arrays
directly instead of mappings of the pages.

The use of aio operations is enabled when the backing file supports the
read_iter, write_iter and direct_IO methods.

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
drivers/block/loop.c | 148 +++++++++++++++++++++++++++++++++-------------
include/uapi/linux/loop.h | 1 +
2 files changed, 109 insertions(+), 40 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 40e7155..66e3ccf 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -75,6 +75,7 @@
#include <linux/sysfs.h>
#include <linux/miscdevice.h>
#include <linux/falloc.h>
+#include <linux/aio.h>
#include "loop.h"

#include <asm/uaccess.h>
@@ -218,6 +219,48 @@ lo_do_transfer(struct loop_device *lo, int cmd,
return lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock);
}

+#ifdef CONFIG_AIO
+static void lo_rw_aio_complete(u64 data, long res)
+{
+ struct bio *bio = (struct bio *)(uintptr_t)data;
+
+ if (res > 0)
+ res = 0;
+ else if (res < 0)
+ res = -EIO;
+
+ bio_endio(bio, res);
+}
+
+static int lo_rw_aio(struct loop_device *lo, struct bio *bio)
+{
+ struct file *file = lo->lo_backing_file;
+ struct kiocb *iocb;
+ unsigned short op;
+ struct iov_iter iter;
+ struct bio_vec *bvec;
+ size_t nr_segs;
+ loff_t pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;
+
+ iocb = aio_kernel_alloc(GFP_NOIO);
+ if (!iocb)
+ return -ENOMEM;
+
+ if (bio_rw(bio) & WRITE)
+ op = IOCB_CMD_WRITE_ITER;
+ else
+ op = IOCB_CMD_READ_ITER;
+
+ bvec = bio_iovec_idx(bio, bio->bi_idx);
+ nr_segs = bio_segments(bio);
+ iov_iter_init_bvec(&iter, bvec, nr_segs, bvec_length(bvec, nr_segs), 0);
+ aio_kernel_init_iter(iocb, file, op, &iter, pos);
+ aio_kernel_init_callback(iocb, lo_rw_aio_complete, (u64)(uintptr_t)bio);
+
+ return aio_kernel_submit(iocb);
+}
+#endif /* CONFIG_AIO */
+
/**
* __do_lo_send_write - helper for writing data to a loop device
*
@@ -418,50 +461,33 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio)
pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

if (bio_rw(bio) == WRITE) {
- struct file *file = lo->lo_backing_file;
-
- if (bio->bi_rw & REQ_FLUSH) {
- ret = vfs_fsync(file, 0);
- if (unlikely(ret && ret != -EINVAL)) {
- ret = -EIO;
- goto out;
- }
- }
+ ret = lo_send(lo, bio, pos);
+ } else
+ ret = lo_receive(lo, bio, lo->lo_blocksize, pos);

- /*
- * We use punch hole to reclaim the free space used by the
- * image a.k.a. discard. However we do not support discard if
- * encryption is enabled, because it may give an attacker
- * useful information.
- */
- if (bio->bi_rw & REQ_DISCARD) {
- struct file *file = lo->lo_backing_file;
- int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+ return ret;
+}

- if ((!file->f_op->fallocate) ||
- lo->lo_encrypt_key_size) {
- ret = -EOPNOTSUPP;
- goto out;
- }
- ret = file->f_op->fallocate(file, mode, pos,
- bio->bi_size);
- if (unlikely(ret && ret != -EINVAL &&
- ret != -EOPNOTSUPP))
- ret = -EIO;
- goto out;
- }
+static int lo_discard(struct loop_device *lo, struct bio *bio)
+{
+ struct file *file = lo->lo_backing_file;
+ int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+ loff_t pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;
+ int ret;

- ret = lo_send(lo, bio, pos);
+ /*
+ * We use punch hole to reclaim the free space used by the
+ * image a.k.a. discard. However we do not support discard if
+ * encryption is enabled, because it may give an attacker
+ * useful information.
+ */

- if ((bio->bi_rw & REQ_FUA) && !ret) {
- ret = vfs_fsync(file, 0);
- if (unlikely(ret && ret != -EINVAL))
- ret = -EIO;
- }
- } else
- ret = lo_receive(lo, bio, lo->lo_blocksize, pos);
+ if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
+ return -EOPNOTSUPP;

-out:
+ ret = file->f_op->fallocate(file, mode, pos, bio->bi_size);
+ if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
+ ret = -EIO;
return ret;
}

@@ -525,7 +551,35 @@ static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio)
do_loop_switch(lo, bio->bi_private);
bio_put(bio);
} else {
- int ret = do_bio_filebacked(lo, bio);
+ int ret;
+
+ if (bio_rw(bio) == WRITE) {
+ if (bio->bi_rw & REQ_FLUSH) {
+ ret = vfs_fsync(lo->lo_backing_file, 1);
+ if (unlikely(ret && ret != -EINVAL))
+ goto out;
+ }
+ if (bio->bi_rw & REQ_DISCARD) {
+ ret = lo_discard(lo, bio);
+ goto out;
+ }
+ }
+#ifdef CONFIG_AIO
+ if (lo->lo_flags & LO_FLAGS_USE_AIO &&
+ lo->transfer == transfer_none) {
+ ret = lo_rw_aio(lo, bio);
+ if (ret == 0)
+ return;
+ } else
+#endif
+ ret = do_bio_filebacked(lo, bio);
+
+ if ((bio_rw(bio) == WRITE) && bio->bi_rw & REQ_FUA && !ret) {
+ ret = vfs_fsync(lo->lo_backing_file, 0);
+ if (unlikely(ret && ret != -EINVAL))
+ ret = -EIO;
+ }
+out:
bio_endio(bio, ret);
}
}
@@ -547,6 +601,12 @@ static int loop_thread(void *data)
struct loop_device *lo = data;
struct bio *bio;

+ /*
+ * In cases where the underlying filesystem calls balance_dirty_pages()
+ * we want less throttling to avoid lock ups trying to write dirty
+ * pages through the loop device
+ */
+ current->flags |= PF_LESS_THROTTLE;
set_user_nice(current, -20);

while (!kthread_should_stop() || !bio_list_empty(&lo->lo_bio_list)) {
@@ -869,6 +929,14 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
!file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;

+#ifdef CONFIG_AIO
+ if (file->f_op->write_iter && file->f_op->read_iter &&
+ mapping->a_ops->direct_IO) {
+ file->f_flags |= O_DIRECT;
+ lo_flags |= LO_FLAGS_USE_AIO;
+ }
+#endif
+
lo_blocksize = S_ISBLK(inode->i_mode) ?
inode->i_bdev->bd_block_size : PAGE_SIZE;

diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
index e0cecd2..6edc6b6 100644
--- a/include/uapi/linux/loop.h
+++ b/include/uapi/linux/loop.h
@@ -21,6 +21,7 @@ enum {
LO_FLAGS_READ_ONLY = 1,
LO_FLAGS_AUTOCLEAR = 4,
LO_FLAGS_PARTSCAN = 8,
+ LO_FLAGS_USE_AIO = 16,
};

#include <asm/posix_types.h> /* for __kernel_old_dev_t */
--
1.8.3.4

2013-07-25 18:04:42

by Dave Kleikamp

Subject: [PATCH V8 21/33] ocfs2: add support for read_iter and write_iter

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Joel Becker <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: [email protected]
---
fs/ocfs2/aops.h | 2 +-
fs/ocfs2/file.c | 55 ++++++++++++++++++++++----------------------------
fs/ocfs2/ocfs2_trace.h | 6 +++---
3 files changed, 28 insertions(+), 35 deletions(-)

diff --git a/fs/ocfs2/aops.h b/fs/ocfs2/aops.h
index f671e49..573f41d 100644
--- a/fs/ocfs2/aops.h
+++ b/fs/ocfs2/aops.h
@@ -74,7 +74,7 @@ static inline void ocfs2_iocb_set_rw_locked(struct kiocb *iocb, int level)
/*
* Using a named enum representing lock types in terms of #N bit stored in
* iocb->private, which is going to be used for communication between
- * ocfs2_dio_end_io() and ocfs2_file_aio_write/read().
+ * ocfs2_dio_end_io() and ocfs2_file_write/read_iter().
*/
enum ocfs2_iocb_lock_bits {
OCFS2_IOCB_RW_LOCK = 0,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 41000f2..d2d203b 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2220,15 +2220,13 @@ out:
return ret;
}

-static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs,
- loff_t pos)
+static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
+ struct iov_iter *iter,
+ loff_t pos)
{
int ret, direct_io, appending, rw_level, have_alloc_sem = 0;
int can_do_direct, has_refcount = 0;
ssize_t written = 0;
- size_t ocount; /* original count */
size_t count; /* after file limit checks */
loff_t old_size, *ppos = &iocb->ki_pos;
u32 old_clusters;
@@ -2239,11 +2237,11 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
OCFS2_MOUNT_COHERENCY_BUFFERED);
int unaligned_dio = 0;

- trace_ocfs2_file_aio_write(inode, file, file->f_path.dentry,
+ trace_ocfs2_file_write_iter(inode, file, file->f_path.dentry,
(unsigned long long)OCFS2_I(inode)->ip_blkno,
file->f_path.dentry->d_name.len,
file->f_path.dentry->d_name.name,
- (unsigned int)nr_segs);
+ (unsigned long long)pos);

if (iocb->ki_left == 0)
return 0;
@@ -2343,28 +2341,24 @@ relock:
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_rw_locked(iocb, rw_level);

- ret = generic_segment_checks(iov, &nr_segs, &ocount,
- VERIFY_READ);
- if (ret)
- goto out_dio;

- count = ocount;
+ count = iov_iter_count(iter);
ret = generic_write_checks(file, ppos, &count,
S_ISBLK(inode->i_mode));
if (ret)
goto out_dio;

if (direct_io) {
- written = generic_file_direct_write(iocb, iov, &nr_segs, *ppos,
- ppos, count, ocount);
+ written = generic_file_direct_write_iter(iocb, iter, *ppos,
+ ppos, count);
if (written < 0) {
ret = written;
goto out_dio;
}
} else {
current->backing_dev_info = file->f_mapping->backing_dev_info;
- written = generic_file_buffered_write(iocb, iov, nr_segs, *ppos,
- ppos, count, 0);
+ written = generic_file_buffered_write_iter(iocb, iter, *ppos,
+ ppos, count, 0);
current->backing_dev_info = NULL;
}

@@ -2520,7 +2514,7 @@ static ssize_t ocfs2_file_splice_read(struct file *in,
in->f_path.dentry->d_name.name, len);

/*
- * See the comment in ocfs2_file_aio_read()
+ * See the comment in ocfs2_file_read_iter()
*/
ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level);
if (ret < 0) {
@@ -2535,19 +2529,18 @@ bail:
return ret;
}

-static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs,
- loff_t pos)
+static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
+ struct iov_iter *iter,
+ loff_t pos)
{
int ret = 0, rw_level = -1, have_alloc_sem = 0, lock_level = 0;
struct file *filp = iocb->ki_filp;
struct inode *inode = file_inode(filp);

- trace_ocfs2_file_aio_read(inode, filp, filp->f_path.dentry,
+ trace_ocfs2_file_read_iter(inode, filp, filp->f_path.dentry,
(unsigned long long)OCFS2_I(inode)->ip_blkno,
filp->f_path.dentry->d_name.len,
- filp->f_path.dentry->d_name.name, nr_segs);
+ filp->f_path.dentry->d_name.name, pos);


if (!inode) {
@@ -2583,7 +2576,7 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
*
* Take and drop the meta data lock to update inode fields
* like i_size. This allows the checks down below
- * generic_file_aio_read() a chance of actually working.
+ * generic_file_read_iter() a chance of actually working.
*/
ret = ocfs2_inode_lock_atime(inode, filp->f_path.mnt, &lock_level);
if (ret < 0) {
@@ -2592,13 +2585,13 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb,
}
ocfs2_inode_unlock(inode, lock_level);

- ret = generic_file_aio_read(iocb, iov, nr_segs, iocb->ki_pos);
- trace_generic_file_aio_read_ret(ret);
+ ret = generic_file_read_iter(iocb, iter, iocb->ki_pos);
+ trace_generic_file_read_iter_ret(ret);

/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));

- /* see ocfs2_file_aio_write */
+ /* see ocfs2_file_write_iter */
if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
rw_level = -1;
have_alloc_sem = 0;
@@ -2686,8 +2679,8 @@ const struct file_operations ocfs2_fops = {
.fsync = ocfs2_sync_file,
.release = ocfs2_file_release,
.open = ocfs2_file_open,
- .aio_read = ocfs2_file_aio_read,
- .aio_write = ocfs2_file_aio_write,
+ .read_iter = ocfs2_file_read_iter,
+ .write_iter = ocfs2_file_write_iter,
.unlocked_ioctl = ocfs2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ocfs2_compat_ioctl,
@@ -2734,8 +2727,8 @@ const struct file_operations ocfs2_fops_no_plocks = {
.fsync = ocfs2_sync_file,
.release = ocfs2_file_release,
.open = ocfs2_file_open,
- .aio_read = ocfs2_file_aio_read,
- .aio_write = ocfs2_file_aio_write,
+ .read_iter = ocfs2_file_read_iter,
+ .write_iter = ocfs2_file_write_iter,
.unlocked_ioctl = ocfs2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ocfs2_compat_ioctl,
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index 3b481f4..1c5018c 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1310,13 +1310,13 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_release);

DEFINE_OCFS2_FILE_OPS(ocfs2_sync_file);

-DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_write);
+DEFINE_OCFS2_FILE_OPS(ocfs2_file_write_iter);

DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);

DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);

-DEFINE_OCFS2_FILE_OPS(ocfs2_file_aio_read);
+DEFINE_OCFS2_FILE_OPS(ocfs2_file_read_iter);

DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);

@@ -1474,7 +1474,7 @@ TRACE_EVENT(ocfs2_prepare_inode_for_write,
__entry->direct_io, __entry->has_refcount)
);

-DEFINE_OCFS2_INT_EVENT(generic_file_aio_read_ret);
+DEFINE_OCFS2_INT_EVENT(generic_file_read_iter_ret);

/* End of trace events for fs/ocfs2/file.c. */

--
1.8.3.4

2013-07-25 18:04:40

by Dave Kleikamp

Subject: [PATCH V8 16/33] bio: add bvec_length(), like iov_length()

From: Zach Brown <[email protected]>
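
bvec_length() gives patch 17 (loop) a way to size a bio_vec-backed
iov_iter, e.g.:

        nr_segs = bio_segments(bio);
        iov_iter_init_bvec(&iter, bvec, nr_segs,
                           bvec_length(bvec, nr_segs), 0);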

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
include/linux/bio.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ec48bac..4fd5253 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -307,6 +307,14 @@ extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned long *, mempool_t *);
extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);

+static inline ssize_t bvec_length(const struct bio_vec *bvec, unsigned long nr)
+{
+ ssize_t bytes = 0;
+ while (nr--)
+ bytes += (bvec++)->bv_len;
+ return bytes;
+}
+
#ifdef CONFIG_BLK_CGROUP
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
--
1.8.3.4

2013-07-25 18:04:37

by Dave Kleikamp

Subject: [PATCH V8 01/33] iov_iter: move into its own file

From: Zach Brown <[email protected]>

This moves the iov_iter functions into their own file. We're going to
be working on them in upcoming patches. They have become sufficiently
large, and remain self-contained, to justify separating them from the
rest of the huge mm/filemap.c.

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Jeff Moyer <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/Makefile | 2 +-
fs/iov-iter.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/filemap.c | 144 -------------------------------------------------------
3 files changed, 152 insertions(+), 145 deletions(-)
create mode 100644 fs/iov-iter.c

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..1afa0e0 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o \
- stack.o fs_struct.o statfs.o
+ stack.o fs_struct.o statfs.o iov-iter.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/iov-iter.c b/fs/iov-iter.c
new file mode 100644
index 0000000..52c23d9
--- /dev/null
+++ b/fs/iov-iter.c
@@ -0,0 +1,151 @@
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/uio.h>
+#include <linux/hardirq.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
+ const struct iovec *iov, size_t base, size_t bytes)
+{
+ size_t copied = 0, left = 0;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + base;
+ int copy = min(bytes, iov->iov_len - base);
+
+ base = 0;
+ left = __copy_from_user_inatomic(vaddr, buf, copy);
+ copied += copy;
+ bytes -= copy;
+ vaddr += copy;
+ iov++;
+
+ if (unlikely(left))
+ break;
+ }
+ return copied - left;
+}
+
+/*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied. If a fault is encountered then return the number
+ * of bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ BUG_ON(!in_atomic());
+ kaddr = kmap_atomic(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap_atomic(kaddr);
+
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap(page);
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_from_user);
+
+void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+ BUG_ON(i->count < bytes);
+
+ if (likely(i->nr_segs == 1)) {
+ i->iov_offset += bytes;
+ i->count -= bytes;
+ } else {
+ const struct iovec *iov = i->iov;
+ size_t base = i->iov_offset;
+ unsigned long nr_segs = i->nr_segs;
+
+ /*
+ * The !iov->iov_len check ensures we skip over unlikely
+ * zero-length segments (without overruning the iovec).
+ */
+ while (bytes || unlikely(i->count && !iov->iov_len)) {
+ int copy;
+
+ copy = min(bytes, iov->iov_len - base);
+ BUG_ON(!i->count || i->count < copy);
+ i->count -= copy;
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ nr_segs--;
+ base = 0;
+ }
+ }
+ i->iov = iov;
+ i->iov_offset = base;
+ i->nr_segs = nr_segs;
+ }
+}
+EXPORT_SYMBOL(iov_iter_advance);
+
+/*
+ * Fault in the first iovec of the given iov_iter, to a maximum length
+ * of bytes. Returns 0 on success, or non-zero if the memory could not be
+ * accessed (ie. because it is an invalid address).
+ *
+ * writev-intensive code may want this to prefault several iovecs -- that
+ * would be possible (callers must not rely on the fact that _only_ the
+ * first iovec will be faulted with the current implementation).
+ */
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ bytes = min(bytes, i->iov->iov_len - i->iov_offset);
+ return fault_in_pages_readable(buf, bytes);
+}
+EXPORT_SYMBOL(iov_iter_fault_in_readable);
+
+/*
+ * Return the count of just the current iov_iter segment.
+ */
+size_t iov_iter_single_seg_count(const struct iov_iter *i)
+{
+ const struct iovec *iov = i->iov;
+ if (i->nr_segs == 1)
+ return i->count;
+ else
+ return min(i->count, iov->iov_len - i->iov_offset);
+}
+EXPORT_SYMBOL(iov_iter_single_seg_count);
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b51ac1..11ebe36 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1941,150 +1941,6 @@ struct page *read_cache_page(struct address_space *mapping,
}
EXPORT_SYMBOL(read_cache_page);

-static size_t __iovec_copy_from_user_inatomic(char *vaddr,
- const struct iovec *iov, size_t base, size_t bytes)
-{
- size_t copied = 0, left = 0;
-
- while (bytes) {
- char __user *buf = iov->iov_base + base;
- int copy = min(bytes, iov->iov_len - base);
-
- base = 0;
- left = __copy_from_user_inatomic(vaddr, buf, copy);
- copied += copy;
- bytes -= copy;
- vaddr += copy;
- iov++;
-
- if (unlikely(left))
- break;
- }
- return copied - left;
-}
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were successfully copied. If a fault is encountered then return the number of
- * bytes which were copied.
- */
-size_t iov_iter_copy_from_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- BUG_ON(!in_atomic());
- kaddr = kmap_atomic(page);
- if (likely(i->nr_segs == 1)) {
- int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
- left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
- }
- kunmap_atomic(kaddr);
-
- return copied;
-}
-EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
-
-/*
- * This has the same sideeffects and return value as
- * iov_iter_copy_from_user_atomic().
- * The difference is that it attempts to resolve faults.
- * Page must not be locked.
- */
-size_t iov_iter_copy_from_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- kaddr = kmap(page);
- if (likely(i->nr_segs == 1)) {
- int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
- left = __copy_from_user(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
- }
- kunmap(page);
- return copied;
-}
-EXPORT_SYMBOL(iov_iter_copy_from_user);
-
-void iov_iter_advance(struct iov_iter *i, size_t bytes)
-{
- BUG_ON(i->count < bytes);
-
- if (likely(i->nr_segs == 1)) {
- i->iov_offset += bytes;
- i->count -= bytes;
- } else {
- const struct iovec *iov = i->iov;
- size_t base = i->iov_offset;
- unsigned long nr_segs = i->nr_segs;
-
- /*
- * The !iov->iov_len check ensures we skip over unlikely
- * zero-length segments (without overruning the iovec).
- */
- while (bytes || unlikely(i->count && !iov->iov_len)) {
- int copy;
-
- copy = min(bytes, iov->iov_len - base);
- BUG_ON(!i->count || i->count < copy);
- i->count -= copy;
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- nr_segs--;
- base = 0;
- }
- }
- i->iov = iov;
- i->iov_offset = base;
- i->nr_segs = nr_segs;
- }
-}
-EXPORT_SYMBOL(iov_iter_advance);
-
-/*
- * Fault in the first iovec of the given iov_iter, to a maximum length
- * of bytes. Returns 0 on success, or non-zero if the memory could not be
- * accessed (ie. because it is an invalid address).
- *
- * writev-intensive code may want this to prefault several iovecs -- that
- * would be possible (callers must not rely on the fact that _only_ the
- * first iovec will be faulted with the current implementation).
- */
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
-{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
- return fault_in_pages_readable(buf, bytes);
-}
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
-
-/*
- * Return the count of just the current iov_iter segment.
- */
-size_t iov_iter_single_seg_count(const struct iov_iter *i)
-{
- const struct iovec *iov = i->iov;
- if (i->nr_segs == 1)
- return i->count;
- else
- return min(i->count, iov->iov_len - i->iov_offset);
-}
-EXPORT_SYMBOL(iov_iter_single_seg_count);
-
/*
* Performs necessary checks before doing a write
*
--
1.8.3.4

2013-07-25 18:04:35

by Dave Kleikamp

Subject: [PATCH V8 25/33] btrfs: add support for read_iter and write_iter

btrfs can use generic_file_read_iter(). Base btrfs_file_write_iter()
on btrfs_file_aio_write(), then have the latter call the former.
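
As elsewhere in the series, the write path now receives a ready-made
iov_iter, so the per-filesystem iovec bookkeeping collapses to
(condensed from the diff below):

        count = iov_iter_count(iter);   /* was generic_segment_checks() */
        /* ... generic_write_checks() and locking unchanged ... */
        num_written = __btrfs_buffered_write(file, iter, pos);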

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: [email protected]
---
fs/btrfs/file.c | 42 ++++++++++++++----------------------------
1 file changed, 14 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a005fe2..ca28faa 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -453,7 +453,7 @@ static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
write_bytes -= copied;
total_copied += copied;

- /* Return to btrfs_file_aio_write to fault page */
+ /* Return to btrfs_file_write_iter to fault page */
if (unlikely(copied == 0))
break;

@@ -1548,27 +1548,23 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
}

static ssize_t __btrfs_direct_write(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs, loff_t pos,
- loff_t *ppos, size_t count, size_t ocount)
+ struct iov_iter *iter, loff_t pos,
+ loff_t *ppos, size_t count)
{
struct file *file = iocb->ki_filp;
- struct iov_iter i;
ssize_t written;
ssize_t written_buffered;
loff_t endbyte;
int err;

- written = generic_file_direct_write(iocb, iov, &nr_segs, pos, ppos,
- count, ocount);
+ written = generic_file_direct_write_iter(iocb, iter, pos, ppos, count);

if (written < 0 || written == count)
return written;

pos += written;
count -= written;
- iov_iter_init(&i, iov, nr_segs, count, written);
- written_buffered = __btrfs_buffered_write(file, &i, pos);
+ written_buffered = __btrfs_buffered_write(file, iter, pos);
if (written_buffered < 0) {
err = written_buffered;
goto out;
@@ -1603,9 +1599,8 @@ static void update_time_for_write(struct inode *inode)
inode_inc_iversion(inode);
}

-static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
- const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
+ struct iov_iter *iter, loff_t pos)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
@@ -1614,17 +1609,12 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
u64 start_pos;
ssize_t num_written = 0;
ssize_t err = 0;
- size_t count, ocount;
+ size_t count;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);

mutex_lock(&inode->i_mutex);

- err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
- if (err) {
- mutex_unlock(&inode->i_mutex);
- goto out;
- }
- count = ocount;
+ count = iov_iter_count(iter);

current->backing_dev_info = inode->i_mapping->backing_dev_info;
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
@@ -1677,14 +1667,10 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
atomic_inc(&BTRFS_I(inode)->sync_writers);

if (unlikely(file->f_flags & O_DIRECT)) {
- num_written = __btrfs_direct_write(iocb, iov, nr_segs,
- pos, ppos, count, ocount);
+ num_written = __btrfs_direct_write(iocb, iter, pos, ppos,
+ count);
} else {
- struct iov_iter i;
-
- iov_iter_init(&i, iov, nr_segs, count, num_written);
-
- num_written = __btrfs_buffered_write(file, &i, pos);
+ num_written = __btrfs_buffered_write(file, iter, pos);
if (num_written > 0)
*ppos = pos + num_written;
}
@@ -2543,9 +2529,9 @@ const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
.splice_read = generic_file_splice_read,
- .aio_write = btrfs_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = btrfs_file_write_iter,
.mmap = btrfs_file_mmap,
.open = generic_file_open,
.release = btrfs_release_file,
--
1.8.3.4

2013-07-25 18:04:26

by Dave Kleikamp

Subject: [PATCH V8 02/33] iov_iter: iov_iter_copy_from_user() should use non-atomic copy

Signed-off-by: Dave Kleikamp <[email protected]>
---
fs/iov-iter.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 52c23d9..563a6ba 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -6,8 +6,8 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>

-static size_t __iovec_copy_from_user_inatomic(char *vaddr,
- const struct iovec *iov, size_t base, size_t bytes)
+static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
+ size_t base, size_t bytes, int atomic)
{
size_t copied = 0, left = 0;

@@ -16,7 +16,10 @@ static size_t __iovec_copy_from_user_inatomic(char *vaddr,
int copy = min(bytes, iov->iov_len - base);

base = 0;
- left = __copy_from_user_inatomic(vaddr, buf, copy);
+ if (atomic)
+ left = __copy_from_user_inatomic(vaddr, buf, copy);
+ else
+ left = __copy_from_user(vaddr, buf, copy);
copied += copy;
bytes -= copy;
vaddr += copy;
@@ -47,8 +50,8 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ copied = __iovec_copy_from_user(kaddr + offset, i->iov,
+ i->iov_offset, bytes, 1);
}
kunmap_atomic(kaddr);

@@ -75,8 +78,8 @@ size_t iov_iter_copy_from_user(struct page *page,
left = __copy_from_user(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_from_user_inatomic(kaddr + offset,
- i->iov, i->iov_offset, bytes);
+ copied = __iovec_copy_from_user(kaddr + offset, i->iov,
+ i->iov_offset, bytes, 0);
}
kunmap(page);
return copied;
--
1.8.3.4

2013-07-25 18:04:23

by Dave Kleikamp

Subject: [PATCH V8 26/33] block_dev: add support for read_iter, write_iter

From: Asias He <[email protected]>

Signed-off-by: Asias He <[email protected]>
Signed-off-by: Dave Kleikamp <[email protected]>
---
drivers/char/raw.c | 4 ++--
fs/block_dev.c | 19 +++++++++----------
include/linux/fs.h | 4 ++--
3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index f3223aa..db5fa4e 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -285,9 +285,9 @@ static long raw_ctl_compat_ioctl(struct file *file, unsigned int cmd,

static const struct file_operations raw_fops = {
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = blkdev_aio_write,
+ .write_iter = blkdev_write_iter,
.fsync = blkdev_fsync,
.open = raw_open,
.release = raw_release,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6f8c9e4..89d8ec5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1508,8 +1508,7 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
* Does not take i_mutex for the write and thus is not for general purpose
* use.
*/
-ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct file *file = iocb->ki_filp;
struct blk_plug plug;
@@ -1518,7 +1517,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
BUG_ON(iocb->ki_pos != pos);

blk_start_plug(&plug);
- ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
+ ret = __generic_file_write_iter(iocb, iter, &iocb->ki_pos);
if (ret > 0 || ret == -EIOCBQUEUED) {
ssize_t err;

@@ -1529,10 +1528,10 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
blk_finish_plug(&plug);
return ret;
}
-EXPORT_SYMBOL_GPL(blkdev_aio_write);
+EXPORT_SYMBOL_GPL(blkdev_write_iter);

-static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
struct file *file = iocb->ki_filp;
struct inode *bd_inode = file->f_mapping->host;
@@ -1543,8 +1542,8 @@ static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,

size -= pos;
if (size < iocb->ki_left)
- nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
- return generic_file_aio_read(iocb, iov, nr_segs, pos);
+ iov_iter_shorten(iter, size);
+ return generic_file_read_iter(iocb, iter, pos);
}

/*
@@ -1578,8 +1577,8 @@ const struct file_operations def_blk_fops = {
.llseek = block_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = blkdev_aio_read,
- .aio_write = blkdev_aio_write,
+ .read_iter = blkdev_read_iter,
+ .write_iter = blkdev_write_iter,
.mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 06f2290..269ef07 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2526,8 +2526,8 @@ extern int generic_segment_checks(const struct iovec *iov,
unsigned long *nr_segs, size_t *count, int access_flags);

/* fs/block_dev.c */
-extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos);
+extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos);
extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
int datasync);
extern void block_sync_page(struct page *page);
--
1.8.3.4

2013-07-25 17:51:59

by Dave Kleikamp

Subject: [PATCH V8 09/33] iov_iter: add a shorten call

From: Zach Brown <[email protected]>

The generic direct write path wants to shorten its memory vector. It
does this when it finds that it has to perform a partial write due to
RLIMIT_FSIZE. .direct_IO() always performs IO on all of the referenced
memory because it doesn't have an argument to specify the length of the
IO.

We add an iov_iter operation for this so that the generic path can ask
to shorten the memory vector without having to know what kind it is.
We're happy to shorten the kernel copy of the iovec array, but we refuse
to shorten the bio_vec array and return an error in this case.
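
A sketch of the intended use in the generic direct write path, assuming
generic_write_checks() has already trimmed 'count':

        if (count < iov_iter_count(iter)) {
                /* refuses with -EINVAL for bio_vec-backed iterators */
                ret = iov_iter_shorten(iter, count);
                if (ret)
                        return ret;
        }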

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/iov-iter.c | 15 +++++++++++++++
include/linux/fs.h | 5 +++++
2 files changed, 20 insertions(+)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 5624e36..ec461c8 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -226,6 +226,11 @@ static size_t ii_bvec_single_seg_count(const struct iov_iter *i)
return min(i->count, bvec->bv_len - i->iov_offset);
}

+static int ii_bvec_shorten(struct iov_iter *i, size_t count)
+{
+ return -EINVAL;
+}
+
struct iov_iter_ops ii_bvec_ops = {
.ii_copy_to_user_atomic = ii_bvec_copy_to_user_atomic,
.ii_copy_to_user = ii_bvec_copy_to_user,
@@ -234,6 +239,7 @@ struct iov_iter_ops ii_bvec_ops = {
.ii_advance = ii_bvec_advance,
.ii_fault_in_readable = ii_bvec_fault_in_readable,
.ii_single_seg_count = ii_bvec_single_seg_count,
+ .ii_shorten = ii_bvec_shorten,
};
EXPORT_SYMBOL(ii_bvec_ops);
#endif /* CONFIG_BLOCK */
@@ -384,6 +390,14 @@ static size_t ii_iovec_single_seg_count(const struct iov_iter *i)
return min(i->count, iov->iov_len - i->iov_offset);
}

+static int ii_iovec_shorten(struct iov_iter *i, size_t count)
+{
+ struct iovec *iov = (struct iovec *)i->data;
+ i->nr_segs = iov_shorten(iov, i->nr_segs, count);
+ i->count = min(i->count, count);
+ return 0;
+}
+
struct iov_iter_ops ii_iovec_ops = {
.ii_copy_to_user_atomic = ii_iovec_copy_to_user_atomic,
.ii_copy_to_user = ii_iovec_copy_to_user,
@@ -392,5 +406,6 @@ struct iov_iter_ops ii_iovec_ops = {
.ii_advance = ii_iovec_advance,
.ii_fault_in_readable = ii_iovec_fault_in_readable,
.ii_single_seg_count = ii_iovec_single_seg_count,
+ .ii_shorten = ii_iovec_shorten,
};
EXPORT_SYMBOL(ii_iovec_ops);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2cd17c..a89bcb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -309,6 +309,7 @@ struct iov_iter_ops {
void (*ii_advance)(struct iov_iter *, size_t);
int (*ii_fault_in_readable)(struct iov_iter *, size_t);
size_t (*ii_single_seg_count)(const struct iov_iter *);
+ int (*ii_shorten)(struct iov_iter *, size_t);
};

static inline size_t iov_iter_copy_to_user_atomic(struct page *page,
@@ -348,6 +349,10 @@ static inline size_t iov_iter_single_seg_count(const struct iov_iter *i)
{
return i->ops->ii_single_seg_count(i);
}
+static inline int iov_iter_shorten(struct iov_iter *i, size_t count)
+{
+ return i->ops->ii_shorten(i, count);
+}

#ifdef CONFIG_BLOCK
extern struct iov_iter_ops ii_bvec_ops;
--
1.8.3.4

2013-07-25 18:07:54

by Dave Kleikamp

Subject: [PATCH V8 23/33] nfs: add support for read_iter, write_iter

This patch implements the read_iter and write_iter file operations,
which allow kernel code to initiate direct IO. This allows the loop
device to read and write directly to the server, bypassing the page
cache.
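
The scheduling functions dispatch on the flavor of the iterator; the
shape of the split (see nfs_direct_read_schedule() in the diff below,
CONFIG_BLOCK guards omitted) is:

        if (iov_iter_has_iovec(iter))
                result = nfs_direct_do_schedule_read_iovec(&desc,
                                iov_iter_iovec(iter), iter->nr_segs,
                                pos, uio);
        else if (iov_iter_has_bvec(iter))
                result = nfs_direct_do_schedule_read_bvec(&desc,
                                iov_iter_bvec(iter), iter->nr_segs, pos);
        else
                BUG();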

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: [email protected]
---
fs/nfs/direct.c | 247 +++++++++++++++++++++++++++++++++++++------------
fs/nfs/file.c | 33 ++++---
fs/nfs/internal.h | 4 +-
fs/nfs/nfs4file.c | 4 +-
include/linux/nfs_fs.h | 6 +-
5 files changed, 210 insertions(+), 84 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index bceb47e..2b0ebcb 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -90,6 +90,7 @@ struct nfs_direct_req {
int flags;
#define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */
#define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
+#define NFS_ODIRECT_MARK_DIRTY (4) /* mark read pages dirty */
struct nfs_writeverf verf; /* unstable write verifier */
};

@@ -131,15 +132,13 @@ ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,

return -EINVAL;
#else
- const struct iovec *iov = iov_iter_iovec(iter);
-
VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);

if (rw == READ || rw == KERNEL_READ)
- return nfs_file_direct_read(iocb, iov, iter->nr_segs, pos,
+ return nfs_file_direct_read(iocb, iter, pos,
rw == READ ? true : false);
- return nfs_file_direct_write(iocb, iov, iter->nr_segs, pos,
+ return nfs_file_direct_write(iocb, iter, pos,
rw == WRITE ? true : false);
#endif /* CONFIG_NFS_SWAP */
}
@@ -269,7 +268,8 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
struct nfs_page *req = nfs_list_entry(hdr->pages.next);
struct page *page = req->wb_page;

- if (!PageCompound(page) && bytes < hdr->good_bytes)
+ if ((dreq->flags & NFS_ODIRECT_MARK_DIRTY) &&
+ !PageCompound(page) && bytes < hdr->good_bytes)
set_page_dirty(page);
bytes += req->wb_bytes;
nfs_list_remove_request(req);
@@ -401,24 +401,17 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *de
return result < 0 ? (ssize_t) result : -EFAULT;
}

-static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
- const struct iovec *iov,
- unsigned long nr_segs,
- loff_t pos, bool uio)
+static ssize_t nfs_direct_do_schedule_read_iovec(
+ struct nfs_pageio_descriptor *desc, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos, bool uio)
{
- struct nfs_pageio_descriptor desc;
ssize_t result = -EINVAL;
size_t requested_bytes = 0;
unsigned long seg;

- NFS_PROTO(dreq->inode)->read_pageio_init(&desc, dreq->inode,
- &nfs_direct_read_completion_ops);
- get_dreq(dreq);
- desc.pg_dreq = dreq;
-
for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
- result = nfs_direct_read_schedule_segment(&desc, vec, pos, uio);
+ result = nfs_direct_read_schedule_segment(desc, vec, pos, uio);
if (result < 0)
break;
requested_bytes += result;
@@ -426,6 +419,78 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
break;
pos += vec->iov_len;
}
+ if (requested_bytes)
+ return requested_bytes;
+
+ return result < 0 ? result : -EIO;
+}
+
+#ifdef CONFIG_BLOCK
+static ssize_t nfs_direct_do_schedule_read_bvec(
+ struct nfs_pageio_descriptor *desc,
+ struct bio_vec *bvec, unsigned long nr_segs, loff_t pos)
+{
+ struct nfs_direct_req *dreq = desc->pg_dreq;
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = ctx->dentry->d_inode;
+ ssize_t result = -EINVAL;
+ size_t requested_bytes = 0;
+ unsigned long seg;
+ struct nfs_page *req;
+ unsigned int req_len;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ result = -EIO;
+ req_len = bvec[seg].bv_len;
+ req = nfs_create_request(ctx, inode,
+ bvec[seg].bv_page,
+ bvec[seg].bv_offset, req_len);
+ if (IS_ERR(req)) {
+ result = PTR_ERR(req);
+ break;
+ }
+ req->wb_index = pos >> PAGE_SHIFT;
+ req->wb_offset = pos & ~PAGE_MASK;
+ if (!nfs_pageio_add_request(desc, req)) {
+ result = desc->pg_error;
+ nfs_release_request(req);
+ break;
+ }
+ requested_bytes += req_len;
+ pos += req_len;
+ }
+
+ if (requested_bytes)
+ return requested_bytes;
+
+ return result < 0 ? result : -EIO;
+}
+#endif /* CONFIG_BLOCK */
+
+static ssize_t nfs_direct_read_schedule(struct nfs_direct_req *dreq,
+ struct iov_iter *iter, loff_t pos,
+ bool uio)
+{
+ struct nfs_pageio_descriptor desc;
+ ssize_t result;
+
+ NFS_PROTO(dreq->inode)->read_pageio_init(&desc, dreq->inode,
+ &nfs_direct_read_completion_ops);
+ get_dreq(dreq);
+ desc.pg_dreq = dreq;
+
+ if (iov_iter_has_iovec(iter)) {
+ if (uio)
+ dreq->flags = NFS_ODIRECT_MARK_DIRTY;
+ result = nfs_direct_do_schedule_read_iovec(&desc,
+ iov_iter_iovec(iter), iter->nr_segs, pos, uio);
+#ifdef CONFIG_BLOCK
+ } else if (iov_iter_has_bvec(iter)) {
+ result = nfs_direct_do_schedule_read_bvec(&desc,
+ iov_iter_bvec(iter), iter->nr_segs, pos);
+#endif
+ } else
+ BUG();

nfs_pageio_complete(&desc);

@@ -433,9 +498,9 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
* If no bytes were started, return the error, and let the
* generic layer handle the completion.
*/
- if (requested_bytes == 0) {
+ if (result < 0) {
nfs_direct_req_release(dreq);
- return result < 0 ? result : -EIO;
+ return result;
}

if (put_dreq(dreq))
@@ -443,8 +508,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
return 0;
}

-static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, bool uio)
+static ssize_t nfs_direct_read(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, bool uio)
{
ssize_t result = -ENOMEM;
struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -456,7 +521,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
goto out;

dreq->inode = inode;
- dreq->bytes_left = iov_length(iov, nr_segs);
+ dreq->bytes_left = iov_iter_count(iter);
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
@@ -467,8 +532,8 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;

- NFS_I(inode)->read_io += iov_length(iov, nr_segs);
- result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos, uio);
+ NFS_I(inode)->read_io += iov_iter_count(iter);
+ result = nfs_direct_read_schedule(dreq, iter, pos, uio);
if (!result)
result = nfs_direct_wait(dreq);
out_release:
@@ -802,27 +867,18 @@ static const struct nfs_pgio_completion_ops nfs_direct_write_completion_ops = {
.completion = nfs_direct_write_completion,
};

-static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
- const struct iovec *iov,
- unsigned long nr_segs,
- loff_t pos, bool uio)
+static ssize_t nfs_direct_do_schedule_write_iovec(
+ struct nfs_pageio_descriptor *desc, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos, bool uio)
{
- struct nfs_pageio_descriptor desc;
- struct inode *inode = dreq->inode;
- ssize_t result = 0;
+ ssize_t result = -EINVAL;
size_t requested_bytes = 0;
unsigned long seg;

- NFS_PROTO(inode)->write_pageio_init(&desc, inode, FLUSH_COND_STABLE,
- &nfs_direct_write_completion_ops);
- desc.pg_dreq = dreq;
- get_dreq(dreq);
- atomic_inc(&inode->i_dio_count);
-
- NFS_I(dreq->inode)->write_io += iov_length(iov, nr_segs);
for (seg = 0; seg < nr_segs; seg++) {
const struct iovec *vec = &iov[seg];
- result = nfs_direct_write_schedule_segment(&desc, vec, pos, uio);
+ result = nfs_direct_write_schedule_segment(desc, vec,
+ pos, uio);
if (result < 0)
break;
requested_bytes += result;
@@ -830,16 +886,92 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
break;
pos += vec->iov_len;
}
+
+ if (requested_bytes)
+ return requested_bytes;
+
+ return result < 0 ? result : -EIO;
+}
+
+#ifdef CONFIG_BLOCK
+static ssize_t nfs_direct_do_schedule_write_bvec(
+ struct nfs_pageio_descriptor *desc,
+ struct bio_vec *bvec, unsigned long nr_segs, loff_t pos)
+{
+ struct nfs_direct_req *dreq = desc->pg_dreq;
+ struct nfs_open_context *ctx = dreq->ctx;
+ struct inode *inode = dreq->inode;
+ ssize_t result = 0;
+ size_t requested_bytes = 0;
+ unsigned long seg;
+ struct nfs_page *req;
+ unsigned int req_len;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ req_len = bvec[seg].bv_len;
+
+ req = nfs_create_request(ctx, inode, bvec[seg].bv_page,
+ bvec[seg].bv_offset, req_len);
+ if (IS_ERR(req)) {
+ result = PTR_ERR(req);
+ break;
+ }
+ nfs_lock_request(req);
+ req->wb_index = pos >> PAGE_SHIFT;
+ req->wb_offset = pos & ~PAGE_MASK;
+ if (!nfs_pageio_add_request(desc, req)) {
+ result = desc->pg_error;
+ nfs_unlock_and_release_request(req);
+ break;
+ }
+ requested_bytes += req_len;
+ pos += req_len;
+ }
+
+ if (requested_bytes)
+ return requested_bytes;
+
+ return result < 0 ? result : -EIO;
+}
+#endif /* CONFIG_BLOCK */
+
+static ssize_t nfs_direct_write_schedule(struct nfs_direct_req *dreq,
+ struct iov_iter *iter, loff_t pos,
+ bool uio)
+{
+ struct nfs_pageio_descriptor desc;
+ struct inode *inode = dreq->inode;
+ ssize_t result = 0;
+
+ NFS_PROTO(inode)->write_pageio_init(&desc, inode, FLUSH_COND_STABLE,
+ &nfs_direct_write_completion_ops);
+ desc.pg_dreq = dreq;
+ get_dreq(dreq);
+ atomic_inc(&inode->i_dio_count);
+
+ NFS_I(dreq->inode)->write_io += iov_iter_count(iter);
+
+ if (iov_iter_has_iovec(iter)) {
+ result = nfs_direct_do_schedule_write_iovec(&desc,
+ iov_iter_iovec(iter), iter->nr_segs, pos, uio);
+#ifdef CONFIG_BLOCK
+ } else if (iov_iter_has_bvec(iter)) {
+ result = nfs_direct_do_schedule_write_bvec(&desc,
+ iov_iter_bvec(iter), iter->nr_segs, pos);
+#endif
+ } else
+ BUG();
+
nfs_pageio_complete(&desc);

/*
* If no bytes were started, return the error, and let the
* generic layer handle the completion.
*/
- if (requested_bytes == 0) {
+ if (result < 0) {
inode_dio_done(inode);
nfs_direct_req_release(dreq);
- return result < 0 ? result : -EIO;
+ return result;
}

if (put_dreq(dreq))
@@ -847,9 +979,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
return 0;
}

-static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos,
- size_t count, bool uio)
+static ssize_t nfs_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, bool uio)
{
ssize_t result = -ENOMEM;
struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -861,7 +992,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
goto out;

dreq->inode = inode;
- dreq->bytes_left = count;
+ dreq->bytes_left = iov_iter_count(iter);
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
@@ -872,7 +1003,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
if (!is_sync_kiocb(iocb))
dreq->iocb = iocb;

- result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
+ result = nfs_direct_write_schedule(dreq, iter, pos, uio);
if (!result)
result = nfs_direct_wait(dreq);
out_release:
@@ -884,12 +1015,11 @@ out:
/**
* nfs_file_direct_read - file direct read operation for NFS files
* @iocb: target I/O control block
- * @iov: vector of user buffers into which to read data
- * @nr_segs: size of iov vector
+ * @iter: vector of buffers into which to read data
* @pos: byte offset in file where reading starts
*
* We use this function for direct reads instead of calling
- * generic_file_aio_read() in order to avoid gfar's check to see if
+ * generic_file_read_iter() in order to avoid gfar's check to see if
* the request starts before the end of the file. For that check
* to work, we must generate a GETATTR before each direct read, and
* even then there is a window between the GETATTR and the subsequent
@@ -902,15 +1032,15 @@ out:
* client must read the updated atime from the server back into its
* cache.
*/
-ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, bool uio)
+ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, bool uio)
{
ssize_t retval = -EINVAL;
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
size_t count;

- count = iov_length(iov, nr_segs);
+ count = iov_iter_count(iter);
nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count);

dfprintk(FILE, "NFS: direct read(%s/%s, %zd@%Ld)\n",
@@ -928,7 +1058,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,

task_io_account_read(count);

- retval = nfs_direct_read(iocb, iov, nr_segs, pos, uio);
+ retval = nfs_direct_read(iocb, iter, pos, uio);
if (retval > 0)
iocb->ki_pos = pos + retval;

@@ -939,12 +1069,11 @@ out:
/**
* nfs_file_direct_write - file direct write operation for NFS files
* @iocb: target I/O control block
- * @iov: vector of user buffers from which to write data
- * @nr_segs: size of iov vector
+ * @iter: vector of buffers from which to write data
* @pos: byte offset in file where writing starts
*
* We use this function for direct writes instead of calling
- * generic_file_aio_write() in order to avoid taking the inode
+ * generic_file_write_iter() in order to avoid taking the inode
* semaphore and updating the i_size. The NFS server will set
* the new i_size and this client must read the updated size
* back into its cache. We let the server do generic write
@@ -958,15 +1087,15 @@ out:
* Note that O_APPEND is not supported for NFS direct writes, as there
* is no atomic O_APPEND write facility in the NFS protocol.
*/
-ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, bool uio)
+ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, bool uio)
{
ssize_t retval = -EINVAL;
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
size_t count;

- count = iov_length(iov, nr_segs);
+ count = iov_iter_count(iter);
nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count);

dfprintk(FILE, "NFS: direct write(%s/%s, %zd@%Ld)\n",
@@ -991,7 +1120,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,

task_io_account_write(count);

- retval = nfs_direct_write(iocb, iov, nr_segs, pos, count, uio);
+ retval = nfs_direct_write(iocb, iter, pos, uio);
if (retval > 0) {
struct inode *inode = mapping->host;

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 94e94bd..bbff2f9 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -172,29 +172,28 @@ nfs_file_flush(struct file *file, fl_owner_t id)
EXPORT_SYMBOL_GPL(nfs_file_flush);

ssize_t
-nfs_file_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+nfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct dentry * dentry = iocb->ki_filp->f_path.dentry;
struct inode * inode = dentry->d_inode;
ssize_t result;

if (iocb->ki_filp->f_flags & O_DIRECT)
- return nfs_file_direct_read(iocb, iov, nr_segs, pos, true);
+ return nfs_file_direct_read(iocb, iter, pos, true);

- dprintk("NFS: read(%s/%s, %lu@%lu)\n",
+ dprintk("NFS: read_iter(%s/%s, %lu@%lu)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
- (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos);
+ (unsigned long) iov_iter_count(iter), (unsigned long) pos);

result = nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
if (!result) {
- result = generic_file_aio_read(iocb, iov, nr_segs, pos);
+ result = generic_file_read_iter(iocb, iter, pos);
if (result > 0)
nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
}
return result;
}
-EXPORT_SYMBOL_GPL(nfs_file_read);
+EXPORT_SYMBOL_GPL(nfs_file_read_iter);

ssize_t
nfs_file_splice_read(struct file *filp, loff_t *ppos,
@@ -250,7 +249,7 @@ EXPORT_SYMBOL_GPL(nfs_file_mmap);
* disk, but it retrieves and clears ctx->error after synching, despite
* the two being set at the same time in nfs_context_set_write_error().
* This is because the former is used to notify the _next_ call to
- * nfs_file_write() that a write error occurred, and hence cause it to
+ * nfs_file_write_iter() that a write error occurred, and hence cause it to
* fall back to doing a synchronous write.
*/
int
@@ -642,19 +641,19 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
return 0;
}

-ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+ssize_t nfs_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
struct dentry * dentry = iocb->ki_filp->f_path.dentry;
struct inode * inode = dentry->d_inode;
unsigned long written = 0;
ssize_t result;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);

if (iocb->ki_filp->f_flags & O_DIRECT)
- return nfs_file_direct_write(iocb, iov, nr_segs, pos, true);
+ return nfs_file_direct_write(iocb, iter, pos, true);

- dprintk("NFS: write(%s/%s, %lu@%Ld)\n",
+ dprintk("NFS: write_iter(%s/%s, %lu@%lld)\n",
dentry->d_parent->d_name.name, dentry->d_name.name,
(unsigned long) count, (long long) pos);

@@ -674,7 +673,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
if (!count)
goto out;

- result = generic_file_aio_write(iocb, iov, nr_segs, pos);
+ result = generic_file_write_iter(iocb, iter, pos);
if (result > 0)
written = result;

@@ -693,7 +692,7 @@ out_swapfile:
printk(KERN_INFO "NFS: attempt to write to active swap file!\n");
goto out;
}
-EXPORT_SYMBOL_GPL(nfs_file_write);
+EXPORT_SYMBOL_GPL(nfs_file_write_iter);

ssize_t nfs_file_splice_write(struct pipe_inode_info *pipe,
struct file *filp, loff_t *ppos,
@@ -953,8 +952,8 @@ const struct file_operations nfs_file_operations = {
.llseek = nfs_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = nfs_file_read,
- .aio_write = nfs_file_write,
+ .read_iter = nfs_file_read_iter,
+ .write_iter = nfs_file_write_iter,
.mmap = nfs_file_mmap,
.open = nfs_file_open,
.flush = nfs_file_flush,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 3c8373f..d689ca9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -286,11 +286,11 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
int nfs_file_fsync_commit(struct file *, loff_t, loff_t, int);
loff_t nfs_file_llseek(struct file *, loff_t, int);
int nfs_file_flush(struct file *, fl_owner_t);
-ssize_t nfs_file_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ssize_t nfs_file_read_iter(struct kiocb *, struct iov_iter *, loff_t);
ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
size_t, unsigned int);
int nfs_file_mmap(struct file *, struct vm_area_struct *);
-ssize_t nfs_file_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ssize_t nfs_file_write_iter(struct kiocb *, struct iov_iter *, loff_t);
int nfs_file_release(struct inode *, struct file *);
int nfs_lock(struct file *, int, struct file_lock *);
int nfs_flock(struct file *, int, struct file_lock *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index e5b804d..e13bb02 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -121,8 +121,8 @@ const struct file_operations nfs4_file_operations = {
.llseek = nfs_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = nfs_file_read,
- .aio_write = nfs_file_write,
+ .read_iter = nfs_file_read_iter,
+ .write_iter = nfs_file_write_iter,
.mmap = nfs_file_mmap,
.open = nfs4_file_open,
.flush = nfs_file_flush,
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a4b19d2..b2324be 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -458,11 +458,9 @@ extern int nfs3_removexattr (struct dentry *, const char *name);
* linux/fs/nfs/direct.c
*/
extern ssize_t nfs_direct_IO(int, struct kiocb *, struct iov_iter *, loff_t);
-extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
- const struct iovec *iov, unsigned long nr_segs,
+extern ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
loff_t pos, bool uio);
-extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
- const struct iovec *iov, unsigned long nr_segs,
+extern ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
loff_t pos, bool uio);

/*
--
1.8.3.4

2013-07-25 18:08:22

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 20/33] fs: add read_iter and write_iter to several file systems

These are the simple ones.

File systems that use generic_file_aio_read() and generic_file_aio_write()
can trivially support generic_file_read_iter() and generic_file_write_iter().

This patch adds the read_iter and write_iter file_operations for 9p, adfs,
affs, bfs, exofs, ext2, ext3, fat, f2fs, hfs, hfsplus, hostfs, hpfs, jfs,
jffs2, logfs, minix, nilfs2, omfs, ramfs, reiserfs, romfs, sysv, and ufs.
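
For file systems that already delegate to the generic aio helpers, the
conversion is purely mechanical, as the hunks below show. A minimal sketch
of the before/after pattern, using a hypothetical examplefs (the names are
illustrative, not part of this series):

#include <linux/fs.h>

/* Before: the synchronous entry points wrap the aio_* methods. */
const struct file_operations examplefs_file_ops_old = {
        .llseek    = generic_file_llseek,
        .read      = do_sync_read,
        .write     = do_sync_write,
        .aio_read  = generic_file_aio_read,
        .aio_write = generic_file_aio_write,
        .mmap      = generic_file_mmap,
};

/* After: read_iter/write_iter replace aio_read/aio_write. The .read
 * and .write entries can stay, since elsewhere in this series
 * do_sync_read()/do_sync_write() are taught to dispatch to the
 * iter-based methods when the aio_* ones are absent. */
const struct file_operations examplefs_file_ops = {
        .llseek     = generic_file_llseek,
        .read       = do_sync_read,
        .write      = do_sync_write,
        .read_iter  = generic_file_read_iter,
        .write_iter = generic_file_write_iter,
        .mmap       = generic_file_mmap,
};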

Signed-off-by: Dave Kleikamp <[email protected]>
Acked-by: Boaz Harrosh <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/9p/vfs_addr.c | 4 ++--
fs/9p/vfs_file.c | 8 ++++----
fs/adfs/file.c | 4 ++--
fs/affs/file.c | 4 ++--
fs/bfs/file.c | 4 ++--
fs/exofs/file.c | 4 ++--
fs/ext2/file.c | 4 ++--
fs/ext3/file.c | 4 ++--
fs/f2fs/file.c | 4 ++--
fs/fat/file.c | 4 ++--
fs/hfs/inode.c | 4 ++--
fs/hfsplus/inode.c | 4 ++--
fs/hostfs/hostfs_kern.c | 4 ++--
fs/hpfs/file.c | 4 ++--
fs/jffs2/file.c | 8 ++++----
fs/jfs/file.c | 4 ++--
fs/logfs/file.c | 4 ++--
fs/minix/file.c | 4 ++--
fs/nilfs2/file.c | 4 ++--
fs/omfs/file.c | 4 ++--
fs/ramfs/file-mmu.c | 4 ++--
fs/ramfs/file-nommu.c | 4 ++--
fs/reiserfs/file.c | 4 ++--
fs/romfs/mmap-nommu.c | 2 +-
fs/sysv/file.c | 4 ++--
fs/ufs/file.c | 4 ++--
26 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 5581415..da0821b 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -251,8 +251,8 @@ static int v9fs_launder_page(struct page *page)
* the VFS gets them, so this method should never be called.
*
* Direct IO is not 'yet' supported in the cached mode. Hence when
- * this routine is called through generic_file_aio_read(), the read/write fails
- * with an error.
+ * this routine is called through generic_file_read_iter(), the read/write
+ * fails with an error.
*
*/
static ssize_t
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index d384a8b..18d0293 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -743,8 +743,8 @@ const struct file_operations v9fs_cached_file_operations = {
.llseek = generic_file_llseek,
.read = v9fs_cached_file_read,
.write = v9fs_cached_file_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.open = v9fs_file_open,
.release = v9fs_dir_release,
.lock = v9fs_file_lock,
@@ -756,8 +756,8 @@ const struct file_operations v9fs_cached_file_operations_dotl = {
.llseek = generic_file_llseek,
.read = v9fs_cached_file_read,
.write = v9fs_cached_file_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.open = v9fs_file_open,
.release = v9fs_dir_release,
.lock = v9fs_file_lock_dotl,
diff --git a/fs/adfs/file.c b/fs/adfs/file.c
index a36da53..da1e021 100644
--- a/fs/adfs/file.c
+++ b/fs/adfs/file.c
@@ -24,11 +24,11 @@
const struct file_operations adfs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.mmap = generic_file_mmap,
.fsync = generic_file_fsync,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.splice_read = generic_file_splice_read,
};

diff --git a/fs/affs/file.c b/fs/affs/file.c
index af3261b..d09a2db 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -28,9 +28,9 @@ static int affs_file_release(struct inode *inode, struct file *filp);
const struct file_operations affs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.open = affs_file_open,
.release = affs_file_release,
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index ad3ea14..3d14806 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -24,9 +24,9 @@
const struct file_operations bfs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.splice_read = generic_file_splice_read,
};
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index 491c6c0..20564f8a 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -69,8 +69,8 @@ const struct file_operations exofs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.open = generic_file_open,
.release = exofs_release_file,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a5b3a5d..6af043b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -64,8 +64,8 @@ const struct file_operations ext2_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 25cb413..a796771 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,8 +52,8 @@ const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext3_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d2d2b7d..d498bad 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -678,8 +678,8 @@ const struct file_operations f2fs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.open = generic_file_open,
.mmap = f2fs_file_mmap,
.fsync = f2fs_sync_file,
diff --git a/fs/fat/file.c b/fs/fat/file.c
index 9b104f5..33711ff 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -172,8 +172,8 @@ const struct file_operations fat_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.release = fat_file_release,
.unlocked_ioctl = fat_generic_ioctl,
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 62440ce..f9242b8 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -674,9 +674,9 @@ static int hfs_file_fsync(struct file *filp, loff_t start, loff_t end,
static const struct file_operations hfs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.splice_read = generic_file_splice_read,
.fsync = hfs_file_fsync,
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index fca99cc..13813f6 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -388,9 +388,9 @@ static const struct inode_operations hfsplus_file_inode_operations = {
static const struct file_operations hfsplus_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.splice_read = generic_file_splice_read,
.fsync = hfsplus_file_fsync,
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index cddb052..e3adc8e 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -381,8 +381,8 @@ static const struct file_operations hostfs_file_fops = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.splice_read = generic_file_splice_read,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.write = do_sync_write,
.mmap = generic_file_mmap,
.open = hostfs_file_open,
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 4e9dabc..2561eba 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -198,9 +198,9 @@ const struct file_operations hpfs_file_ops =
{
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.release = hpfs_file_release,
.fsync = hpfs_file_fsync,
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 1506673..1d7ab8b 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -51,10 +51,10 @@ const struct file_operations jffs2_file_operations =
{
.llseek = generic_file_llseek,
.open = generic_file_open,
- .read = do_sync_read,
- .aio_read = generic_file_aio_read,
- .write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .read = do_sync_read,
+ .read_iter = generic_file_read_iter,
+ .write = do_sync_write,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl=jffs2_ioctl,
.mmap = generic_file_readonly_mmap,
.fsync = jffs2_fsync,
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index dd7442c..040b6c7 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -151,8 +151,8 @@ const struct file_operations jfs_file_operations = {
.llseek = generic_file_llseek,
.write = do_sync_write,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
diff --git a/fs/logfs/file.c b/fs/logfs/file.c
index 57914fc..57f994e 100644
--- a/fs/logfs/file.c
+++ b/fs/logfs/file.c
@@ -264,8 +264,8 @@ const struct inode_operations logfs_reg_iops = {
};

const struct file_operations logfs_reg_fops = {
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.fsync = logfs_fsync,
.unlocked_ioctl = logfs_ioctl,
.llseek = generic_file_llseek,
diff --git a/fs/minix/file.c b/fs/minix/file.c
index adc6f54..346d8f37 100644
--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -15,9 +15,9 @@
const struct file_operations minix_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.fsync = generic_file_fsync,
.splice_read = generic_file_splice_read,
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77..7aeb8ee 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -153,8 +153,8 @@ const struct file_operations nilfs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = nilfs_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = nilfs_compat_ioctl,
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index e0d9b3e..badafd8 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -339,8 +339,8 @@ const struct file_operations omfs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.fsync = generic_file_fsync,
.splice_read = generic_file_splice_read,
diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5..c4d8572 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -39,9 +39,9 @@ const struct address_space_operations ramfs_aops = {

const struct file_operations ramfs_file_operations = {
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.fsync = noop_fsync,
.splice_read = generic_file_splice_read,
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index 8d5b438..f2487c3 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -39,9 +39,9 @@ const struct file_operations ramfs_file_operations = {
.mmap = ramfs_nommu_mmap,
.get_unmapped_area = ramfs_nommu_get_unmapped_area,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.fsync = noop_fsync,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
diff --git a/fs/reiserfs/file.c b/fs/reiserfs/file.c
index dcaafcf..f98feb2 100644
--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -245,8 +245,8 @@ const struct file_operations reiserfs_file_operations = {
.open = reiserfs_file_open,
.release = reiserfs_file_release,
.fsync = reiserfs_sync_file,
- .aio_read = generic_file_aio_read,
- .aio_write = generic_file_aio_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
.llseek = generic_file_llseek,
diff --git a/fs/romfs/mmap-nommu.c b/fs/romfs/mmap-nommu.c
index f373bde..f8a9e2b 100644
--- a/fs/romfs/mmap-nommu.c
+++ b/fs/romfs/mmap-nommu.c
@@ -73,7 +73,7 @@ static int romfs_mmap(struct file *file, struct vm_area_struct *vma)
const struct file_operations romfs_ro_fops = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.splice_read = generic_file_splice_read,
.mmap = romfs_mmap,
.get_unmapped_area = romfs_get_unmapped_area,
diff --git a/fs/sysv/file.c b/fs/sysv/file.c
index 9d4dc68..ff4b363 100644
--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -22,9 +22,9 @@
const struct file_operations sysv_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.fsync = generic_file_fsync,
.splice_read = generic_file_splice_read,
diff --git a/fs/ufs/file.c b/fs/ufs/file.c
index 33afa20..e155e4c 100644
--- a/fs/ufs/file.c
+++ b/fs/ufs/file.c
@@ -36,9 +36,9 @@
const struct file_operations ufs_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
- .aio_read = generic_file_aio_read,
+ .read_iter = generic_file_read_iter,
.write = do_sync_write,
- .aio_write = generic_file_aio_write,
+ .write_iter = generic_file_write_iter,
.mmap = generic_file_mmap,
.open = generic_file_open,
.fsync = generic_file_fsync,
--
1.8.3.4

2013-07-25 18:08:54

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 11/33] dio: Convert direct_IO to use iov_iter

Change the direct_IO aop to take an iov_iter argument rather than an iovec.
The iov_iter is passed down through most file systems so that only the
__blockdev_direct_IO() helper need be aware of whether user or kernel memory
is being passed to the function.
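
The shape of the per-filesystem conversion is the same everywhere: the
iov/nr_segs pair becomes a single iov_iter, and length computations switch
from iov_length() to iov_iter_count(). A minimal sketch for a hypothetical
examplefs, modeled on the ext2 hunk below (examplefs_get_block and
examplefs_write_failed are assumed helpers, not part of this series):

#include <linux/fs.h>
#include <linux/buffer_head.h>

/* Assumed to be defined elsewhere in the filesystem. */
int examplefs_get_block(struct inode *, sector_t, struct buffer_head *, int);
void examplefs_write_failed(struct address_space *, loff_t);

static ssize_t
examplefs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
                    loff_t offset)
{
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        ssize_t ret;

        /* The helper takes the iov_iter directly; only
         * __blockdev_direct_IO() has to care whether the iter wraps
         * an iovec or a bio_vec. */
        ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
                                 examplefs_get_block);

        /* A failed extending write may have instantiated blocks past
         * i_size; the request length now comes from the iter. */
        if (ret < 0 && (rw & WRITE))
                examplefs_write_failed(mapping,
                                       offset + iov_iter_count(iter));
        return ret;
}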

Signed-off-by: Dave Kleikamp <[email protected]>
---
Documentation/filesystems/Locking | 4 +--
Documentation/filesystems/vfs.txt | 4 +--
fs/9p/vfs_addr.c | 8 ++---
fs/block_dev.c | 8 ++---
fs/btrfs/inode.c | 63 +++++++++++++++++++++++----------------
fs/ceph/addr.c | 3 +-
fs/direct-io.c | 19 ++++++------
fs/ext2/inode.c | 8 ++---
fs/ext3/inode.c | 15 ++++------
fs/ext4/ext4.h | 3 +-
fs/ext4/indirect.c | 16 +++++-----
fs/ext4/inode.c | 23 +++++++-------
fs/f2fs/data.c | 4 +--
fs/fat/inode.c | 10 +++----
fs/fuse/cuse.c | 10 +++++--
fs/fuse/file.c | 56 +++++++++++++++++-----------------
fs/fuse/fuse_i.h | 5 ++--
fs/gfs2/aops.c | 7 ++---
fs/hfs/inode.c | 7 ++---
fs/hfsplus/inode.c | 6 ++--
fs/jfs/inode.c | 7 ++---
fs/nfs/direct.c | 13 ++++----
fs/nilfs2/inode.c | 8 ++---
fs/ocfs2/aops.c | 8 ++---
fs/reiserfs/inode.c | 7 ++---
fs/udf/file.c | 3 +-
fs/udf/inode.c | 10 +++----
fs/xfs/xfs_aops.c | 13 ++++----
include/linux/fs.h | 18 +++++------
include/linux/nfs_fs.h | 3 +-
mm/filemap.c | 13 ++++++--
mm/page_io.c | 8 +++--
32 files changed, 196 insertions(+), 194 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index fe7afe2..ff1e311 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -192,8 +192,8 @@ prototypes:
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
- int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs);
+ int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter,
+ loff_t offset);
int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f93a882..461bee1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -573,8 +573,8 @@ struct address_space_operations {
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
- ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs);
+ ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter,
+ loff_t offset);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
/* migrate the contents of a page to the specified target */
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 9ff073f..5581415 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -241,9 +241,8 @@ static int v9fs_launder_page(struct page *page)
* v9fs_direct_IO - 9P address space operation for direct I/O
* @rw: direction (read or write)
* @iocb: target I/O control block
- * @iov: array of vectors that define I/O buffer
+ * @iter: array of vectors that define I/O buffer
* @pos: offset in file to begin the operation
- * @nr_segs: size of iovec array
*
* The presence of v9fs_direct_IO() in the address space ops vector
* allowes open() O_DIRECT flags which would have failed otherwise.
@@ -257,8 +256,7 @@ static int v9fs_launder_page(struct page *page)
*
*/
static ssize_t
-v9fs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
- loff_t pos, unsigned long nr_segs)
+v9fs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
/*
* FIXME
@@ -267,7 +265,7 @@ v9fs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
*/
p9_debug(P9_DEBUG_VFS, "v9fs_direct_IO: v9fs_direct_IO (%s) off/no(%lld/%lu) EINVAL\n",
iocb->ki_filp->f_path.dentry->d_name.name,
- (long long)pos, nr_segs);
+ (long long)pos, iter->nr_segs);

return -EINVAL;
}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..6f8c9e4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -165,14 +165,14 @@ blkdev_get_block(struct inode *inode, sector_t iblock,
}

static ssize_t
-blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+blkdev_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
+ loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;

- return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov, offset,
- nr_segs, blkdev_get_block, NULL, NULL, 0);
+ return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iter,
+ offset, blkdev_get_block, NULL, NULL, 0);
}

int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6d1b93c..fe59386 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7194,8 +7194,7 @@ free_ordered:
}

static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
int seg;
int i;
@@ -7209,35 +7208,50 @@ static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *io
goto out;

/* Check the memory alignment. Blocks cannot straddle pages */
- for (seg = 0; seg < nr_segs; seg++) {
- addr = (unsigned long)iov[seg].iov_base;
- size = iov[seg].iov_len;
- end += size;
- if ((addr & blocksize_mask) || (size & blocksize_mask))
- goto out;
+ if (iov_iter_has_iovec(iter)) {
+ const struct iovec *iov = iov_iter_iovec(iter);
+
+ for (seg = 0; seg < iter->nr_segs; seg++) {
+ addr = (unsigned long)iov[seg].iov_base;
+ size = iov[seg].iov_len;
+ end += size;
+ if ((addr & blocksize_mask) || (size & blocksize_mask))
+ goto out;

- /* If this is a write we don't need to check anymore */
- if (rw & WRITE)
- continue;
+ /* If this is a write we don't need to check anymore */
+ if (rw & WRITE)
+ continue;

- /*
- * Check to make sure we don't have duplicate iov_base's in this
- * iovec, if so return EINVAL, otherwise we'll get csum errors
- * when reading back.
- */
- for (i = seg + 1; i < nr_segs; i++) {
- if (iov[seg].iov_base == iov[i].iov_base)
+ /*
+ * Check to make sure we don't have duplicate iov_base's
+ * in this iovec, if so return EINVAL, otherwise we'll
+ * get csum errors when reading back.
+ */
+ for (i = seg + 1; i < iter->nr_segs; i++) {
+ if (iov[seg].iov_base == iov[i].iov_base)
+ goto out;
+ }
+ }
+ } else if (iov_iter_has_bvec(iter)) {
+ struct bio_vec *bvec = iov_iter_bvec(iter);
+
+ for (seg = 0; seg < iter->nr_segs; seg++) {
+ addr = (unsigned long)bvec[seg].bv_offset;
+ size = bvec[seg].bv_len;
+ end += size;
+ if ((addr & blocksize_mask) || (size & blocksize_mask))
goto out;
}
- }
+ } else
+ BUG();
+
retval = 0;
out:
return retval;
}

static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -7247,8 +7261,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
bool relock = false;
ssize_t ret;

- if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iov,
- offset, nr_segs))
+ if (check_direct_IO(BTRFS_I(inode)->root, rw, iocb, iter, offset))
return 0;

atomic_inc(&inode->i_dio_count);
@@ -7260,7 +7273,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
* call btrfs_wait_ordered_range to make absolutely sure that any
* outstanding dirty pages are on disk.
*/
- count = iov_length(iov, nr_segs);
+ count = iov_iter_count(iter);
btrfs_wait_ordered_range(inode, offset, count);

if (rw & WRITE) {
@@ -7285,7 +7298,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,

ret = __blockdev_direct_IO(rw, iocb, inode,
BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev,
- iov, offset, nr_segs, btrfs_get_blocks_direct, NULL,
+ iter, offset, btrfs_get_blocks_direct, NULL,
btrfs_submit_direct, flags);
if (rw & WRITE) {
if (ret < 0 && ret != -EIOCBQUEUED)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 5318a3b..dd6dc25 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1169,8 +1169,7 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
* never get called.
*/
static ssize_t ceph_direct_io(int rw, struct kiocb *iocb,
- const struct iovec *iov,
- loff_t pos, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t pos)
{
WARN_ON(1);
return -EINVAL;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..a81366c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1043,9 +1043,9 @@ static inline int drop_refcount(struct dio *dio)
*/
static inline ssize_t
do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
- struct block_device *bdev, const struct iovec *iov, loff_t offset,
- unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
- dio_submit_t submit_io, int flags)
+ struct block_device *bdev, struct iov_iter *iter, loff_t offset,
+ get_block_t get_block, dio_iodone_t end_io, dio_submit_t submit_io,
+ int flags)
{
int seg;
size_t size;
@@ -1061,6 +1061,8 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
size_t bytes;
struct buffer_head map_bh = { 0, };
struct blk_plug plug;
+ const struct iovec *iov = iov_iter_iovec(iter);
+ unsigned long nr_segs = iter->nr_segs;

if (rw & WRITE)
rw = WRITE_ODIRECT;
@@ -1279,9 +1281,9 @@ out:

ssize_t
__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
- struct block_device *bdev, const struct iovec *iov, loff_t offset,
- unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
- dio_submit_t submit_io, int flags)
+ struct block_device *bdev, struct iov_iter *iter, loff_t offset,
+ get_block_t get_block, dio_iodone_t end_io, dio_submit_t submit_io,
+ int flags)
{
/*
* The block device state is needed in the end to finally
@@ -1295,9 +1297,8 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);

- return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
- nr_segs, get_block, end_io,
- submit_io, flags);
+ return do_blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
+ get_block, end_io, submit_io, flags);
}

EXPORT_SYMBOL(__blockdev_direct_IO);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0a87bb1..e3e8e3b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -848,18 +848,16 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
}

static ssize_t
-ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- ext2_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
if (ret < 0 && (rw & WRITE))
- ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
+ ext2_write_failed(mapping, offset + iov_iter_count(iter));
return ret;
}

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2bd8548..85bd13b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1862,8 +1862,7 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
* VFS code falls back into buffered path in that case so we are safe.
*/
static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -1871,10 +1870,10 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
handle_t *handle;
ssize_t ret;
int orphan = 0;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);
int retries = 0;

- trace_ext3_direct_IO_enter(inode, offset, iov_length(iov, nr_segs), rw);
+ trace_ext3_direct_IO_enter(inode, offset, count, rw);

if (rw == WRITE) {
loff_t final_size = offset + count;
@@ -1898,15 +1897,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
}

retry:
- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- ext3_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext3_get_block);
/*
* In case of error extending write may have instantiated a few
* blocks outside i_size. Trim these off again.
*/
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + count;

if (end > isize)
ext3_truncate_failed_direct_write(inode);
@@ -1949,8 +1947,7 @@ retry:
ret = err;
}
out:
- trace_ext3_direct_IO_exit(inode, offset,
- iov_length(iov, nr_segs), rw, ret);
+ trace_ext3_direct_IO_exit(inode, offset, count, rw, ret);
return ret;
}

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..afa7741 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2111,8 +2111,7 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs);
+ struct iov_iter *iter, loff_t offset);
extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
extern void ext4_ind_truncate(handle_t *, struct inode *inode);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 87b30cd..b6eb453 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -640,8 +640,7 @@ out:
* VFS code falls back into buffered path in that case so we are safe.
*/
ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -649,7 +648,7 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
handle_t *handle;
ssize_t ret;
int orphan = 0;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);
int retries = 0;

if (rw == WRITE) {
@@ -688,18 +687,17 @@ retry:
goto locked;
}
ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL, NULL, 0);
+ inode->i_sb->s_bdev, iter,
+ offset, ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iov,
- offset, nr_segs, ext4_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter,
+ offset, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if (end > isize)
ext4_truncate_failed_write(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba33c67..1380108 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3043,13 +3043,12 @@ static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
*
*/
static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
ssize_t ret;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(iter);
int overwrite = 0;
get_block_t *get_block_func = NULL;
int dio_flags = 0;
@@ -3058,7 +3057,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,

/* Use the old path for reads and writes beyond i_size. */
if (rw != WRITE || final_size > inode->i_size)
- return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
+ return ext4_ind_direct_IO(rw, iocb, iter, offset);

BUG_ON(iocb->private == NULL);

@@ -3126,8 +3125,8 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
dio_flags = DIO_LOCKING;
}
ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
+ inode->i_sb->s_bdev, iter,
+ offset,
get_block_func,
ext4_end_io_dio,
NULL,
@@ -3188,8 +3187,7 @@ retake_lock:
}

static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -3205,13 +3203,12 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
if (ext4_has_inline_data(inode))
return 0;

- trace_ext4_direct_IO_enter(inode, offset, iov_length(iov, nr_segs), rw);
+ trace_ext4_direct_IO_enter(inode, offset, iov_iter_count(iter), rw);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- ret = ext4_ext_direct_IO(rw, iocb, iov, offset, nr_segs);
+ ret = ext4_ext_direct_IO(rw, iocb, iter, offset);
else
- ret = ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
- trace_ext4_direct_IO_exit(inode, offset,
- iov_length(iov, nr_segs), rw, ret);
+ ret = ext4_ind_direct_IO(rw, iocb, iter, offset);
+ trace_ext4_direct_IO_exit(inode, offset, iov_iter_count(iter), rw, ret);
return ret;
}

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 035f9a3..c40a0af 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -723,7 +723,7 @@ static int f2fs_write_end(struct file *file,
}

static ssize_t f2fs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -732,7 +732,7 @@ static ssize_t f2fs_direct_IO(int rw, struct kiocb *iocb,
return 0;

/* Needs synchronization with the cleaner */
- return blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ return blockdev_direct_IO(rw, iocb, inode, iter, offset,
get_data_block_ro);
}

diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 11b51bb..70a218d 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -185,8 +185,7 @@ static int fat_write_end(struct file *file, struct address_space *mapping,
}

static ssize_t fat_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -203,7 +202,7 @@ static ssize_t fat_direct_IO(int rw, struct kiocb *iocb,
*
* Return 0, and fallback to normal buffered write.
*/
- loff_t size = offset + iov_length(iov, nr_segs);
+ loff_t size = offset + iov_iter_count(iter);
if (MSDOS_I(inode)->mmu_private < size)
return 0;
}
@@ -212,10 +211,9 @@ static ssize_t fat_direct_IO(int rw, struct kiocb *iocb,
* FAT need to use the DIO_LOCKING for avoiding the race
* condition of fat_get_block() and ->truncate().
*/
- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- fat_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, fat_get_block);
if (ret < 0 && (rw & WRITE))
- fat_write_failed(mapping, offset + iov_length(iov, nr_segs));
+ fat_write_failed(mapping, offset + iov_iter_count(iter));

return ret;
}
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index aef34b1..014ccc5 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -94,8 +94,11 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
loff_t pos = 0;
struct iovec iov = { .iov_base = buf, .iov_len = count };
struct fuse_io_priv io = { .async = 0, .file = file };
+ struct iov_iter ii;

- return fuse_direct_io(&io, &iov, 1, count, &pos, 0);
+ iov_iter_init(&ii, &iov, 1, count, 0);
+
+ return fuse_direct_io(&io, &ii, count, &pos, 0);
}

static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -104,12 +107,15 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
loff_t pos = 0;
struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
struct fuse_io_priv io = { .async = 0, .file = file };
+ struct iov_iter ii;
+
+ iov_iter_init(&ii, &iov, 1, count, 0);

/*
* No locking or generic_write_checks(), the server is
* responsible for locking and sanity checks.
*/
- return fuse_direct_io(&io, &iov, 1, count, &pos, 1);
+ return fuse_direct_io(&io, &ii, count, &pos, 1);
}

static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77865d1..d429c01 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1264,9 +1264,8 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
return min(npages, FUSE_MAX_PAGES_PER_REQ);
}

-ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
- unsigned long nr_segs, size_t count, loff_t *ppos,
- int write)
+ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *ii,
+ size_t count, loff_t *ppos, int write)
{
struct file *file = io->file;
struct fuse_file *ff = file->private_data;
@@ -1275,14 +1274,11 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
loff_t pos = *ppos;
ssize_t res = 0;
struct fuse_req *req;
- struct iov_iter ii;
-
- iov_iter_init(&ii, iov, nr_segs, count, 0);

if (io->async)
- req = fuse_get_req_for_background(fc, fuse_iter_npages(&ii));
+ req = fuse_get_req_for_background(fc, fuse_iter_npages(ii));
else
- req = fuse_get_req(fc, fuse_iter_npages(&ii));
+ req = fuse_get_req(fc, fuse_iter_npages(ii));
if (IS_ERR(req))
return PTR_ERR(req);

@@ -1290,7 +1286,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
size_t nres;
fl_owner_t owner = current->files;
size_t nbytes = min(count, nmax);
- int err = fuse_get_user_pages(req, &ii, &nbytes, write);
+ int err = fuse_get_user_pages(req, ii, &nbytes, write);
if (err) {
res = err;
break;
@@ -1320,9 +1316,9 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
fuse_put_request(fc, req);
if (io->async)
req = fuse_get_req_for_background(fc,
- fuse_iter_npages(&ii));
+ fuse_iter_npages(ii));
else
- req = fuse_get_req(fc, fuse_iter_npages(&ii));
+ req = fuse_get_req(fc, fuse_iter_npages(ii));
if (IS_ERR(req))
break;
}
@@ -1336,10 +1332,8 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
}
EXPORT_SYMBOL_GPL(fuse_direct_io);

-static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
- const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos,
- size_t count)
+static ssize_t __fuse_direct_read(struct fuse_io_priv *io, struct iov_iter *ii,
+ loff_t *ppos, size_t count)
{
ssize_t res;
struct file *file = io->file;
@@ -1348,7 +1342,7 @@ static ssize_t __fuse_direct_read(struct fuse_io_priv *io,
if (is_bad_inode(inode))
return -EIO;

- res = fuse_direct_io(io, iov, nr_segs, count, ppos, 0);
+ res = fuse_direct_io(io, ii, count, ppos, 0);

fuse_invalidate_attr(inode);

@@ -1360,21 +1354,24 @@ static ssize_t fuse_direct_read(struct file *file, char __user *buf,
{
struct fuse_io_priv io = { .async = 0, .file = file };
struct iovec iov = { .iov_base = buf, .iov_len = count };
- return __fuse_direct_read(&io, &iov, 1, ppos, count);
+ struct iov_iter ii;
+
+ iov_iter_init(&ii, &iov, 1, count, 0);
+
+ return __fuse_direct_read(&io, &ii, ppos, count);
}

-static ssize_t __fuse_direct_write(struct fuse_io_priv *io,
- const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
+static ssize_t __fuse_direct_write(struct fuse_io_priv *io, struct iov_iter *ii,
+ loff_t *ppos)
{
struct file *file = io->file;
struct inode *inode = file_inode(file);
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(ii);
ssize_t res;

res = generic_write_checks(file, ppos, &count, 0);
if (!res)
- res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);
+ res = fuse_direct_io(io, ii, count, ppos, 1);

fuse_invalidate_attr(inode);

@@ -1385,6 +1382,7 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = count };
+ struct iov_iter ii;
struct inode *inode = file_inode(file);
ssize_t res;
struct fuse_io_priv io = { .async = 0, .file = file };
@@ -1392,9 +1390,11 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
if (is_bad_inode(inode))
return -EIO;

+ iov_iter_init(&ii, &iov, 1, count, 0);
+
/* Don't allow parallel writes to the same file */
mutex_lock(&inode->i_mutex);
- res = __fuse_direct_write(&io, &iov, 1, ppos);
+ res = __fuse_direct_write(&io, &ii, ppos);
if (res > 0)
fuse_write_update_size(inode, *ppos);
mutex_unlock(&inode->i_mutex);
@@ -2366,8 +2366,8 @@ static inline loff_t fuse_round_up(loff_t off)
}

static ssize_t
-fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+fuse_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *ii,
+ loff_t offset)
{
ssize_t ret = 0;
struct file *file = iocb->ki_filp;
@@ -2376,7 +2376,7 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t pos = 0;
struct inode *inode;
loff_t i_size;
- size_t count = iov_length(iov, nr_segs);
+ size_t count = iov_iter_count(ii);
struct fuse_io_priv *io;

pos = offset;
@@ -2417,9 +2417,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
io->async = false;

if (rw == WRITE)
- ret = __fuse_direct_write(io, iov, nr_segs, &pos);
+ ret = __fuse_direct_write(io, ii, &pos);
else
- ret = __fuse_direct_read(io, iov, nr_segs, &pos, count);
+ ret = __fuse_direct_read(io, ii, &pos, count);

if (io->async) {
fuse_aio_complete(io, ret < 0 ? ret : 0, -1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fde7249..dacffcb 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -854,9 +854,8 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,

int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
bool isdir);
-ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
- unsigned long nr_segs, size_t count, loff_t *ppos,
- int write);
+ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *ii,
+ size_t count, loff_t *ppos, int write);
long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
unsigned int flags);
long fuse_ioctl_common(struct file *file, unsigned int cmd,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ee48ad3..733e94a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1001,8 +1001,7 @@ static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset)


static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
@@ -1026,8 +1025,8 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
if (rv != 1)
goto out; /* dio not valid, fall back to buffered i/o */

- rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
- offset, nr_segs, gfs2_get_block_direct,
+ rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iter,
+ offset, gfs2_get_block_direct,
NULL, NULL, 0);
out:
gfs2_glock_dq(&gh);
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index f9299d8..62440ce 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -125,15 +125,14 @@ static int hfs_releasepage(struct page *page, gfp_t mask)
}

static ssize_t hfs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = file_inode(file)->i_mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- hfs_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, hfs_get_block);

/*
* In case of error extending write may have instantiated a few
@@ -141,7 +140,7 @@ static ssize_t hfs_direct_IO(int rw, struct kiocb *iocb,
*/
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if (end > isize)
hfs_write_failed(mapping, end);
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index f833d35..fca99cc 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -122,14 +122,14 @@ static int hfsplus_releasepage(struct page *page, gfp_t mask)
}

static ssize_t hfsplus_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = file_inode(file)->i_mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
hfsplus_get_block);

/*
@@ -138,7 +138,7 @@ static ssize_t hfsplus_direct_IO(int rw, struct kiocb *iocb,
*/
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if (end > isize)
hfsplus_write_failed(mapping, end);
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 730f24e..0a0453a 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -331,15 +331,14 @@ static sector_t jfs_bmap(struct address_space *mapping, sector_t block)
}

static ssize_t jfs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = file->f_mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- jfs_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, jfs_get_block);

/*
* In case of error extending write may have instantiated a few
@@ -347,7 +346,7 @@ static ssize_t jfs_direct_IO(int rw, struct kiocb *iocb,
*/
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if (end > isize)
jfs_write_failed(mapping, end);
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 0bd7a55..bceb47e 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -112,7 +112,7 @@ static inline int put_dreq(struct nfs_direct_req *dreq)
* nfs_direct_IO - NFS address space operation for direct I/O
* @rw: direction (read or write)
* @iocb: target I/O control block
- * @iov: array of vectors that define I/O buffer
+ * @iter: array of vectors that define I/O buffer
* @pos: offset in file to begin the operation
* @nr_segs: size of iovec array
*
@@ -121,22 +121,25 @@ static inline int put_dreq(struct nfs_direct_req *dreq)
* shunt off direct read and write requests before the VFS gets them,
* so this method is only ever called for swap.
*/
-ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t pos, unsigned long nr_segs)
+ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
{
#ifndef CONFIG_NFS_SWAP
dprintk("NFS: nfs_direct_IO (%s) off/no(%Ld/%lu) EINVAL\n",
iocb->ki_filp->f_path.dentry->d_name.name,
- (long long) pos, nr_segs);
+ (long long) pos, iter->nr_segs);

return -EINVAL;
#else
+ const struct iovec *iov = iov_iter_iovec(iter);
+
VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);

if (rw == READ || rw == KERNEL_READ)
- return nfs_file_direct_read(iocb, iov, nr_segs, pos,
+ return nfs_file_direct_read(iocb, iov, iter->nr_segs, pos,
rw == READ ? true : false);
- return nfs_file_direct_write(iocb, iov, nr_segs, pos,
+ return nfs_file_direct_write(iocb, iov, iter->nr_segs, pos,
rw == WRITE ? true : false);
#endif /* CONFIG_NFS_SWAP */
}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index b1a5277..059b760 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -298,8 +298,8 @@ static int nilfs_write_end(struct file *file, struct address_space *mapping,
}

static ssize_t
-nilfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+nilfs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
+ loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -310,7 +310,7 @@ nilfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
return 0;

/* Needs synchronization with the cleaner */
- size = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ size = blockdev_direct_IO(rw, iocb, inode, iter, offset,
nilfs_get_block);

/*
@@ -319,7 +319,7 @@ nilfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
*/
if (unlikely((rw & WRITE) && size < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if (end > isize)
nilfs_write_failed(mapping, end);
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 79736a2..b5217d9 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -622,9 +622,8 @@ static int ocfs2_releasepage(struct page *page, gfp_t wait)

static ssize_t ocfs2_direct_IO(int rw,
struct kiocb *iocb,
- const struct iovec *iov,
- loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter,
+ loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file)->i_mapping->host;
@@ -641,8 +640,7 @@ static ssize_t ocfs2_direct_IO(int rw,
return 0;

return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
- iov, offset, nr_segs,
- ocfs2_direct_IO_get_blocks,
+ iter, offset, ocfs2_direct_IO_get_blocks,
ocfs2_dio_end_io, NULL, 0);
}

diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 0048cc1..9507e17 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3079,14 +3079,13 @@ static int reiserfs_releasepage(struct page *page, gfp_t unused_gfp_flags)
/* We thank Mingming Cao for helping us understand in great detail what
to do in this section of the code. */
static ssize_t reiserfs_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov, loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
reiserfs_get_blocks_direct_io);

/*
@@ -3095,7 +3094,7 @@ static ssize_t reiserfs_direct_IO(int rw, struct kiocb *iocb,
*/
if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
- loff_t end = offset + iov_length(iov, nr_segs);
+ loff_t end = offset + iov_iter_count(iter);

if ((end > isize) && inode_newsize_ok(inode, isize) == 0) {
truncate_setsize(inode, isize);
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 29569dd..339df8b 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -119,8 +119,7 @@ static int udf_adinicb_write_end(struct file *file,
}

static ssize_t udf_adinicb_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+ struct iov_iter *iter, loff_t offset)
{
/* Fallback to buffered I/O. */
return 0;
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index b6d15d3..fad32d5 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -216,19 +216,17 @@ static int udf_write_begin(struct file *file, struct address_space *mapping,
return ret;
}

-static ssize_t udf_direct_IO(int rw, struct kiocb *iocb,
- const struct iovec *iov,
- loff_t offset, unsigned long nr_segs)
+static ssize_t udf_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
+ loff_t offset)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
- udf_get_block);
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, udf_get_block);
if (unlikely(ret < 0 && (rw & WRITE)))
- udf_write_failed(mapping, offset + iov_length(iov, nr_segs));
+ udf_write_failed(mapping, offset + iov_iter_count(iter));
return ret;
}

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 596ec71..4568b6e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1431,9 +1431,8 @@ STATIC ssize_t
xfs_vm_direct_IO(
int rw,
struct kiocb *iocb,
- const struct iovec *iov,
- loff_t offset,
- unsigned long nr_segs)
+ struct iov_iter *iter,
+ loff_t offset)
{
struct inode *inode = iocb->ki_filp->f_mapping->host;
struct block_device *bdev = xfs_find_bdev_for_inode(inode);
@@ -1441,7 +1440,7 @@ xfs_vm_direct_IO(
ssize_t ret;

if (rw & WRITE) {
- size_t size = iov_length(iov, nr_segs);
+ size_t size = iov_iter_count(iter);

/*
* We cannot preallocate a size update transaction here as we
@@ -1453,15 +1452,13 @@ xfs_vm_direct_IO(
if (offset + size > XFS_I(inode)->i_d.di_size)
ioend->io_isdirect = 1;

- ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
- offset, nr_segs,
+ ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
xfs_get_blocks_direct,
xfs_end_io_direct_write, NULL, 0);
if (ret != -EIOCBQUEUED && iocb->private)
goto out_destroy_ioend;
} else {
- ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov,
- offset, nr_segs,
+ ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iter, offset,
xfs_get_blocks_direct,
NULL, NULL, 0);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 322d585..2ddd8e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -462,8 +462,8 @@ struct address_space_operations {
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
- ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
- loff_t offset, unsigned long nr_segs);
+ ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter,
+ loff_t offset);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
void **, unsigned long *);
/*
@@ -2562,16 +2562,16 @@ enum {
void dio_end_io(struct bio *bio, int error);

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
- struct block_device *bdev, const struct iovec *iov, loff_t offset,
- unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
- dio_submit_t submit_io, int flags);
+ struct block_device *bdev, struct iov_iter *iter, loff_t offset,
+ get_block_t get_block, dio_iodone_t end_io, dio_submit_t submit_io,
+ int flags);

static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
- struct inode *inode, const struct iovec *iov, loff_t offset,
- unsigned long nr_segs, get_block_t get_block)
+ struct inode *inode, struct iov_iter *iter, loff_t offset,
+ get_block_t get_block)
{
- return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
- offset, nr_segs, get_block, NULL, NULL,
+ return __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iter,
+ offset, get_block, NULL, NULL,
DIO_LOCKING | DIO_SKIP_HOLES);
}
#endif
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 7125cef..a4b19d2 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -457,8 +457,7 @@ extern int nfs3_removexattr (struct dentry *, const char *name);
/*
* linux/fs/nfs/direct.c
*/
-extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t,
- unsigned long);
+extern ssize_t nfs_direct_IO(int, struct kiocb *, struct iov_iter *, loff_t);
extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
const struct iovec *iov, unsigned long nr_segs,
loff_t pos, bool uio);
diff --git a/mm/filemap.c b/mm/filemap.c
index 11ebe36..e140e38 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1427,11 +1427,15 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
goto out; /* skip atime */
size = i_size_read(inode);
if (pos < size) {
+ size_t bytes = iov_length(iov, nr_segs);
retval = filemap_write_and_wait_range(mapping, pos,
- pos + iov_length(iov, nr_segs) - 1);
+ pos + bytes - 1);
if (!retval) {
+ struct iov_iter iter;
+
+ iov_iter_init(&iter, iov, nr_segs, bytes, 0);
retval = mapping->a_ops->direct_IO(READ, iocb,
- iov, pos, nr_segs);
+ &iter, pos);
}
if (retval > 0) {
*ppos = pos + retval;
@@ -2056,6 +2060,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
ssize_t written;
size_t write_len;
pgoff_t end;
+ struct iov_iter iter;

if (count != ocount)
*nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
@@ -2087,7 +2092,9 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
+ iov_iter_init(&iter, iov, *nr_segs, write_len, 0);
+
+ written = mapping->a_ops->direct_IO(WRITE, iocb, &iter, pos);

/*
* Finally, try again to invalidate clean pages which might have been
diff --git a/mm/page_io.c b/mm/page_io.c
index ba05b64..0c1db1a 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -263,6 +263,9 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
.iov_base = kmap(page),
.iov_len = PAGE_SIZE,
};
+ struct iov_iter iter;
+
+ iov_iter_init(&iter, &iov, 1, PAGE_SIZE, 0);

init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
@@ -271,9 +274,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

set_page_writeback(page);
unlock_page(page);
- ret = mapping->a_ops->direct_IO(KERNEL_WRITE,
- &kiocb, &iov,
- kiocb.ki_pos, 1);
+ ret = mapping->a_ops->direct_IO(KERNEL_WRITE, &kiocb, &iter,
+ kiocb.ki_pos);
kunmap(page);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
--
1.8.3.4

2013-07-25 18:09:03

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 14/33] aio: add aio_kernel_() interface

This adds an interface that lets kernel callers submit aio iocbs without
going through the user space syscalls. This lets kernel callers avoid
the management limits and overhead of an aio context. It will also let
us integrate aio operations with other kernel APIs that the user space
interface doesn't have access to.
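
As a usage illustration (not code from the series — submit_kernel_read()
and my_io_done() are made-up names and the error handling is pared down;
the aio_kernel_*() calls are the interfaces added below, and
IOCB_CMD_PREAD is the existing read opcode), a kernel caller would drive
this roughly as follows. Later patches in the series add iter-based
opcodes for memory that truly lives in the kernel.

#include <linux/aio.h>
#include <linux/fs.h>

static void my_io_done(u64 user_data, long res)
{
	/* res is the byte count on success or a negative errno */
	pr_debug("io %llu completed: %ld\n",
		 (unsigned long long)user_data, res);
}

static int submit_kernel_read(struct file *filp, void *buf,
			      size_t len, loff_t pos)
{
	struct kiocb *iocb = aio_kernel_alloc(GFP_KERNEL);

	if (!iocb)
		return -ENOMEM;

	aio_kernel_init_rw(iocb, filp, IOCB_CMD_PREAD, buf, len, pos);
	aio_kernel_init_callback(iocb, my_io_done, 0);

	/* on failure the iocb has already been freed for us */
	return aio_kernel_submit(iocb);
}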

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 17 ++++++++++-
2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9b5ca11..c65ba13 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -596,6 +596,10 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
atomic_set(&iocb->ki_users, 0);
wake_up_process(iocb->ki_obj.tsk);
return;
+ } else if (is_kernel_kiocb(iocb)) {
+ iocb->ki_obj.complete(iocb->ki_user_data, res);
+ aio_kernel_free(iocb);
+ return;
}

/*
@@ -1072,6 +1076,83 @@ rw_common:
return 0;
}

+/*
+ * This allocates an iocb that will be used to submit and track completion of
+ * an IO that is issued from kernel space.
+ *
+ * The caller is expected to call the appropriate aio_kernel_init_() functions
+ * and then call aio_kernel_submit(). From that point forward progress is
+ * guaranteed by the file system aio method. Eventually the caller's
+ * completion callback will be called.
+ *
+ * These iocbs are special. They don't have a context, we don't limit the
+ * number pending, and they can't be canceled.
+ */
+struct kiocb *aio_kernel_alloc(gfp_t gfp)
+{
+ return kzalloc(sizeof(struct kiocb), gfp);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_alloc);
+
+void aio_kernel_free(struct kiocb *iocb)
+{
+ kfree(iocb);
+}
+EXPORT_SYMBOL_GPL(aio_kernel_free);
+
+/*
+ * ptr and count can be a buff and bytes or an iov and segs.
+ */
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off)
+{
+ iocb->ki_filp = filp;
+ iocb->ki_opcode = op;
+ iocb->ki_buf = (char __user *)(unsigned long)ptr;
+ iocb->ki_left = nr;
+ iocb->ki_nbytes = nr;
+ iocb->ki_pos = off;
+ iocb->ki_ctx = (void *)-1;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
+
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data)
+{
+ iocb->ki_obj.complete = complete;
+ iocb->ki_user_data = user_data;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_callback);
+
+/*
+ * The iocb is our responsibility once this is called. The caller must not
+ * reference it.
+ *
+ * Callers must be prepared for their iocb completion callback to be called the
+ * moment they enter this function. The completion callback may be called from
+ * any context.
+ *
+ * Returns: 0: the iocb completion callback will be called with the op result
+ * negative errno: the operation was not submitted and the iocb was freed
+ */
+int aio_kernel_submit(struct kiocb *iocb)
+{
+ int ret;
+
+ BUG_ON(!is_kernel_kiocb(iocb));
+ BUG_ON(!iocb->ki_obj.complete);
+ BUG_ON(!iocb->ki_filp);
+
+ ret = aio_run_iocb(iocb, 0);
+
+ if (ret)
+ aio_kernel_free(iocb);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_submit);
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, bool compat)
{
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 1bdf965..014a75d 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -33,13 +33,15 @@ struct kiocb {
atomic_t ki_users;

struct file *ki_filp;
- struct kioctx *ki_ctx; /* NULL for sync ops */
+ struct kioctx *ki_ctx; /* NULL for sync ops,
+ -1 for kernel caller */
kiocb_cancel_fn *ki_cancel;
void (*ki_dtor)(struct kiocb *);

union {
void __user *user;
struct task_struct *tsk;
+ void (*complete)(u64 user_data, long res);
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
@@ -71,6 +73,11 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
return kiocb->ki_ctx == NULL;
}

+static inline bool is_kernel_kiocb(struct kiocb *kiocb)
+{
+ return kiocb->ki_ctx == (void *)-1;
+}
+
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
@@ -91,6 +98,14 @@ extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
struct iocb __user *__user *iocbpp, bool compat);
void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
+struct kiocb *aio_kernel_alloc(gfp_t gfp);
+void aio_kernel_free(struct kiocb *iocb);
+void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
+ unsigned short op, void *ptr, size_t nr, loff_t off);
+void aio_kernel_init_callback(struct kiocb *iocb,
+ void (*complete)(u64 user_data, long res),
+ u64 user_data);
+int aio_kernel_submit(struct kiocb *iocb);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
static inline void aio_put_req(struct kiocb *iocb) { }
--
1.8.3.4

2013-07-25 18:08:58

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 06/33] iov_iter: hide iovec details behind ops function pointers

From: Zach Brown <[email protected]>

This moves the current iov_iter functions behind an ops struct of
function pointers. The current iov_iter functions all work with memory
which is specified by iovec arrays of user space pointers.

This patch is part of a series that lets us specify memory with bio_vec
arrays of page pointers. By moving to an iov_iter ops struct, we can add
that support later in the series simply by adding another set of
function pointers.

I only came to this after initially trying to teach the current
iov_iter functions about bio_vecs by introducing conditional branches
that dealt with bio_vecs in all the functions. It wasn't pretty. This
approach seems to be the lesser evil.
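
To make the indirection concrete: after this patch the iter carries a
pointer to its ops table and the existing entry points become one-line
wrappers, so the bio_vec support only has to supply a second table. A
sketch of how that second flavour plugs in (iov_iter_init_bvec and
ii_bvec_ops are the names the later bvec patch is expected to use; treat
them as illustrative here):

extern struct iov_iter_ops ii_bvec_ops;

static inline void iov_iter_init_bvec(struct iov_iter *i,
		struct bio_vec *bvec, unsigned long nr_segs,
		size_t count, size_t written)
{
	i->ops = &ii_bvec_ops;		/* the only line that differs */
	i->data = (unsigned long)bvec;
	i->nr_segs = nr_segs;
	i->iov_offset = 0;
	i->count = count + written;

	iov_iter_advance(i, written);
}

Callers never look at i->data directly, so code written against the
iov_iter_*() wrappers works unchanged with either flavour.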

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/cifs/file.c | 4 +--
fs/fuse/file.c | 5 ++--
fs/iov-iter.c | 86 +++++++++++++++++++++++++++++-------------------------
include/linux/fs.h | 77 ++++++++++++++++++++++++++++++++++++++----------
4 files changed, 114 insertions(+), 58 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 1e57f36..b5f9d3d 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2733,8 +2733,8 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, const struct iovec *iov,
/* go while there's data to be copied and no errors */
if (copy && !rc) {
pdata = kmap(page);
- rc = memcpy_toiovecend(ii.iov, pdata, ii.iov_offset,
- (int)copy);
+ rc = memcpy_toiovecend(iov_iter_iovec(&ii), pdata,
+ ii.iov_offset, (int)copy);
kunmap(page);
if (!rc) {
*copied += copy;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 633766c..77865d1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1172,9 +1172,10 @@ static inline void fuse_page_descs_length_init(struct fuse_req *req,
req->page_descs[i].offset;
}

-static inline unsigned long fuse_get_user_addr(const struct iov_iter *ii)
+static inline unsigned long fuse_get_user_addr(struct iov_iter *ii)
{
- return (unsigned long)ii->iov->iov_base + ii->iov_offset;
+ struct iovec *iov = iov_iter_iovec(ii);
+ return (unsigned long)iov->iov_base + ii->iov_offset;
}

static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 6cecab4..6cb6be0 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -36,9 +36,10 @@ static size_t __iovec_copy_to_user(char *vaddr, const struct iovec *iov,
* were sucessfully copied. If a fault is encountered then return the number of
* bytes which were copied.
*/
-size_t iov_iter_copy_to_user_atomic(struct page *page,
+static size_t ii_iovec_copy_to_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

@@ -46,55 +47,52 @@ size_t iov_iter_copy_to_user_atomic(struct page *page,
kaddr = kmap_atomic(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_to_user_inatomic(buf, kaddr + offset, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_to_user(kaddr + offset, i->iov,
+ copied = __iovec_copy_to_user(kaddr + offset, iov,
i->iov_offset, bytes, 1);
}
kunmap_atomic(kaddr);

return copied;
}
-EXPORT_SYMBOL(iov_iter_copy_to_user_atomic);

/*
* This has the same side effects and return value as
- * iov_iter_copy_to_user_atomic().
+ * ii_iovec_copy_to_user_atomic().
* The difference is that it attempts to resolve faults.
* Page must not be locked.
*/
-size_t __iov_iter_copy_to_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
+static size_t ii_iovec_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes,
+ int check_access)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

+ if (check_access) {
+ might_sleep();
+ if (generic_segment_checks(iov, &i->nr_segs, &bytes,
+ VERIFY_WRITE))
+ return 0;
+ }
+
kaddr = kmap(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = copy_to_user(buf, kaddr + offset, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_to_user(kaddr + offset, i->iov,
+ copied = __iovec_copy_to_user(kaddr + offset, iov,
i->iov_offset, bytes, 0);
}
kunmap(page);
return copied;
}
-EXPORT_SYMBOL(__iov_iter_copy_to_user);
-
-size_t iov_iter_copy_to_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes)
-{
- might_sleep();
- if (generic_segment_checks(i->iov, &i->nr_segs, &bytes, VERIFY_WRITE))
- return 0;
- return __iov_iter_copy_to_user(page, i, offset, bytes);
-}
-EXPORT_SYMBOL(iov_iter_copy_to_user);

static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
size_t base, size_t bytes, int atomic)
@@ -126,9 +124,10 @@ static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
* were successfully copied. If a fault is encountered then return the number
* of bytes which were copied.
*/
-size_t iov_iter_copy_from_user_atomic(struct page *page,
+static size_t ii_iovec_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

@@ -136,11 +135,11 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
kaddr = kmap_atomic(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_from_user_inatomic(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_from_user(kaddr + offset, i->iov,
+ copied = __iovec_copy_from_user(kaddr + offset, iov,
i->iov_offset, bytes, 1);
}
kunmap_atomic(kaddr);
@@ -151,32 +150,32 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);

/*
* This has the same side effects and return value as
- * iov_iter_copy_from_user_atomic().
+ * ii_iovec_copy_from_user_atomic().
* The difference is that it attempts to resolve faults.
* Page must not be locked.
*/
-size_t iov_iter_copy_from_user(struct page *page,
+static size_t ii_iovec_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
+ struct iovec *iov = (struct iovec *)i->data;
char *kaddr;
size_t copied;

kaddr = kmap(page);
if (likely(i->nr_segs == 1)) {
int left;
- char __user *buf = i->iov->iov_base + i->iov_offset;
+ char __user *buf = iov->iov_base + i->iov_offset;
left = __copy_from_user(kaddr + offset, buf, bytes);
copied = bytes - left;
} else {
- copied = __iovec_copy_from_user(kaddr + offset, i->iov,
+ copied = __iovec_copy_from_user(kaddr + offset, iov,
i->iov_offset, bytes, 0);
}
kunmap(page);
return copied;
}
-EXPORT_SYMBOL(iov_iter_copy_from_user);

-void iov_iter_advance(struct iov_iter *i, size_t bytes)
+static void ii_iovec_advance(struct iov_iter *i, size_t bytes)
{
BUG_ON(i->count < bytes);

@@ -184,7 +183,7 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
i->iov_offset += bytes;
i->count -= bytes;
} else {
- const struct iovec *iov = i->iov;
+ struct iovec *iov = (struct iovec *)i->data;
size_t base = i->iov_offset;
unsigned long nr_segs = i->nr_segs;

@@ -206,12 +205,11 @@ void iov_iter_advance(struct iov_iter *i, size_t bytes)
base = 0;
}
}
- i->iov = iov;
+ i->data = (unsigned long)iov;
i->iov_offset = base;
i->nr_segs = nr_segs;
}
}
-EXPORT_SYMBOL(iov_iter_advance);

/*
* Fault in the first iovec of the given iov_iter, to a maximum length
@@ -222,23 +220,33 @@ EXPORT_SYMBOL(iov_iter_advance);
* would be possible (callers must not rely on the fact that _only_ the
* first iovec will be faulted with the current implementation).
*/
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+static int ii_iovec_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
+ struct iovec *iov = (struct iovec *)i->data;
+ char __user *buf = iov->iov_base + i->iov_offset;
+ bytes = min(bytes, iov->iov_len - i->iov_offset);
return fault_in_pages_readable(buf, bytes);
}
-EXPORT_SYMBOL(iov_iter_fault_in_readable);

/*
* Return the count of just the current iov_iter segment.
*/
-size_t iov_iter_single_seg_count(const struct iov_iter *i)
+static size_t ii_iovec_single_seg_count(const struct iov_iter *i)
{
- const struct iovec *iov = i->iov;
+ const struct iovec *iov = (struct iovec *)i->data;
if (i->nr_segs == 1)
return i->count;
else
return min(i->count, iov->iov_len - i->iov_offset);
}
-EXPORT_SYMBOL(iov_iter_single_seg_count);
+
+struct iov_iter_ops ii_iovec_ops = {
+ .ii_copy_to_user_atomic = ii_iovec_copy_to_user_atomic,
+ .ii_copy_to_user = ii_iovec_copy_to_user,
+ .ii_copy_from_user_atomic = ii_iovec_copy_from_user_atomic,
+ .ii_copy_from_user = ii_iovec_copy_from_user,
+ .ii_advance = ii_iovec_advance,
+ .ii_fault_in_readable = ii_iovec_fault_in_readable,
+ .ii_single_seg_count = ii_iovec_single_seg_count,
+};
+EXPORT_SYMBOL(ii_iovec_ops);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bfc6eb0..96120d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -290,31 +290,73 @@ struct address_space;
struct writeback_control;

struct iov_iter {
- const struct iovec *iov;
+ struct iov_iter_ops *ops;
+ unsigned long data;
unsigned long nr_segs;
size_t iov_offset;
size_t count;
};

-size_t __iov_iter_copy_to_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t __iov_iter_copy_to_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_to_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_from_user_atomic(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-size_t iov_iter_copy_from_user(struct page *page,
- struct iov_iter *i, unsigned long offset, size_t bytes);
-void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
-size_t iov_iter_single_seg_count(const struct iov_iter *i);
+struct iov_iter_ops {
+ size_t (*ii_copy_to_user_atomic)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ size_t (*ii_copy_to_user)(struct page *, struct iov_iter *,
+ unsigned long, size_t, int);
+ size_t (*ii_copy_from_user_atomic)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ size_t (*ii_copy_from_user)(struct page *, struct iov_iter *,
+ unsigned long, size_t);
+ void (*ii_advance)(struct iov_iter *, size_t);
+ int (*ii_fault_in_readable)(struct iov_iter *, size_t);
+ size_t (*ii_single_seg_count)(const struct iov_iter *);
+};
+
+static inline size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_to_user_atomic(page, i, offset, bytes);
+}
+static inline size_t __iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_to_user(page, i, offset, bytes, 0);
+}
+static inline size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_to_user(page, i, offset, bytes, 1);
+}
+static inline size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_from_user_atomic(page, i, offset, bytes);
+}
+static inline size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ return i->ops->ii_copy_from_user(page, i, offset, bytes);
+}
+static inline void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+ return i->ops->ii_advance(i, bytes);
+}
+static inline int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
+{
+ return i->ops->ii_fault_in_readable(i, bytes);
+}
+static inline size_t iov_iter_single_seg_count(const struct iov_iter *i)
+{
+ return i->ops->ii_single_seg_count(i);
+}
+
+extern struct iov_iter_ops ii_iovec_ops;

static inline void iov_iter_init(struct iov_iter *i,
const struct iovec *iov, unsigned long nr_segs,
size_t count, size_t written)
{
- i->iov = iov;
+ i->ops = &ii_iovec_ops;
+ i->data = (unsigned long)iov;
i->nr_segs = nr_segs;
i->iov_offset = 0;
i->count = count + written;
@@ -322,6 +364,11 @@ static inline void iov_iter_init(struct iov_iter *i,
iov_iter_advance(i, written);
}

+static inline struct iovec *iov_iter_iovec(struct iov_iter *i)
+{
+ return (struct iovec *)i->data;
+}
+
static inline size_t iov_iter_count(struct iov_iter *i)
{
return i->count;
--
1.8.3.4

2013-07-25 18:09:00

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 13/33] fs: pull iov_iter use higher up the stack

From: Zach Brown <[email protected]>

Right now only callers of generic_perform_write() pack their iovec
arguments into an iov_iter structure. All the callers higher up in the
stack work on raw iovec arguments.

This patch introduces the use of the iov_iter abstraction higher up the
stack. Private generic-path functions are changed to operate on an
iov_iter instead of on raw iovecs. Exported interfaces that take iovecs
immediately pack their arguments into an iov_iter and call into the
shared functions.

File operation struct functions are added with iov_iter as an argument
so that callers to the generic file system functions can specify
abstract memory rather than iovec arrays only.

Almost all of this patch only transforms arguments and shouldn't change
functionality. The buffered read path is the exception. We add a
read_actor function which uses the iov_iter helper functions instead of
operating on each individual iovec element. This may improve
performance as the iov_iter helper can copy multiple iovec elements from
one mapped page cache page.

As always, the direct IO path is special. Sadly, it may still be
cleanest to have it work on the underlying memory structures directly
instead of working through the iov_iter abstraction.
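
The conversion pattern is uniform: each exported iovec entry point
shrinks to a thin wrapper that validates the segments, packs them into a
stack iov_iter and calls its _iter sibling. Condensed from the
mm/filemap.c hunk below:

ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
			      unsigned long nr_segs, loff_t pos)
{
	struct iov_iter iter;
	size_t count = 0;
	int ret;

	/* validate the user segments and total their length */
	ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
	if (ret)
		return ret;

	iov_iter_init(&iter, iov, nr_segs, count, 0);
	return generic_file_read_iter(iocb, &iter, pos);
}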

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
Documentation/filesystems/Locking | 2 +
Documentation/filesystems/vfs.txt | 8 ++
include/linux/fs.h | 12 ++
mm/filemap.c | 258 +++++++++++++++++++++++++-------------
4 files changed, 190 insertions(+), 90 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index ff1e311..21ef48f 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -426,7 +426,9 @@ prototypes:
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);
int (*iterate) (struct file *, struct dir_context *);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 461bee1..f8749f7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -790,7 +790,9 @@ struct file_operations {
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);
int (*iterate) (struct file *, struct dir_context *);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -825,10 +827,16 @@ otherwise noted.

aio_read: called by io_submit(2) and other asynchronous I/O operations

+ read_iter: aio_read replacement, called by io_submit(2) and other
+ asynchronous I/O operations
+
write: called by write(2) and related system calls

aio_write: called by io_submit(2) and other asynchronous I/O operations

+ write_iter: aio_write replacement, called by io_submit(2) and other
+ asynchronous I/O operations
+
iterate: called when the VFS needs to read the directory contents

poll: called by the VFS when a process wants to check if there is
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ddd8e3..d716a29 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1625,7 +1625,9 @@ struct file_operations {
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
+ ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, loff_t);
int (*iterate) (struct file *, struct dir_context *);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -2491,13 +2493,23 @@ extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *,
+ loff_t);
extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long,
loff_t *);
+extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t *);
extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
+extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t);
extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
+extern ssize_t generic_file_direct_write_iter(struct kiocb *, struct iov_iter *,
+ loff_t, loff_t *, size_t);
extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern ssize_t generic_file_buffered_write_iter(struct kiocb *,
+ struct iov_iter *, loff_t, loff_t *, size_t, ssize_t);
extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
extern int generic_segment_checks(const struct iovec *iov,
diff --git a/mm/filemap.c b/mm/filemap.c
index e140e38..41b9672 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1390,31 +1390,41 @@ int generic_segment_checks(const struct iovec *iov,
}
EXPORT_SYMBOL(generic_segment_checks);

+static int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
+ unsigned long offset, unsigned long size)
+{
+ struct iov_iter *iter = desc->arg.data;
+ unsigned long copied = 0;
+
+ if (size > desc->count)
+ size = desc->count;
+
+ copied = __iov_iter_copy_to_user(page, iter, offset, size);
+ if (copied < size)
+ desc->error = -EFAULT;
+
+ iov_iter_advance(iter, copied);
+ desc->count -= copied;
+ desc->written += copied;
+
+ return copied;
+}
+
/**
- * generic_file_aio_read - generic filesystem read routine
+ * generic_file_read_iter - generic filesystem read routine
* @iocb: kernel I/O control block
- * @iov: io vector request
- * @nr_segs: number of segments in the iovec
+ * @iter: memory vector
* @pos: current file position
- *
- * This is the "read()" routine for all filesystems
- * that can use the page cache directly.
*/
ssize_t
-generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos)
+generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, loff_t pos)
{
struct file *filp = iocb->ki_filp;
- ssize_t retval;
- unsigned long seg = 0;
- size_t count;
+ read_descriptor_t desc;
+ ssize_t retval = 0;
+ size_t count = iov_iter_count(iter);
loff_t *ppos = &iocb->ki_pos;

- count = 0;
- retval = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
- if (retval)
- return retval;
-
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (filp->f_flags & O_DIRECT) {
loff_t size;
@@ -1427,16 +1437,11 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
goto out; /* skip atime */
size = i_size_read(inode);
if (pos < size) {
- size_t bytes = iov_length(iov, nr_segs);
retval = filemap_write_and_wait_range(mapping, pos,
- pos + bytes - 1);
- if (!retval) {
- struct iov_iter iter;
-
- iov_iter_init(&iter, iov, nr_segs, bytes, 0);
+ pos + count - 1);
+ if (!retval)
retval = mapping->a_ops->direct_IO(READ, iocb,
- &iter, pos);
- }
+ iter, pos);
if (retval > 0) {
*ppos = pos + retval;
count -= retval;
@@ -1457,42 +1462,47 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
}
}

- count = retval;
- for (seg = 0; seg < nr_segs; seg++) {
- read_descriptor_t desc;
- loff_t offset = 0;
-
- /*
- * If we did a short DIO read we need to skip the section of the
- * iov that we've already read data into.
- */
- if (count) {
- if (count > iov[seg].iov_len) {
- count -= iov[seg].iov_len;
- continue;
- }
- offset = count;
- count = 0;
- }
-
- desc.written = 0;
- desc.arg.buf = iov[seg].iov_base + offset;
- desc.count = iov[seg].iov_len - offset;
- if (desc.count == 0)
- continue;
- desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
- retval += desc.written;
- if (desc.error) {
- retval = retval ?: desc.error;
- break;
- }
- if (desc.count > 0)
- break;
- }
+ desc.written = 0;
+ desc.arg.data = iter;
+ desc.count = count;
+ desc.error = 0;
+ do_generic_file_read(filp, ppos, &desc, file_read_iter_actor);
+ if (desc.written)
+ retval = desc.written;
+ else
+ retval = desc.error;
out:
return retval;
}
+EXPORT_SYMBOL(generic_file_read_iter);
+
+/**
+ * generic_file_aio_read - generic filesystem read routine
+ * @iocb: kernel I/O control block
+ * @iov: io vector request
+ * @nr_segs: number of segments in the iovec
+ * @pos: current file position
+ *
+ * This is the "read()" routine for all filesystems
+ * that can use the page cache directly.
+ */
+ssize_t
+generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos)
+{
+ struct iov_iter iter;
+ int ret;
+ size_t count;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_WRITE);
+ if (ret)
+ return ret;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ return generic_file_read_iter(iocb, &iter, pos);
+}
EXPORT_SYMBOL(generic_file_aio_read);

#ifdef CONFIG_MMU
@@ -2050,9 +2060,8 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
EXPORT_SYMBOL(pagecache_write_end);

ssize_t
-generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long *nr_segs, loff_t pos, loff_t *ppos,
- size_t count, size_t ocount)
+generic_file_direct_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, loff_t *ppos, size_t count)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -2060,12 +2069,14 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
ssize_t written;
size_t write_len;
pgoff_t end;
- struct iov_iter iter;

- if (count != ocount)
- *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count);
+ if (count != iov_iter_count(iter)) {
+ written = iov_iter_shorten(iter, count);
+ if (written)
+ goto out;
+ }

- write_len = iov_length(iov, *nr_segs);
+ write_len = count;
end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;

written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
@@ -2092,9 +2103,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- iov_iter_init(&iter, iov, *nr_segs, write_len, 0);
-
- written = mapping->a_ops->direct_IO(WRITE, iocb, &iter, pos);
+ written = mapping->a_ops->direct_IO(WRITE, iocb, iter, pos);

/*
* Finally, try again to invalidate clean pages which might have been
@@ -2120,6 +2129,23 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
out:
return written;
}
+EXPORT_SYMBOL(generic_file_direct_write_iter);
+
+ssize_t
+generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long *nr_segs, loff_t pos, loff_t *ppos,
+ size_t count, size_t ocount)
+{
+ struct iov_iter iter;
+ ssize_t ret;
+
+ iov_iter_init(&iter, iov, *nr_segs, ocount, 0);
+ ret = generic_file_direct_write_iter(iocb, &iter, pos, ppos, count);
+ /* generic_file_direct_write_iter() might have shortened the vec */
+ if (*nr_segs != iter.nr_segs)
+ *nr_segs = iter.nr_segs;
+ return ret;
+}
EXPORT_SYMBOL(generic_file_direct_write);

/*
@@ -2253,16 +2279,19 @@ again:
}

ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, loff_t *ppos,
- size_t count, ssize_t written)
+generic_file_buffered_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos, loff_t *ppos, size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
ssize_t status;
- struct iov_iter i;

- iov_iter_init(&i, iov, nr_segs, count, written);
- status = generic_perform_write(file, &i, pos);
+ if ((count + written) != iov_iter_count(iter)) {
+ int rc = iov_iter_shorten(iter, count + written);
+ if (rc)
+ return rc;
+ }
+
+ status = generic_perform_write(file, iter, pos);

if (likely(status >= 0)) {
written += status;
@@ -2271,13 +2300,24 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,

return written ? written : status;
}
+EXPORT_SYMBOL(generic_file_buffered_write_iter);
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos, loff_t *ppos,
+ size_t count, ssize_t written)
+{
+ struct iov_iter iter;
+ iov_iter_init(&iter, iov, nr_segs, count, written);
+ return generic_file_buffered_write_iter(iocb, &iter, pos, ppos,
+ count, written);
+}
EXPORT_SYMBOL(generic_file_buffered_write);

/**
* __generic_file_aio_write - write data to a file
* @iocb: IO state structure (file, offset, etc.)
- * @iov: vector with data to write
- * @nr_segs: number of segments in the vector
+ * @iter: iov_iter specifying memory to write
* @ppos: position where to write
*
* This function does all the work needed for actually writing data to a
@@ -2292,24 +2332,18 @@ EXPORT_SYMBOL(generic_file_buffered_write);
* A caller has to handle it. This is mainly due to the fact that we want to
* avoid syncing under i_mutex.
*/
-ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t *ppos)
+ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t *ppos)
{
struct file *file = iocb->ki_filp;
struct address_space * mapping = file->f_mapping;
- size_t ocount; /* original count */
size_t count; /* after file limit checks */
struct inode *inode = mapping->host;
loff_t pos;
ssize_t written;
ssize_t err;

- ocount = 0;
- err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
- if (err)
- return err;
-
- count = ocount;
+ count = iov_iter_count(iter);
pos = *ppos;

/* We can write back this queue in page reclaim */
@@ -2336,8 +2370,8 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
loff_t endbyte;
ssize_t written_buffered;

- written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
- ppos, count, ocount);
+ written = generic_file_direct_write_iter(iocb, iter, pos,
+ ppos, count);
if (written < 0 || written == count)
goto out;
/*
@@ -2346,9 +2380,9 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
*/
pos += written;
count -= written;
- written_buffered = generic_file_buffered_write(iocb, iov,
- nr_segs, pos, ppos, count,
- written);
+ iov_iter_advance(iter, written);
+ written_buffered = generic_file_buffered_write_iter(iocb, iter,
+ pos, ppos, count, written);
/*
* If generic_file_buffered_write() returned a synchronous error
* then we want to return the number of bytes which were
@@ -2380,13 +2414,57 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
*/
}
} else {
- written = generic_file_buffered_write(iocb, iov, nr_segs,
+ iter->count = count;
+ written = generic_file_buffered_write_iter(iocb, iter,
pos, ppos, count, written);
}
out:
current->backing_dev_info = NULL;
return written ? written : err;
}
+EXPORT_SYMBOL(__generic_file_write_iter);
+
+ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+
+ mutex_lock(&inode->i_mutex);
+ ret = __generic_file_write_iter(iocb, iter, &iocb->ki_pos);
+ mutex_unlock(&inode->i_mutex);
+
+ if (ret > 0 || ret == -EIOCBQUEUED) {
+ ssize_t err;
+
+ err = generic_write_sync(file, pos, ret);
+ if (err < 0 && ret > 0)
+ ret = err;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(generic_file_write_iter);
+
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t *ppos)
+{
+ struct iov_iter iter;
+ size_t count;
+ int ret;
+
+ count = 0;
+ ret = generic_segment_checks(iov, &nr_segs, &count, VERIFY_READ);
+ if (ret)
+ goto out;
+
+ iov_iter_init(&iter, iov, nr_segs, count, 0);
+
+ ret = __generic_file_write_iter(iocb, &iter, ppos);
+out:
+ return ret;
+}
EXPORT_SYMBOL(__generic_file_aio_write);

/**
--
1.8.3.4

2013-07-25 18:10:45

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 03/33] iov_iter: add copy_to_user support

From: Zach Brown <[email protected]>

This adds iov_iter wrappers around copy_to_user() to match the existing
wrappers around copy_from_user().

This will be used by the generic file system buffered read path.
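
Specifically, the buffered read path (patch 13 of this series) calls
these wrappers from a read actor that copies straight out of a mapped
page-cache page into whatever the iter describes, crossing iovec
element boundaries within a single call:

static int file_read_iter_actor(read_descriptor_t *desc, struct page *page,
				unsigned long offset, unsigned long size)
{
	struct iov_iter *iter = desc->arg.data;
	unsigned long copied;

	if (size > desc->count)
		size = desc->count;

	copied = __iov_iter_copy_to_user(page, iter, offset, size);
	if (copied < size)
		desc->error = -EFAULT;

	iov_iter_advance(iter, copied);
	desc->count -= copied;
	desc->written += copied;

	return copied;
}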

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/iov-iter.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 4 +++
2 files changed, 84 insertions(+)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 563a6ba..0b2407e 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -6,6 +6,86 @@
#include <linux/highmem.h>
#include <linux/pagemap.h>

+static size_t __iovec_copy_to_user(char *vaddr, const struct iovec *iov,
+ size_t base, size_t bytes, int atomic)
+{
+ size_t copied = 0, left = 0;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + base;
+ int copy = min(bytes, iov->iov_len - base);
+
+ base = 0;
+ if (atomic)
+ left = __copy_to_user_inatomic(buf, vaddr, copy);
+ else
+ left = copy_to_user(buf, vaddr, copy);
+ copied += copy;
+ bytes -= copy;
+ vaddr += copy;
+ iov++;
+
+ if (unlikely(left))
+ break;
+ }
+ return copied - left;
+}
+
+/*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were successfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ BUG_ON(!in_atomic());
+ kaddr = kmap_atomic(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_to_user_inatomic(buf, kaddr + offset, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_to_user(kaddr + offset, i->iov,
+ i->iov_offset, bytes, 1);
+ }
+ kunmap_atomic(kaddr);
+
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_to_user_atomic);
+
+/*
+ * This has the same side effects and return value as
+ * iov_iter_copy_to_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = copy_to_user(buf, kaddr + offset, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_to_user(kaddr + offset, i->iov,
+ i->iov_offset, bytes, 0);
+ }
+ kunmap(page);
+ return copied;
+}
+EXPORT_SYMBOL(iov_iter_copy_to_user);
+
static size_t __iovec_copy_from_user(char *vaddr, const struct iovec *iov,
size_t base, size_t bytes, int atomic)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..80f71df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -296,6 +296,10 @@ struct iov_iter {
size_t count;
};

+size_t iov_iter_copy_to_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_to_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
size_t iov_iter_copy_from_user(struct page *page,
--
1.8.3.4

2013-07-25 18:10:41

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 15/33] aio: add aio support for iov_iter arguments

This adds iocb cmds which specify that memory is held in iov_iter
structures. This lets kernel callers specify memory that can be
expressed in an iov_iter, which includes pages in bio_vec arrays.

Only kernel callers can provide an iov_iter so it doesn't make a lot of
sense to expose the IOCB_CMD values for this as part of the user space
ABI.

But kernel callers should also be able to perform the usual aio
operations, which suggests reusing the existing operation namespace and
support code.
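
Putting this together with the previous patch, a kernel caller holding
pages in a bio_vec array would submit roughly as sketched below. This is
illustrative only: iov_iter_init_bvec comes from the bvec-support patch
elsewhere in the series (the exact name may differ), bvec_length() from
the bio helper patch, and kio_complete() is a placeholder. The iter must
stay live until the completion callback runs, so a real caller embeds it
in a per-request structure rather than putting it on the stack.

#include <linux/aio.h>
#include <linux/bio.h>
#include <linux/completion.h>

static void kio_complete(u64 user_data, long res)
{
	complete((struct completion *)(uintptr_t)user_data);
}

static int read_bvec_async(struct file *filp, struct bio_vec *bvec,
			   unsigned long nr_segs, loff_t pos,
			   struct iov_iter *iter, struct completion *done)
{
	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

	if (!iocb)
		return -ENOMEM;

	iov_iter_init_bvec(iter, bvec, nr_segs,
			   bvec_length(bvec, nr_segs), 0);
	aio_kernel_init_iter(iocb, filp, IOCB_CMD_READ_ITER, iter, pos);
	aio_kernel_init_callback(iocb, kio_complete,
				 (u64)(uintptr_t)done);

	return aio_kernel_submit(iocb);
}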

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 3 ++
include/uapi/linux/aio_abi.h | 2 ++
3 files changed, 72 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index c65ba13..0da82c0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -991,6 +991,48 @@ static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
return 0;
}

+static ssize_t aio_read_iter(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret;
+
+ if (unlikely(!is_kernel_kiocb(iocb)))
+ return -EINVAL;
+
+ if (unlikely(!(file->f_mode & FMODE_READ)))
+ return -EBADF;
+
+ ret = security_file_permission(file, MAY_READ);
+ if (unlikely(ret))
+ return ret;
+
+ if (!file->f_op->read_iter)
+ return -EINVAL;
+
+ return file->f_op->read_iter(iocb, iocb->ki_iter, iocb->ki_pos);
+}
+
+static ssize_t aio_write_iter(struct kiocb *iocb)
+{
+ struct file *file = iocb->ki_filp;
+ ssize_t ret;
+
+ if (unlikely(!is_kernel_kiocb(iocb)))
+ return -EINVAL;
+
+ if (unlikely(!(file->f_mode & FMODE_WRITE)))
+ return -EBADF;
+
+ ret = security_file_permission(file, MAY_WRITE);
+ if (unlikely(ret))
+ return ret;
+
+ if (!file->f_op->write_iter)
+ return -EINVAL;
+
+ return file->f_op->write_iter(iocb, iocb->ki_iter, iocb->ki_pos);
+}
+
/*
* aio_setup_iocb:
* Performs the initial checks and aio retry method
@@ -1042,6 +1084,14 @@ rw_common:
ret = aio_rw_vect_retry(req, rw, rw_op);
break;

+ case IOCB_CMD_READ_ITER:
+ ret = aio_read_iter(req);
+ break;
+
+ case IOCB_CMD_WRITE_ITER:
+ ret = aio_write_iter(req);
+ break;
+
case IOCB_CMD_FDSYNC:
if (!file->f_op->aio_fsync)
return -EINVAL;
@@ -1116,6 +1166,23 @@ void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
}
EXPORT_SYMBOL_GPL(aio_kernel_init_rw);

+/*
+ * The iter count must be set before calling here. Some filesystems use
+ * iocb->ki_left as an indicator of the size of an IO.
+ */
+void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
+ unsigned short op, struct iov_iter *iter, loff_t off)
+{
+ iocb->ki_filp = filp;
+ iocb->ki_iter = iter;
+ iocb->ki_opcode = op;
+ iocb->ki_pos = off;
+ iocb->ki_nbytes = iov_iter_count(iter);
+ iocb->ki_left = iocb->ki_nbytes;
+ iocb->ki_ctx = (void *)-1;
+}
+EXPORT_SYMBOL_GPL(aio_kernel_init_iter);
+
void aio_kernel_init_callback(struct kiocb *iocb,
void (*complete)(u64 user_data, long res),
u64 user_data)
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 014a75d..64d059d 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -66,6 +66,7 @@ struct kiocb {
* this is the underlying eventfd context to deliver events to.
*/
struct eventfd_ctx *ki_eventfd;
+ struct iov_iter *ki_iter;
};

static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -102,6 +103,8 @@ struct kiocb *aio_kernel_alloc(gfp_t gfp);
void aio_kernel_free(struct kiocb *iocb);
void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
unsigned short op, void *ptr, size_t nr, loff_t off);
+void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
+ unsigned short op, struct iov_iter *iter, loff_t off);
void aio_kernel_init_callback(struct kiocb *iocb,
void (*complete)(u64 user_data, long res),
u64 user_data);
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..22ce4bd 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_READ_ITER = 9,
+ IOCB_CMD_WRITE_ITER = 10,
};

/*
--
1.8.3.4

2013-07-25 18:11:20

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 12/33] dio: add bio_vec support to __blockdev_direct_IO()

The trick here is to initialize the dio state so that do_direct_IO()
consumes the pages we provide and never tries to map user pages. This
is done by making sure that final_block_in_request covers the page that
we set in the dio. do_direct_IO() will return before running out of
pages.

The caller is responsible for dirtying these pages, if needed. We add
a should_dirty flag to the dio struct to make sure we only dirty pages
when we're operating on iovecs of user addresses.
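
Concretely, each bio_vec segment pins exactly one page and bounds the
request so the page-fetch loop never calls get_user_pages(). For a full
4KB page with 512-byte blocks (blkbits == 9), the per-segment state set
up in direct_IO_bvec() below works out to:

/* one 4KB bvec segment, blkbits == 9 */
sdio->first_block_in_page = bvec->bv_offset >> 9;	/* 0 */
sdio->final_block_in_request = sdio->block_in_file +
			       (bvec->bv_len >> 9);	/* 8 blocks ahead */
page_cache_get(bvec->bv_page);
dio->pages[0] = bvec->bv_page;	/* pre-loaded: head = 0, tail = 1 */
sdio->total_pages = 1;		/* do_direct_IO() stops after this page */

With should_dirty left clear, the completion path also skips the
page-dirtying that only makes sense for get_user_pages()-pinned user
memory.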

Signed-off-by: Dave Kleikamp <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/direct-io.c | 206 +++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 148 insertions(+), 58 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index a81366c..75a3989 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -127,6 +127,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
int page_errors; /* errno from get_user_pages() */
int is_async; /* is IO async ? */
+ int should_dirty; /* should we mark read pages dirty? */
int io_error; /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
struct bio *bio_list; /* singly linked via bi_private */
@@ -377,7 +378,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);

- if (dio->is_async && dio->rw == READ)
+ if (dio->is_async && dio->rw == READ && dio->should_dirty)
bio_set_pages_dirty(bio);

if (sdio->submit_io)
@@ -448,13 +449,14 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
if (!uptodate)
dio->io_error = -EIO;

- if (dio->is_async && dio->rw == READ) {
+ if (dio->is_async && dio->rw == READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;

- if (dio->rw == READ && !PageCompound(page))
+ if (dio->rw == READ && !PageCompound(page) &&
+ dio->should_dirty)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1016,6 +1018,101 @@ static inline int drop_refcount(struct dio *dio)
return ret2;
}

+static ssize_t direct_IO_iovec(const struct iovec *iov, unsigned long nr_segs,
+ struct dio *dio, struct dio_submit *sdio,
+ unsigned blkbits, struct buffer_head *map_bh)
+{
+ size_t bytes;
+ ssize_t retval = 0;
+ int seg;
+ unsigned long user_addr;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ sdio->pages_in_io +=
+ ((user_addr + iov[seg].iov_len + PAGE_SIZE-1) /
+ PAGE_SIZE - user_addr / PAGE_SIZE);
+ }
+
+ dio->should_dirty = 1;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ user_addr = (unsigned long)iov[seg].iov_base;
+ sdio->size += bytes = iov[seg].iov_len;
+
+ /* Index into the first page of the first block */
+ sdio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
+ sdio->final_block_in_request = sdio->block_in_file +
+ (bytes >> blkbits);
+ /* Page fetching state */
+ sdio->head = 0;
+ sdio->tail = 0;
+ sdio->curr_page = 0;
+
+ sdio->total_pages = 0;
+ if (user_addr & (PAGE_SIZE-1)) {
+ sdio->total_pages++;
+ bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
+ }
+ sdio->total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+ sdio->curr_user_address = user_addr;
+
+ retval = do_direct_IO(dio, sdio, map_bh);
+
+ dio->result += iov[seg].iov_len -
+ ((sdio->final_block_in_request - sdio->block_in_file) <<
+ blkbits);
+
+ if (retval) {
+ dio_cleanup(dio, sdio);
+ break;
+ }
+ } /* end iovec loop */
+
+ return retval;
+}
+
+static ssize_t direct_IO_bvec(struct bio_vec *bvec, unsigned long nr_segs,
+ struct dio *dio, struct dio_submit *sdio,
+ unsigned blkbits, struct buffer_head *map_bh)
+{
+ ssize_t retval = 0;
+ int seg;
+
+ sdio->pages_in_io += nr_segs;
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ sdio->size += bvec[seg].bv_len;
+
+ /* Index into the first page of the first block */
+ sdio->first_block_in_page = bvec[seg].bv_offset >> blkbits;
+ sdio->final_block_in_request = sdio->block_in_file +
+ (bvec[seg].bv_len >> blkbits);
+ /* Page fetching state */
+ sdio->curr_page = 0;
+ page_cache_get(bvec[seg].bv_page);
+ dio->pages[0] = bvec[seg].bv_page;
+ sdio->head = 0;
+ sdio->tail = 1;
+
+ sdio->total_pages = 1;
+ sdio->curr_user_address = 0;
+
+ retval = do_direct_IO(dio, sdio, map_bh);
+
+ dio->result += bvec[seg].bv_len -
+ ((sdio->final_block_in_request - sdio->block_in_file) <<
+ blkbits);
+
+ if (retval) {
+ dio_cleanup(dio, sdio);
+ break;
+ }
+ }
+
+ return retval;
+}
+
/*
* This is a library function for use by filesystem drivers.
*
@@ -1057,11 +1154,8 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
loff_t end = offset;
struct dio *dio;
struct dio_submit sdio = { 0, };
- unsigned long user_addr;
- size_t bytes;
struct buffer_head map_bh = { 0, };
struct blk_plug plug;
- const struct iovec *iov = iov_iter_iovec(iter);
unsigned long nr_segs = iter->nr_segs;

if (rw & WRITE)
@@ -1081,20 +1175,49 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
}

/* Check the memory alignment. Blocks cannot straddle pages */
- for (seg = 0; seg < nr_segs; seg++) {
- addr = (unsigned long)iov[seg].iov_base;
- size = iov[seg].iov_len;
- end += size;
- if (unlikely((addr & blocksize_mask) ||
- (size & blocksize_mask))) {
- if (bdev)
- blkbits = blksize_bits(
- bdev_logical_block_size(bdev));
- blocksize_mask = (1 << blkbits) - 1;
- if ((addr & blocksize_mask) || (size & blocksize_mask))
- goto out;
+ if (iov_iter_has_iovec(iter)) {
+ const struct iovec *iov = iov_iter_iovec(iter);
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ addr = (unsigned long)iov[seg].iov_base;
+ size = iov[seg].iov_len;
+ end += size;
+ if (unlikely((addr & blocksize_mask) ||
+ (size & blocksize_mask))) {
+ if (bdev)
+ blkbits = blksize_bits(
+ bdev_logical_block_size(bdev));
+ blocksize_mask = (1 << blkbits) - 1;
+ if ((addr & blocksize_mask) ||
+ (size & blocksize_mask))
+ goto out;
+ }
}
- }
+ } else if (iov_iter_has_bvec(iter)) {
+ /*
+ * Is this necessary, or can we trust the in-kernel
+ * caller? Can we replace this with
+ * end += iov_iter_count(iter); ?
+ */
+ struct bio_vec *bvec = iov_iter_bvec(iter);
+
+ for (seg = 0; seg < nr_segs; seg++) {
+ addr = bvec[seg].bv_offset;
+ size = bvec[seg].bv_len;
+ end += size;
+ if (unlikely((addr & blocksize_mask) ||
+ (size & blocksize_mask))) {
+ if (bdev)
+ blkbits = blksize_bits(
+ bdev_logical_block_size(bdev));
+ blocksize_mask = (1 << blkbits) - 1;
+ if ((addr & blocksize_mask) ||
+ (size & blocksize_mask))
+ goto out;
+ }
+ }
+ } else
+ BUG();

/* watch out for a 0 len io from a tricksy fs */
if (rw == READ && end == offset)
@@ -1171,47 +1294,14 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
if (unlikely(sdio.blkfactor))
sdio.pages_in_io = 2;

- for (seg = 0; seg < nr_segs; seg++) {
- user_addr = (unsigned long)iov[seg].iov_base;
- sdio.pages_in_io +=
- ((user_addr + iov[seg].iov_len + PAGE_SIZE-1) /
- PAGE_SIZE - user_addr / PAGE_SIZE);
- }
-
blk_start_plug(&plug);

- for (seg = 0; seg < nr_segs; seg++) {
- user_addr = (unsigned long)iov[seg].iov_base;
- sdio.size += bytes = iov[seg].iov_len;
-
- /* Index into the first page of the first block */
- sdio.first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits;
- sdio.final_block_in_request = sdio.block_in_file +
- (bytes >> blkbits);
- /* Page fetching state */
- sdio.head = 0;
- sdio.tail = 0;
- sdio.curr_page = 0;
-
- sdio.total_pages = 0;
- if (user_addr & (PAGE_SIZE-1)) {
- sdio.total_pages++;
- bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
- }
- sdio.total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
- sdio.curr_user_address = user_addr;
-
- retval = do_direct_IO(dio, &sdio, &map_bh);
-
- dio->result += iov[seg].iov_len -
- ((sdio.final_block_in_request - sdio.block_in_file) <<
- blkbits);
-
- if (retval) {
- dio_cleanup(dio, &sdio);
- break;
- }
- } /* end iovec loop */
+ if (iov_iter_has_iovec(iter))
+ retval = direct_IO_iovec(iov_iter_iovec(iter), nr_segs, dio,
+ &sdio, blkbits, &map_bh);
+ else
+ retval = direct_IO_bvec(iov_iter_bvec(iter), nr_segs, dio,
+ &sdio, blkbits, &map_bh);

if (retval == -ENOTBLK) {
/*
--
1.8.3.4

2013-07-25 18:11:24

by Dave Kleikamp

[permalink] [raw]
Subject: [PATCH V8 07/33] iov_iter: ii_iovec_copy_to_user should pre-fault user pages

This duplicates the optimization in file_read_actor, since a later patch
will replace that code with a call to __iov_iter_copy_to_user().
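
The shape of the optimization, condensed from the hunk below: fault the
destination in while sleeping is still legal, attempt the copy under
kmap_atomic(), and fall back to the sleeping kmap()/copy_to_user() path
only if the atomic copy could not complete.

	if (!fault_in_pages_writeable(buf, bytes)) {
		kaddr = kmap_atomic(page);
		left = __copy_to_user_inatomic(buf, kaddr + offset, bytes);
		kunmap_atomic(kaddr);
		if (left == 0)
			goto success;	/* common case: nothing faulted */
	}
	kaddr = kmap(page);	/* slow path: copy_to_user() may sleep */
	left = copy_to_user(buf, kaddr + offset, bytes);
	kunmap(page);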

Signed-off-by: Dave Kleikamp <[email protected]>
---
fs/iov-iter.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/iov-iter.c b/fs/iov-iter.c
index 6cb6be0..59f9556 100644
--- a/fs/iov-iter.c
+++ b/fs/iov-iter.c
@@ -80,17 +80,32 @@ static size_t ii_iovec_copy_to_user(struct page *page,
return 0;
}

- kaddr = kmap(page);
if (likely(i->nr_segs == 1)) {
int left;
char __user *buf = iov->iov_base + i->iov_offset;
+ /*
+ * Faults on the destination of a read are common, so do it
+ * before taking the kmap.
+ */
+ if (!fault_in_pages_writeable(buf, bytes)) {
+ kaddr = kmap_atomic(page);
+ left = __copy_to_user_inatomic(buf, kaddr + offset,
+ bytes);
+ kunmap_atomic(kaddr);
+ if (left == 0)
+ goto success;
+ }
+ kaddr = kmap(page);
left = copy_to_user(buf, kaddr + offset, bytes);
+ kunmap(page);
+success:
copied = bytes - left;
} else {
+ kaddr = kmap(page);
copied = __iovec_copy_to_user(kaddr + offset, iov,
i->iov_offset, bytes, 0);
+ kunmap(page);
}
- kunmap(page);
return copied;
}

--
1.8.3.4

2013-07-25 21:34:43

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH V8 29/33] udf: convert file ops from aio_read/write to read/write_iter

On Thu 25-07-13 12:50:55, Dave Kleikamp wrote:
> Signed-off-by: Dave Kleikamp <[email protected]>
> Cc: Jan Kara <[email protected]>
Looks good. You can add:
Acked-by: Jan Kara <[email protected]>

Honza
> ---
> fs/udf/file.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/fs/udf/file.c b/fs/udf/file.c
> index 339df8b..e392d60 100644
> --- a/fs/udf/file.c
> +++ b/fs/udf/file.c
> @@ -133,8 +133,8 @@ const struct address_space_operations udf_adinicb_aops = {
> .direct_IO = udf_adinicb_direct_IO,
> };
>
> -static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> - unsigned long nr_segs, loff_t ppos)
> +static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *iter,
> + loff_t ppos)
> {
> ssize_t retval;
> struct file *file = iocb->ki_filp;
> @@ -168,7 +168,7 @@ static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> } else
> up_write(&iinfo->i_data_sem);
>
> - retval = generic_file_aio_write(iocb, iov, nr_segs, ppos);
> + retval = generic_file_write_iter(iocb, iter, ppos);
> if (retval > 0)
> mark_inode_dirty(inode);
>
> @@ -242,12 +242,12 @@ static int udf_release_file(struct inode *inode, struct file *filp)
>
> const struct file_operations udf_file_operations = {
> .read = do_sync_read,
> - .aio_read = generic_file_aio_read,
> + .read_iter = generic_file_read_iter,
> .unlocked_ioctl = udf_ioctl,
> .open = generic_file_open,
> .mmap = generic_file_mmap,
> .write = do_sync_write,
> - .aio_write = udf_file_aio_write,
> + .write_iter = udf_file_write_iter,
> .release = udf_release_file,
> .fsync = generic_file_fsync,
> .splice_read = generic_file_splice_read,
> --
> 1.8.3.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-26 11:52:02

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V8 27/33] xfs: add support for read_iter and write_iter

On Thu, Jul 25, 2013 at 12:50:53PM -0500, Dave Kleikamp wrote:
> Signed-off-by: Dave Kleikamp <[email protected]>
> Cc: Ben Myers <[email protected]>
> Cc: Alex Elder <[email protected]>
> Cc: [email protected]

Looks fine.

Acked-by: Dave Chinner <[email protected]>

--
Dave Chinner
[email protected]

2013-07-30 21:28:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:

> This patch series adds a kernel interface to fs/aio.c so that kernel code can
> issue concurrent asynchronous IO to file systems. It adds an aio command and
> file system methods which specify io memory with pages instead of userspace
> addresses.
>
> This series was written to reduce the current overhead loop imposes by
> performing synchronus buffered file system IO from a kernel thread. These
> patches turn loop into a light weight layer that translates bios into iocbs.

Do you have any performance numbers?

Does anyone care much about loop performance? What's the value here?

2013-07-31 00:43:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Tue, Jul 30, 2013 at 02:28:20PM -0700, Andrew Morton wrote:
> On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:
>
> > This patch series adds a kernel interface to fs/aio.c so that kernel code can
> > issue concurrent asynchronous IO to file systems. It adds an aio command and
> > file system methods which specify io memory with pages instead of userspace
> > addresses.
> >
> > This series was written to reduce the current overhead loop imposes by
> > performing synchronus buffered file system IO from a kernel thread. These
> > patches turn loop into a light weight layer that translates bios into iocbs.
>
> Do you have any performance numbers?
>
> Does anyone care much about loop performance? What's the value here?

Yes. Anyone using loopback devices for file-backed devices exposed
to containers and VMs cares about the memory and CPU overhead of
the double caching that the existing loop device does.

Or those of us who use loopback devices to examine metadumps from
broken filesystems and run repair/fsck on the filesystem image via
loopback devices. When I'm dealing with images containing tens to
hundreds of gigabytes of metadata, caching a second time in the
backing file is a significant overhead. It's especially annoying
when the application is already using direct IO, because the
kernel-based block device caching isn't at all efficient and just
consumes a needless amount of memory holding on to pages that are
never going to be read again.

And on small memory machines xfstests can trigger OOM killer when it
uses loopback devices in certain tests. That's generally caused by
the writeback of a dirty page causing a new page to be allocated and
dirtied and so writeback of dirty memory under memory pressure can
a) increase memory usage, and b) increase the percentage of dirty
memory that can't be reclaimed immediately....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-31 06:40:20

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Jul 31, 2013 at 2:43 AM, Dave Chinner <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 02:28:20PM -0700, Andrew Morton wrote:
>> On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:
>>
>> > This patch series adds a kernel interface to fs/aio.c so that kernel code can
>> > issue concurrent asynchronous IO to file systems. It adds an aio command and
>> > file system methods which specify io memory with pages instead of userspace
>> > addresses.
>> >
>> > This series was written to reduce the current overhead loop imposes by
>> > performing synchronus buffered file system IO from a kernel thread. These
>> > patches turn loop into a light weight layer that translates bios into iocbs.
>>
>> Do you have any performance numbers?
>>

[ CC Al and Linux-next maintainer ]

The more important question is how to test and then provide performance numbers.
If you give me a test-case I give you numbers!

>> Does anyone care much about loop performance? What's the value here?
>
> Yes. Anyone using loopback devices for file-backed devices exposed
> to containers and VMs cares about the memory and CPU overhead
> the double caching the existing loop device has.
>

Yupp, I am here on Ubuntu/precise AMD64 in a so-called WUBI
environment which makes intensive usage of loopback-device plus FUSE
driver and $fs-of-your-choice (here: ext4).

Today, I have pulled Dave's aio_loop GIT branch into v3.11-rc3.
After successful compilation I am running it right now.

I had also tested v6 of the series [1] from February 2013 and
encouraged Dave to put it into Linux-next [2].
Unfortunately, there was no response from Al.
Again, Dave, try to get it into Linux-next!

- Sedat -

[1] http://marc.info/?t=135947707100013&r=1&w=4
[2] http://marc.info/?l=linux-fsdevel&m=136122569807203&w=4

2013-07-31 08:41:35

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Jul 31, 2013 at 8:40 AM, Sedat Dilek <[email protected]> wrote:
> On Wed, Jul 31, 2013 at 2:43 AM, Dave Chinner <[email protected]> wrote:
>> On Tue, Jul 30, 2013 at 02:28:20PM -0700, Andrew Morton wrote:
>>> On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:
>>>
>>> > This patch series adds a kernel interface to fs/aio.c so that kernel code can
>>> > issue concurrent asynchronous IO to file systems. It adds an aio command and
>>> > file system methods which specify io memory with pages instead of userspace
>>> > addresses.
>>> >
>>> > This series was written to reduce the current overhead loop imposes by
>>> > performing synchronus buffered file system IO from a kernel thread. These
>>> > patches turn loop into a light weight layer that translates bios into iocbs.
>>>
>>> Do you have any performance numbers?
>>>
>
> [ CC Al and Linux-next maintainer ]
>
> The more important question is how to test and then provide performance numbers.
> If you give me a test-case I give you numbers!
>
>>> Does anyone care much about loop performance? What's the value here?
>>
>> Yes. Anyone using loopback devices for file-backed devices exposed
>> to containers and VMs cares about the memory and CPU overhead
>> the double caching the existing loop device has.
>>
>
> Yupp, I am here on Ubuntu/precise AMD64 in a so-called WUBI
> environment which makes intensive usage of loopback-device plus FUSE
> driver and $fs-of-your-choice (here: ext4).
>
> Today, I have pulled Dave's aio_loop GIT branch into v3.11-rc3.
> After successful compilation I am running it right now.
>
> I had also tested v6 of the series [1] from February 2013 and
> encouraged Dave to put it into Linux-next [2].
> Unfortunately, there was no response from Al.
> Again, Dave, try to get it into Linux-next!
>

I have run runltplite from the latest stable LTP (ltp-full-20130109), but
it reports errors.
I will see later if this happens with a vanilla v3.11-rc3.

See also attached files.

- Sedat -

> - Sedat -
>
> [1] http://marc.info/?t=135947707100013&r=1&w=4
> [2] http://marc.info/?l=linux-fsdevel&m=136122569807203&w=4


Attachments:
dmesg_3.11.0-rc3-1-aio-small.txt (82.92 kB)
config-3.11.0-rc3-1-aio-small (111.95 kB)
runltplite_3.11.0-rc3-1-aio-small.txt.gz (43.50 kB)

2013-07-31 09:51:41

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On 07/31/2013 01:28 AM, Andrew Morton wrote:
> On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:
>
>> This patch series adds a kernel interface to fs/aio.c so that kernel code can
>> issue concurrent asynchronous IO to file systems. It adds an aio command and
>> file system methods which specify io memory with pages instead of userspace
>> addresses.
>>
>> This series was written to reduce the current overhead loop imposes by
>> performing synchronus buffered file system IO from a kernel thread. These
>> patches turn loop into a light weight layer that translates bios into iocbs.
> Do you have any performance numbers?
>
> Does anyone care much about loop performance? What's the value here?
>
>

OpenVZ uses loopback devices to keep per-container filesystems. We care
a lot about the overhead introduced by loop: IO-bound applications running
on top of a per-container filesystem shouldn't perform worse than on top
of the host filesystem. So the value for us is zero overhead.

Thanks,
Maxim

2013-07-31 11:22:25

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Jul 31, 2013 at 10:41 AM, Sedat Dilek <[email protected]> wrote:
> On Wed, Jul 31, 2013 at 8:40 AM, Sedat Dilek <[email protected]> wrote:
>> On Wed, Jul 31, 2013 at 2:43 AM, Dave Chinner <[email protected]> wrote:
>>> On Tue, Jul 30, 2013 at 02:28:20PM -0700, Andrew Morton wrote:
>>>> On Thu, 25 Jul 2013 12:50:26 -0500 Dave Kleikamp <[email protected]> wrote:
>>>>
>>>> > This patch series adds a kernel interface to fs/aio.c so that kernel code can
>>>> > issue concurrent asynchronous IO to file systems. It adds an aio command and
>>>> > file system methods which specify io memory with pages instead of userspace
>>>> > addresses.
>>>> >
>>>> > This series was written to reduce the current overhead loop imposes by
>>>> > performing synchronus buffered file system IO from a kernel thread. These
>>>> > patches turn loop into a light weight layer that translates bios into iocbs.
>>>>
>>>> Do you have any performance numbers?
>>>>
>>
>> [ CC Al and Linux-next maintainer ]
>>
>> The more important question is how to test and then provide performance numbers.
>> If you give me a test-case I give you numbers!
>>
>>>> Does anyone care much about loop performance? What's the value here?
>>>
>>> Yes. Anyone using loopback devices for file-backed devices exposed
>>> to containers and VMs cares about the memory and CPU overhead
>>> the double caching the existing loop device has.
>>>
>>
>> Yupp, I am here on Ubuntu/precise AMD64 in a so-called WUBI
>> environment which makes intensive usage of loopback-device plus FUSE
>> driver and $fs-of-your-choice (here: ext4).
>>
>> Today, I have pulled Dave's aio_loop GIT branch into v3.11-rc3.
>> After successful compilation I am running it right now.
>>
>> I had also tested v6 of the series [1] from February 2013 and
>> encouraged Dave to put it into Linux-next [2].
>> Unfortunately, there was no response from Al.
>> Again, Dave, try to get it into Linux-next!
>>
>
> I have run runltplite from the latest stable LTP (ltp-full-20130109), but
> it reports errors.
> I will see later if this happens with a vanilla v3.11-rc3.
>

These results look similar, so the aio_loop stuff seems to be OK.

- Sedat -

> See also attached files.
>
> - Sedat -
>
>> - Sedat -
>>
>> [1] http://marc.info/?t=135947707100013&r=1&w=4
>> [2] http://marc.info/?l=linux-fsdevel&m=136122569807203&w=4


Attachments:
runltplite_3.11.0-rc3-1-iniza-small.txt.gz (43.49 kB)

2013-08-01 08:58:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

What should be added to this support is to move the swap over nfs code
over to this interface instead of the utterly bogus
KERNEL_READ/KERNEL_WRITE hacks that were added for it.

2013-08-01 13:05:03

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On 08/01/2013 03:58 AM, Christoph Hellwig wrote:
> What should be added to this support is to move the swap over nfs code
> over to this interface instead of the utterly bogus
> KERNEL_READ/KERNEL_WRITE hacks that were added for it.

That's patch 24/33 nfs: simplify swap

2013-08-02 10:48:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Thu, Aug 01, 2013 at 08:04:24AM -0500, Dave Kleikamp wrote:
> On 08/01/2013 03:58 AM, Christoph Hellwig wrote:
> > What should be added to this support is to move the swap over nfs code
> > over to this interface instead of the utterly bogus
> > KERNEL_READ/KERNEL_WRITE hacks that were added for it.
>
> That's patch 24/33 nfs: simplify swap

Sorry, missed it somehow. Thanks for doing this work, and removing that
wart alone is absolutely worth merging this patchset.

For the future we should look into enabling this for swap in general
and removing the slightly less kludgy mechanism currently used for
swapfiles.

Especially btrfs should benefit greatly from that.

2013-08-20 13:00:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

As I've seen very few replies to this: how do we ensure this gets
picked up for the 3.12 merge window? The series has been reposted
a few times without complaints or major changes, but the ball still
doesn't seem to get rolling.

I'd really like to do some ecryptfs and scsi target work that is going
to rely on this soon.

2013-08-20 19:13:57

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

Stephen,
Would you be willing to pick up
git://github.com/kleikamp/linux-shaggy.git for-next
into linux-next?

There will be some unclean merges, and I can send you updated patches
created against your latest tree. I'm not exactly sure of your process
wrt cleaning up merges, but I guess they would help.

Thanks,
Shaggy

On 08/20/2013 08:00 AM, Christoph Hellwig wrote:
> As I've seen very few replies to this: how do we ensure this gets
> picked up for the 3.12 merge window? The series has been a reposted
> a few times without complaints or major changes, but the ball still
> doesn't seem to get rolling.
>
> I'd really like to do some ecryptfs and scsi target work that is going
> to rely on this soon.
>

2013-08-20 22:46:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Tue, 20 Aug 2013 06:00:56 -0700 Christoph Hellwig <[email protected]> wrote:

> As I've seen very few replies to this: how do we ensure this gets
> picked up for the 3.12 merge window? The series has been a reposted
> a few times without complaints or major changes,

That's probably a sign that nobody bothered reading it all :( The
smattering of acks in there is not encouraging.

Please add Sedat's Tested-by (thanks!)

Please add performance test results.

Mel, do you have any swap-over-nfs test cases which should be performed?

Dave, what sort of correctness/robustness tests have you been running?

Yes, I guess it's not a bad idea to get this into -next, but it does
seem to have been pretty low-profile...

2013-08-21 00:14:19

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

Hi Dave,

On Tue, 20 Aug 2013 14:13:15 -0500 Dave Kleikamp <[email protected]> wrote:
>
> Would you be willing to pick up
> git://github.com/kleikamp/linux-shaggy.git for-next
> into linux-next?

I have added that from today.

> There will be some unclean merges, and I can send you updated patches
> created against your latest tree. I'm not exactly sure of your process
> wrt cleaning up merges, but I guess they would help.

Since I will merge your tree into linux-next, I only need to fix merge
conflicts, so while those patches can be a guide, they are mostly not
needed.

> On 08/20/2013 08:00 AM, Christoph Hellwig wrote:
> > As I've seen very few replies to this: how do we ensure this gets
> > picked up for the 3.12 merge window? The series has been a reposted
> > a few times without complaints or major changes, but the ball still
> > doesn't seem to get rolling.
> >
> > I'd really like to do some ecryptfs and scsi target work that is going
> > to rely on this soon.

If this happens, then it is important that your (Dave's) tree is not
rebased/rewritten and that any other tree that depend on it merges your
tree.

I will merge your tree relatively early, that way the merge conflicts will
be spread over several other merges and hopefully each be fairly minor.

I have called your tree "aio-direct"; please let me know if you think
there is a better name.

Thanks for adding your subsystem tree as a participant of linux-next. As
you may know, this is not a judgment of your code. The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window.

You will need to ensure that the patches/commits in your tree/series have
been:
* submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
* posted to the relevant mailing list,
* reviewed by you (or another maintainer of your subsystem tree),
* successfully unit tested, and
* destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch). It is allowed to be rebased if you deem it necessary.

--
Cheers,
Stephen Rothwell
[email protected]

Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees. You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next. These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc. The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc. If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.



2013-08-21 05:35:43

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Aug 21, 2013 at 2:14 AM, Stephen Rothwell <[email protected]> wrote:
> Hi Dave,
>
> On Tue, 20 Aug 2013 14:13:15 -0500 Dave Kleikamp <[email protected]> wrote:
>>
>> Would you be willing to pick up
>> git://github.com/kleikamp/linux-shaggy.git for-next
>> into linux-next?
>
> I have added that from today.
>
>> There will be some unclean merges, and I can send you updated patches
>> created against your latest tree. I'm not exactly sure of your process
>> wrt cleaning up merges, but I guess they would help.
>
> Since I will merge your tree into linux-next, I only need to fix merge
> conflicts, so while those patches can be a guide, they are mostly not
> needed.
>
>> On 08/20/2013 08:00 AM, Christoph Hellwig wrote:
>> > As I've seen very few replies to this: how do we ensure this gets
>> > picked up for the 3.12 merge window? The series has been a reposted
>> > a few times without complaints or major changes, but the ball still
>> > doesn't seem to get rolling.
>> >
>> > I'd really like to do some ecryptfs and scsi target work that is going
>> > to rely on this soon.
>
> If this happens, then it is important that your (Dave's) tree is not
> rebased/rewritten and that any other tree that depend on it merges your
> tree.
>
> I will merge your tree relatively early, that way the merge conflicts will
> be spread over several other merges and hopefully each be fairly minor.
>
> I have called your tree "aio-direct"; please let me know if you think
> there is a better name.
>

Cool to see Dave's work in Linux-next!

Dave named his GIT branch "aio_loop" [1].
( If you (Stephen) prefer a hyphen for your GIT branches, use
"aio-loop", but that's up to Dave. )

Anyway, I am glad to see this getting pushed.

( Hmmm, OverlayFS ... )

- Sedat -

[1] https://github.com/kleikamp/linux-shaggy/tree/aio_loop

> Thanks for adding your subsystem tree as a participant of linux-next. As
> you may know, this is not a judgment of your code. The purpose of
> linux-next is for integration testing and to lower the impact of
> conflicts between subsystems in the next merge window.
>
> You will need to ensure that the patches/commits in your tree/series have
> been:
> * submitted under GPL v2 (or later) and include the Contributor's
> Signed-off-by,
> * posted to the relevant mailing list,
> * reviewed by you (or another maintainer of your subsystem tree),
> * successfully unit tested, and
> * destined for the current or next Linux merge window.
>
> Basically, this should be just what you would send to Linus (or ask him
> to fetch). It is allowed to be rebased if you deem it necessary.
>
> --
> Cheers,
> Stephen Rothwell
> [email protected]
>
> Legal Stuff:
> By participating in linux-next, your subsystem tree contributions are
> public and will be included in the linux-next trees. You may be sent
> e-mail messages indicating errors or other issues when the
> patches/commits from your subsystem tree are merged and tested in
> linux-next. These messages may also be cross-posted to the linux-next
> mailing list, the linux-kernel mailing list, etc. The linux-next tree
> project and IBM (my employer) make no warranties regarding the linux-next
> project, the testing procedures, the results, the e-mails, etc. If you
> don't agree to these ground rules, let me know and I'll remove your tree
> from participation in linux-next.

2013-08-21 13:02:35

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

Hello Dave,

On Thu, Jul 25, 2013 at 12:50:26PM -0500, Dave Kleikamp wrote:
> This patch series adds a kernel interface to fs/aio.c so that kernel code can
> issue concurrent asynchronous IO to file systems. It adds an aio command and
> file system methods which specify io memory with pages instead of userspace
> addresses.

First off, have you tested that this series actually works when merged with
the pending AIO changes from Kent? There is a git tree with those pending
changes at git://git.kvack.org/~bcrl/aio-next.git, and they're in
linux-next.

One of the major problems your changeset continues to carry is that your
new read_iter/write_iter operations permit blocking (implicitly), which
really isn't what we want for aio. If you're going to introduce a new api,
it should be made non-blocking, and enforce that non-blocking requirement
(ie warn when read_iter/write_iter methods perform blocking operations,
similar to the warnings when scheduling in atomic mode). This means more
changes for some filesystem code involved, something that people have been
avoiding for years, but which really needs to be done.
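
A hypothetical sketch of the enforcement being asked for, modeled on the
scheduling-while-atomic warnings; the aio_nonblock_*() helpers are
invented purely for illustration and exist nowhere in the tree:

	/* invented names: bracket the method call so any schedule()
	 * inside ->read_iter() would warn, much like scheduling inside
	 * an atomic section does today */
	aio_nonblock_start();
	ret = file->f_op->read_iter(iocb, iter, pos);
	aio_nonblock_end();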

-ben

> This series was written to reduce the current overhead loop imposes by
> performing synchronus buffered file system IO from a kernel thread. These
> patches turn loop into a light weight layer that translates bios into iocbs.
>
> It introduces new file ops, read_iter() and write_iter(), that replace the
> aio_read() and aio_write() operations. The iov_iter structure can now contain
> either a user-space iovec or a kernel-space bio_vec. Since it would be
> overly complicated to replace every instance of aio_read() and aio_write(),
> the old operations are not removed, but file systems implementing the new
> ones need not keep the old ones.
>
> Verion 8 is little changed from Version 7 that I send out in March, just
> updated to the latest kernel. These patches apply to 3.11-rc2 and can
> also be found at:
>
> git://github.com/kleikamp/linux-shaggy.git aio_loop
>
> Asias He (1):
> block_dev: add support for read_iter, write_iter
>
> Dave Kleikamp (22):
> iov_iter: iov_iter_copy_from_user() should use non-atomic copy
> iov_iter: add __iovec_copy_to_user()
> fuse: convert fuse to use iov_iter_copy_[to|from]_user
> iov_iter: ii_iovec_copy_to_user should pre-fault user pages
> dio: Convert direct_IO to use iov_iter
> dio: add bio_vec support to __blockdev_direct_IO()
> aio: add aio_kernel_() interface
> aio: add aio support for iov_iter arguments
> fs: create file_readable() and file_writable() functions
> fs: use read_iter and write_iter rather than aio_read and aio_write
> fs: add read_iter and write_iter to several file systems
> ocfs2: add support for read_iter and write_iter
> ext4: add support for read_iter and write_iter
> nfs: add support for read_iter, write_iter
> nfs: simplify swap
> btrfs: add support for read_iter and write_iter
> xfs: add support for read_iter and write_iter
> gfs2: Convert aio_read/write ops to read/write_iter
> udf: convert file ops from aio_read/write to read/write_iter
> afs: add support for read_iter and write_iter
> ecrpytfs: Convert aio_read/write ops to read/write_iter
> ubifs: convert file ops from aio_read/write to read/write_iter
>
> Hugh Dickins (1):
> tmpfs: add support for read_iter and write_iter
>
> Zach Brown (9):
> iov_iter: move into its own file
> iov_iter: add copy_to_user support
> iov_iter: hide iovec details behind ops function pointers
> iov_iter: add bvec support
> iov_iter: add a shorten call
> iov_iter: let callers extract iovecs and bio_vecs
> fs: pull iov_iter use higher up the stack
> bio: add bvec_length(), like iov_length()
> loop: use aio to perform io on the underlying file
>
> Documentation/filesystems/Locking | 6 +-
> Documentation/filesystems/vfs.txt | 12 +-
> drivers/block/loop.c | 148 ++++++++----
> drivers/char/raw.c | 4 +-
> drivers/mtd/nand/nandsim.c | 4 +-
> drivers/usb/gadget/storage_common.c | 4 +-
> fs/9p/vfs_addr.c | 12 +-
> fs/9p/vfs_file.c | 8 +-
> fs/Makefile | 2 +-
> fs/adfs/file.c | 4 +-
> fs/affs/file.c | 4 +-
> fs/afs/file.c | 4 +-
> fs/afs/internal.h | 3 +-
> fs/afs/write.c | 9 +-
> fs/aio.c | 152 ++++++++++++-
> fs/bad_inode.c | 14 ++
> fs/bfs/file.c | 4 +-
> fs/block_dev.c | 27 ++-
> fs/btrfs/file.c | 42 ++--
> fs/btrfs/inode.c | 63 +++---
> fs/ceph/addr.c | 3 +-
> fs/cifs/file.c | 4 +-
> fs/direct-io.c | 223 +++++++++++++------
> fs/ecryptfs/file.c | 15 +-
> fs/exofs/file.c | 4 +-
> fs/ext2/file.c | 4 +-
> fs/ext2/inode.c | 8 +-
> fs/ext3/file.c | 4 +-
> fs/ext3/inode.c | 15 +-
> fs/ext4/ext4.h | 3 +-
> fs/ext4/file.c | 34 +--
> fs/ext4/indirect.c | 16 +-
> fs/ext4/inode.c | 23 +-
> fs/f2fs/data.c | 4 +-
> fs/f2fs/file.c | 4 +-
> fs/fat/file.c | 4 +-
> fs/fat/inode.c | 10 +-
> fs/fuse/cuse.c | 10 +-
> fs/fuse/file.c | 90 ++++----
> fs/fuse/fuse_i.h | 5 +-
> fs/gfs2/aops.c | 7 +-
> fs/gfs2/file.c | 21 +-
> fs/hfs/inode.c | 11 +-
> fs/hfsplus/inode.c | 10 +-
> fs/hostfs/hostfs_kern.c | 4 +-
> fs/hpfs/file.c | 4 +-
> fs/internal.h | 4 +
> fs/iov-iter.c | 411 ++++++++++++++++++++++++++++++++++
> fs/jffs2/file.c | 8 +-
> fs/jfs/file.c | 4 +-
> fs/jfs/inode.c | 7 +-
> fs/logfs/file.c | 4 +-
> fs/minix/file.c | 4 +-
> fs/nfs/direct.c | 302 ++++++++++++++++---------
> fs/nfs/file.c | 33 ++-
> fs/nfs/internal.h | 4 +-
> fs/nfs/nfs4file.c | 4 +-
> fs/nilfs2/file.c | 4 +-
> fs/nilfs2/inode.c | 8 +-
> fs/ocfs2/aops.c | 8 +-
> fs/ocfs2/aops.h | 2 +-
> fs/ocfs2/file.c | 55 ++---
> fs/ocfs2/ocfs2_trace.h | 6 +-
> fs/omfs/file.c | 4 +-
> fs/ramfs/file-mmu.c | 4 +-
> fs/ramfs/file-nommu.c | 4 +-
> fs/read_write.c | 78 +++++--
> fs/reiserfs/file.c | 4 +-
> fs/reiserfs/inode.c | 7 +-
> fs/romfs/mmap-nommu.c | 2 +-
> fs/sysv/file.c | 4 +-
> fs/ubifs/file.c | 12 +-
> fs/udf/file.c | 13 +-
> fs/udf/inode.c | 10 +-
> fs/ufs/file.c | 4 +-
> fs/xfs/xfs_aops.c | 13 +-
> fs/xfs/xfs_file.c | 51 ++---
> include/linux/aio.h | 20 +-
> include/linux/bio.h | 8 +
> include/linux/blk_types.h | 2 -
> include/linux/fs.h | 165 ++++++++++++--
> include/linux/nfs_fs.h | 13 +-
> include/uapi/linux/aio_abi.h | 2 +
> include/uapi/linux/loop.h | 1 +
> mm/filemap.c | 433 ++++++++++++++----------------------
> mm/page_io.c | 15 +-
> mm/shmem.c | 61 ++---
> 87 files changed, 1862 insertions(+), 1002 deletions(-)
> create mode 100644 fs/iov-iter.c
>
> --
> 1.8.3.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
"Thought is the essence of where you are now."

2013-08-21 13:55:40

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 15/33] aio: add aio support for iov_iter arguments

On Thu, Jul 25, 2013 at 12:50:41PM -0500, Dave Kleikamp wrote:
> This adds iocb cmds which specify that memory is held in iov_iter
> structures. This lets kernel callers specify memory that can be
> expressed in an iov_iter, which includes pages in bio_vec arrays.
>
> Only kernel callers can provide an iov_iter so it doesn't make a lot of
> sense to expose the IOCB_CMD values for this as part of the user space
> ABI.

I don't think adding the IOCB_CMD_{READ,WRITE}_ITER operations to
include/uapi/linux/aio_abi.h is the right thing to do here -- they're
never going to be used by userland, and are certainly not part of the
abi we're presenting to userland. I'd suggest moving these opcodes to
include/linux/aio.h. Also, if you make the values > 16 bits, userland
will never be able to pass them in inadvertently (although things look
okay if that does happen at present).
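
A sketch of that suggestion, with illustrative values only:

	/* in include/linux/aio.h -- kernel-internal, not uapi; values
	 * above 16 bits so userland can never name them via io_submit() */
	#define IOCB_CMD_READ_ITER	0x10000
	#define IOCB_CMD_WRITE_ITER	0x10001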

-ben

> But kernel callers should also be able to perform the usual aio
> operations, which suggests using the existing operation namespace and
> support code.
>
> Signed-off-by: Dave Kleikamp <[email protected]>
> Cc: Zach Brown <[email protected]>
> ---
> fs/aio.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/aio.h | 3 ++
> include/uapi/linux/aio_abi.h | 2 ++
> 3 files changed, 72 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index c65ba13..0da82c0 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -991,6 +991,48 @@ static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
> return 0;
> }
>
> +static ssize_t aio_read_iter(struct kiocb *iocb)
> +{
> + struct file *file = iocb->ki_filp;
> + ssize_t ret;
> +
> + if (unlikely(!is_kernel_kiocb(iocb)))
> + return -EINVAL;
> +
> + if (unlikely(!(file->f_mode & FMODE_READ)))
> + return -EBADF;
> +
> + ret = security_file_permission(file, MAY_READ);
> + if (unlikely(ret))
> + return ret;
> +
> + if (!file->f_op->read_iter)
> + return -EINVAL;
> +
> + return file->f_op->read_iter(iocb, iocb->ki_iter, iocb->ki_pos);
> +}
> +
> +static ssize_t aio_write_iter(struct kiocb *iocb)
> +{
> + struct file *file = iocb->ki_filp;
> + ssize_t ret;
> +
> + if (unlikely(!is_kernel_kiocb(iocb)))
> + return -EINVAL;
> +
> + if (unlikely(!(file->f_mode & FMODE_WRITE)))
> + return -EBADF;
> +
> + ret = security_file_permission(file, MAY_WRITE);
> + if (unlikely(ret))
> + return ret;
> +
> + if (!file->f_op->write_iter)
> + return -EINVAL;
> +
> + return file->f_op->write_iter(iocb, iocb->ki_iter, iocb->ki_pos);
> +}
> +
> /*
> * aio_setup_iocb:
> * Performs the initial checks and aio retry method
> @@ -1042,6 +1084,14 @@ rw_common:
> ret = aio_rw_vect_retry(req, rw, rw_op);
> break;
>
> + case IOCB_CMD_READ_ITER:
> + ret = aio_read_iter(req);
> + break;
> +
> + case IOCB_CMD_WRITE_ITER:
> + ret = aio_write_iter(req);
> + break;
> +
> case IOCB_CMD_FDSYNC:
> if (!file->f_op->aio_fsync)
> return -EINVAL;
> @@ -1116,6 +1166,23 @@ void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
> }
> EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
>
> +/*
> + * The iter count must be set before calling here. Some filesystems use
> + * iocb->ki_left as an indicator of the size of an IO.
> + */
> +void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
> + unsigned short op, struct iov_iter *iter, loff_t off)
> +{
> + iocb->ki_filp = filp;
> + iocb->ki_iter = iter;
> + iocb->ki_opcode = op;
> + iocb->ki_pos = off;
> + iocb->ki_nbytes = iov_iter_count(iter);
> + iocb->ki_left = iocb->ki_nbytes;
> + iocb->ki_ctx = (void *)-1;
> +}
> +EXPORT_SYMBOL_GPL(aio_kernel_init_iter);
> +
> void aio_kernel_init_callback(struct kiocb *iocb,
> void (*complete)(u64 user_data, long res),
> u64 user_data)
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index 014a75d..64d059d 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -66,6 +66,7 @@ struct kiocb {
> * this is the underlying eventfd context to deliver events to.
> */
> struct eventfd_ctx *ki_eventfd;
> + struct iov_iter *ki_iter;
> };
>
> static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -102,6 +103,8 @@ struct kiocb *aio_kernel_alloc(gfp_t gfp);
> void aio_kernel_free(struct kiocb *iocb);
> void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
> unsigned short op, void *ptr, size_t nr, loff_t off);
> +void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
> + unsigned short op, struct iov_iter *iter, loff_t off);
> void aio_kernel_init_callback(struct kiocb *iocb,
> void (*complete)(u64 user_data, long res),
> u64 user_data);
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index bb2554f..22ce4bd 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -44,6 +44,8 @@ enum {
> IOCB_CMD_NOOP = 6,
> IOCB_CMD_PREADV = 7,
> IOCB_CMD_PWRITEV = 8,
> + IOCB_CMD_READ_ITER = 9,
> + IOCB_CMD_WRITE_ITER = 10,
> };
>
> /*
> --
> 1.8.3.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
"Thought is the essence of where you are now."

2013-08-21 16:30:35

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

Ben,
First, let me apologize for neglecting to copy you and linux-aio on the
applicable patches. I've been carrying along this patchset, assuming I
had gotten the proper cc's correct a while back, but I somehow missed
the aio pieces.

On 08/21/2013 08:02 AM, Benjamin LaHaise wrote:
> Hello Dave,
>
> On Thu, Jul 25, 2013 at 12:50:26PM -0500, Dave Kleikamp wrote:
>> This patch series adds a kernel interface to fs/aio.c so that kernel code can
>> issue concurrent asynchronous IO to file systems. It adds an aio command and
>> file system methods which specify io memory with pages instead of userspace
>> addresses.
>
> First off, have you tested that this series actually works when merged with
> the pending AIO changes from Kent? There is a git tree with those pending
> changes at git://git.kvack.org/~bcrl/aio-next.git, and they're in
> linux-next.

I've lightly tested the patchset against the linux-next tree, running a
fio job on loop-mounted filesystems of different fs types.

> One of the major problems your changeset continues to carry is that your
> new read_iter/write_iter operations permit blocking (implicitly), which
> really isn't what we want for aio. If you're going to introduce a new api,
> it should be made non-blocking, and enforce that non-blocking requirement
> (ie warn when read_iter/write_iter methods perform blocking operations,
> similar to the warnings when scheduling in atomic mode). This means more
> changes for some filesystem code involved, something that people have been
> avoiding for years, but which really needs to be done.

I'm not really sure how the read_iter and write_iter operations are more
likely to block than the current aio_read and aio_write operations. Am I
missing something?
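
For reference, the two signatures side by side, as they appear elsewhere
in this series; both are called from process context and neither carries
a non-blocking annotation, so the iter variant changes only how the
memory is described:

	ssize_t (*aio_read)(struct kiocb *, const struct iovec *,
			    unsigned long nr_segs, loff_t pos);
	ssize_t (*read_iter)(struct kiocb *, struct iov_iter *, loff_t pos);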

Thanks,
Dave

2013-08-21 16:39:27

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Aug 21, 2013 at 11:30:22AM -0500, Dave Kleikamp wrote:
> Ben,
> First, let me apologize for neglecting to copy you and linux-aio on the
> applicable patches. I've been carrying along this patchset, assuming I
> had gotten the proper cc's correct a while back, but I somehow missed
> the aio pieces.

Thanks. Let's figure out how to tackle this best.

> On 08/21/2013 08:02 AM, Benjamin LaHaise wrote:
...
> > First off, have you tested that this series actually works when merged with
> > the pending AIO changes from Kent? There is a git tree with those pending
> > changes at git://git.kvack.org/~bcrl/aio-next.git, and they're in
> > linux-next.
>
> I've lightly tested the patchset against the linux-next tree, running a
> fio job on loop-mounted filesystems of different fs types.

Good to hear.
>
> > One of the major problems your changeset continues to carry is that your
> > new read_iter/write_iter operations permit blocking (implicitly), which
> > really isn't what we want for aio. If you're going to introduce a new api,
> > it should be made non-blocking, and enforce that non-blocking requirement
> > (ie warn when read_iter/write_iter methods perform blocking operations,
> > similar to the warnings when scheduling in atomic mode). This means more
> > changes for some filesystem code involved, something that people have been
> > avoiding for years, but which really needs to be done.
>
> I'm not really sure how the read_iter and write_iter operations are more
> likely to block than the current aio_read and aio_write operations. Am I
> missing something?

What you say is true, however, my point is more that it will be far easier
to fix this issue by making it a hard constraint of a new API than it is
to do a system-wide retrofit. You're converting code over to use the new
API one by one, so adding a little bit more work to try and finally sort
out this issue while making those conversions would be very helpful.

I'm not saying that you should be required to write the code to cope with
this additional requirement (I'm perfectly happy to help with that, and
can probably get some time for that at $work), but more that if we're
going to be changing all of the filesystems, we might as well try to get
things right.

-ben

> Thanks,
> Dave

--
"Thought is the essence of where you are now."

2013-08-21 17:12:49

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On 08/21/2013 11:39 AM, Benjamin LaHaise wrote:
> On Wed, Aug 21, 2013 at 11:30:22AM -0500, Dave Kleikamp wrote:
>> Ben,
>> First, let me apologize for neglecting to copy you and linux-aio on the
>> applicable patches. I've been carrying along this patchset, assuming I
>> had gotten the proper cc's correct a while back, but I somehow missed
>> the aio pieces.
>
> Thanks. Let's figure out how to tackle this best.
>
>> On 08/21/2013 08:02 AM, Benjamin LaHaise wrote:
> ...
>>> First off, have you tested that this series actually works when merged with
>>> the pending AIO changes from Kent? There is a git tree with those pending
>>> changes at git://git.kvack.org/~bcrl/aio-next.git, and they're in
>>> linux-next.
>>
>> I've lightly tested the patchset against the linux-next tree, running a
>> fio job on loop-mounted filesystems of different fs types.
>
> Good to hear.
>>
>>> One of the major problems your changeset continues to carry is that your
>>> new read_iter/write_iter operations permit blocking (implicitly), which
>>> really isn't what we want for aio. If you're going to introduce a new api,
>>> it should be made non-blocking, and enforce that non-blocking requirement
>>> (ie warn when read_iter/write_iter methods perform blocking operations,
>>> similar to the warnings when scheduling in atomic mode). This means more
>>> changes for some filesystem code involved, something that people have been
>>> avoiding for years, but which really needs to be done.
>>
>> I'm not really sure how the read_iter and write_iter operations are more
>> likely to block than the current aio_read and aio_write operations. Am I
>> missing something?
>
> What you say is true, however, my point is more that it will be far easier
> to fix this issue by making it a hard constraint of a new API than it is
> to do a system-wide retrofit. You're converting code over to use the new
> API one by one, so adding a little bit more work to try and finally sort
> out this issue while making those conversions would be very helpful.
>
> I'm not saying that you should be required to write the code to cope with
> this additional requirement (I'm perfectly happy to help with that, and
> can probably get some time for that at $work), but more that if we're
> going to be changing all of the filesystems, we might as well try to get
> things right.

I don't really intend to make the patchset any more complicated than it
already is. The read/write_iter operations are intended to be as near a
replacement as possible to aio_read/write with the added ability to deal
with both kernel and user pages. A completely non-blocking interface
would be great, but that's a bit of work I'd rather not have to wait
for. Maybe that requirement can be added later.
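
For context, a minimal sketch of the kernel-side submission this series
enables, built from the helpers quoted in the patch above; the iov_iter
is assumed to already wrap the caller's bio_vec pages, and the
aio_kernel_submit() name and the my_complete() callback are assumptions,
not quoted code:

	struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);
	if (!iocb)
		return -ENOMEM;
	/* iter already describes the bio_vec pages to read into */
	aio_kernel_init_iter(iocb, file, IOCB_CMD_READ_ITER, iter, pos);
	/* my_complete(u64 user_data, long res) is caller-supplied */
	aio_kernel_init_callback(iocb, my_complete, (u64)(unsigned long)bio);
	ret = aio_kernel_submit(iocb);	/* assumed name of the submit hook */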

2013-08-21 19:30:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, 21 Aug 2013 09:02:31 -0400 Benjamin LaHaise <[email protected]> wrote:

> One of the major problems your changeset continues to carry is that your
> new read_iter/write_iter operations permit blocking (implicitly), which
> really isn't what we want for aio. If you're going to introduce a new api,
> it should be made non-blocking, and enforce that non-blocking requirement

It's been so incredibly long and I've forgotten everything AIO :(

In this context, "non-blocking" means no synchronous IO, yes? Even for
indirect blocks, etc. What about accidental D-state blockage in page
reclaim, or against random sleeping locks?

Also, why does this requirement exist? "99% async" is not good enough?
How come?

2013-08-21 20:24:53

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Aug 21, 2013 at 12:30:32PM -0700, Andrew Morton wrote:
> On Wed, 21 Aug 2013 09:02:31 -0400 Benjamin LaHaise <[email protected]> wrote:
>
> > One of the major problems your changeset continues to carry is that your
> > new read_iter/write_iter operations permit blocking (implicitly), which
> > really isn't what we want for aio. If you're going to introduce a new api,
> > it should be made non-blocking, and enforce that non-blocking requirement
>
> It's been so incredibly long and I've forgotten everything AIO :(
>
> In this context, "non-blocking" means no synchronous IO, yes? Even for
> indirect blocks, etc. What about accidental D-state blockage in page
> reclaim, or against random sleeping locks?

Those are all no-nos. Blocking for memory allocation for short durations
is okay, but not for wandering off into scan-the-world type ordeals (that
is, it should be avoided).

> Also, why does this requirement exist? "99% async" is not good enough?
> How come?

99% async is okay for the database folks, but not for all users. Think
unified event loops. For example, the application I'm currently working
on is using AIO to keep disk access from blocking the main thread. If
things go off and block on random locks or on disk I/O, bad things happen,
like watchdogs triggering. One of the real world requirements we have is
that the application has to keep running even if the disks we're running
on go bad. With SANs and multipath involved, sometimes I/O can take tens
of seconds to complete. You also don't want to block operations that can
proceed by those that are presently blocked, as that reduces the available
parallelism to devices and increases overall latency.

I'll admit there's a lot of work to be done in this area, hence why I've
done some work on thread based AIO recently, but threads aren't great for
all use-cases. Ultimately something like Zach's schedulable stacks are
needed to get the overhead down to something reasonable.

Still, we shouldn't keep on propagating broken APIs that don't reflect
actual requirements.

-ben
--
"Thought is the essence of where you are now."

2013-08-23 15:48:19

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH V8 11/33] dio: Convert direct_IO to use iov_iter

On Thu, Jul 25, 2013 at 7:50 PM, Dave Kleikamp <[email protected]> wrote:
> Change the direct_IO aop to take an iov_iter argument rather than an iovec.
> This will get passed down through most filesystems so that only the
> __blockdev_direct_IO helper need be aware of whether user or kernel memory
> is being passed to the function.

Lustre in -next also needs to be updated:

drivers/staging/lustre/lustre/llite/rw26.c:549: warning:
initialization from incompatible pointer type
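
The warning comes from the ->direct_IO prototype change this patch
makes; roughly (the new form is inferred from the description above and
may not match the final code exactly):

	/* before */
	ssize_t (*direct_IO)(int rw, struct kiocb *, const struct iovec *,
			     loff_t offset, unsigned long nr_segs);
	/* after */
	ssize_t (*direct_IO)(int rw, struct kiocb *, struct iov_iter *,
			     loff_t offset);

Lustre's rw26.c still initializes its address_space_operations with the
old form, hence the incompatible pointer type.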

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2013-08-30 20:06:00

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 15/33] aio: add aio support for iov_iter arguments

Sorry for the lack of response. Getting back to this.

On 08/21/2013 08:55 AM, Benjamin LaHaise wrote:
> On Thu, Jul 25, 2013 at 12:50:41PM -0500, Dave Kleikamp wrote:
>> This adds iocb cmds which specify that memory is held in iov_iter
>> structures. This lets kernel callers specify memory that can be
>> expressed in an iov_iter, which includes pages in bio_vec arrays.
>>
>> Only kernel callers can provide an iov_iter so it doesn't make a lot of
>> sense to expose the IOCB_CMD values for this as part of the user space
>> ABI.
>
> I don't think adding the IOCB_CMD_{READ,WRITE}_ITER operations to
> include/uapi/linux/aio_abi.h is the right thing to do here -- they're
> never going to be used by userland, and are certainly not part of the
> abi we're presenting to userland. I'd suggest moving these opcodes to
> include/linux/aio.h.

Agreed.

> Also, if you make the values > 16 bits, userland
> will never be able to pass them in inadvertently (although things look
> okay if that does happen at present).

I'd have to change the declaration of ki_opcode to an int. This
shouldn't be a problem since it'll be padded to a long anyway.

>
> -ben
>
>> But kernel callers should also be able to perform the usual aio
>> operations, which suggests using the existing operation namespace and
>> support code.
>>
>> Signed-off-by: Dave Kleikamp <[email protected]>
>> Cc: Zach Brown <[email protected]>
>> ---
>> fs/aio.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/aio.h | 3 ++
>> include/uapi/linux/aio_abi.h | 2 ++
>> 3 files changed, 72 insertions(+)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index c65ba13..0da82c0 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -991,6 +991,48 @@ static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
>> return 0;
>> }
>>
>> +static ssize_t aio_read_iter(struct kiocb *iocb)
>> +{
>> + struct file *file = iocb->ki_filp;
>> + ssize_t ret;
>> +
>> + if (unlikely(!is_kernel_kiocb(iocb)))
>> + return -EINVAL;
>> +
>> + if (unlikely(!(file->f_mode & FMODE_READ)))
>> + return -EBADF;
>> +
>> + ret = security_file_permission(file, MAY_READ);
>> + if (unlikely(ret))
>> + return ret;
>> +
>> + if (!file->f_op->read_iter)
>> + return -EINVAL;
>> +
>> + return file->f_op->read_iter(iocb, iocb->ki_iter, iocb->ki_pos);
>> +}
>> +
>> +static ssize_t aio_write_iter(struct kiocb *iocb)
>> +{
>> + struct file *file = iocb->ki_filp;
>> + ssize_t ret;
>> +
>> + if (unlikely(!is_kernel_kiocb(iocb)))
>> + return -EINVAL;
>> +
>> + if (unlikely(!(file->f_mode & FMODE_WRITE)))
>> + return -EBADF;
>> +
>> + ret = security_file_permission(file, MAY_WRITE);
>> + if (unlikely(ret))
>> + return ret;
>> +
>> + if (!file->f_op->write_iter)
>> + return -EINVAL;
>> +
>> + return file->f_op->write_iter(iocb, iocb->ki_iter, iocb->ki_pos);
>> +}
>> +
>> /*
>> * aio_setup_iocb:
>> * Performs the initial checks and aio retry method
>> @@ -1042,6 +1084,14 @@ rw_common:
>> ret = aio_rw_vect_retry(req, rw, rw_op);
>> break;
>>
>> + case IOCB_CMD_READ_ITER:
>> + ret = aio_read_iter(req);
>> + break;
>> +
>> + case IOCB_CMD_WRITE_ITER:
>> + ret = aio_write_iter(req);
>> + break;
>> +
>> case IOCB_CMD_FDSYNC:
>> if (!file->f_op->aio_fsync)
>> return -EINVAL;
>> @@ -1116,6 +1166,23 @@ void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
>> }
>> EXPORT_SYMBOL_GPL(aio_kernel_init_rw);
>>
>> +/*
>> + * The iter count must be set before calling here. Some filesystems use
>> + * iocb->ki_left as an indicator of the size of an IO.
>> + */
>> +void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
>> + unsigned short op, struct iov_iter *iter, loff_t off)
>> +{
>> + iocb->ki_filp = filp;
>> + iocb->ki_iter = iter;
>> + iocb->ki_opcode = op;
>> + iocb->ki_pos = off;
>> + iocb->ki_nbytes = iov_iter_count(iter);
>> + iocb->ki_left = iocb->ki_nbytes;
>> + iocb->ki_ctx = (void *)-1;
>> +}
>> +EXPORT_SYMBOL_GPL(aio_kernel_init_iter);
>> +
>> void aio_kernel_init_callback(struct kiocb *iocb,
>> void (*complete)(u64 user_data, long res),
>> u64 user_data)
>> diff --git a/include/linux/aio.h b/include/linux/aio.h
>> index 014a75d..64d059d 100644
>> --- a/include/linux/aio.h
>> +++ b/include/linux/aio.h
>> @@ -66,6 +66,7 @@ struct kiocb {
>> * this is the underlying eventfd context to deliver events to.
>> */
>> struct eventfd_ctx *ki_eventfd;
>> + struct iov_iter *ki_iter;
>> };
>>
>> static inline bool is_sync_kiocb(struct kiocb *kiocb)
>> @@ -102,6 +103,8 @@ struct kiocb *aio_kernel_alloc(gfp_t gfp);
>> void aio_kernel_free(struct kiocb *iocb);
>> void aio_kernel_init_rw(struct kiocb *iocb, struct file *filp,
>> unsigned short op, void *ptr, size_t nr, loff_t off);
>> +void aio_kernel_init_iter(struct kiocb *iocb, struct file *filp,
>> + unsigned short op, struct iov_iter *iter, loff_t off);
>> void aio_kernel_init_callback(struct kiocb *iocb,
>> void (*complete)(u64 user_data, long res),
>> u64 user_data);
>> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
>> index bb2554f..22ce4bd 100644
>> --- a/include/uapi/linux/aio_abi.h
>> +++ b/include/uapi/linux/aio_abi.h
>> @@ -44,6 +44,8 @@ enum {
>> IOCB_CMD_NOOP = 6,
>> IOCB_CMD_PREADV = 7,
>> IOCB_CMD_PWRITEV = 8,
>> + IOCB_CMD_READ_ITER = 9,
>> + IOCB_CMD_WRITE_ITER = 10,
>> };
>>
>> /*
>> --
>> 1.8.3.4
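
For reference, an in-kernel caller would drive this interface roughly
as follows. This is a sketch only: aio_kernel_submit() comes from the
earlier "aio: add aio_kernel_() interface" patch, my_complete() is a
made-up callback, and error handling is trimmed.

  /* Issue one async read of a bio_vec- or iovec-backed iov_iter. */
  static void my_complete(u64 user_data, long res)
  {
          /* res is the byte count on success or a negative errno */
  }

  static int issue_kernel_read(struct file *filp, struct iov_iter *iter,
                               loff_t pos)
  {
          struct kiocb *iocb = aio_kernel_alloc(GFP_NOIO);

          if (!iocb)
                  return -ENOMEM;

          /* the iter's count must already be set (see comment above) */
          aio_kernel_init_iter(iocb, filp, IOCB_CMD_READ_ITER, iter, pos);
          aio_kernel_init_callback(iocb, my_complete, (u64)(unsigned long)iocb);
          return aio_kernel_submit(iocb);
  }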

2013-10-14 15:07:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

Ben,

are you fine with the series now? It's been in linux-next for a while
and it would be really helpful to get it in for the various places
trying to do in-kernel file aio without going through the page cache.

2013-10-14 21:29:14

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Mon, Oct 14, 2013 at 08:07:01AM -0700, Christoph Hellwig wrote:
> Ben,
>
> are you fine with the series now? It's been in linux-next for a while
> and it would be really helpful to get it in for the various places
> trying to do in-kernel file aio without going through the page cache.

No, I am not okay with it. The feedback I provided 2 months ago has yet to
be addressed.

-ben
--
"Thought is the essence of where you are now."

2013-10-15 16:55:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Mon, Oct 14, 2013 at 05:29:10PM -0400, Benjamin LaHaise wrote:
> On Mon, Oct 14, 2013 at 08:07:01AM -0700, Christoph Hellwig wrote:
> > Ben,
> >
> > are you fine with the series now? It's been in linux-next for a while
> > and it would be really helpful to get it in for the various places
> > trying to do in-kernel file aio without going through the page cache.
>
> No, I am not okay with it. The feedback I provided 2 months ago has yet to
> be addressed.

Maybe I'm missing something, but the only big discussion item was that
you'd want something totally unrelated (notification for blocking)
mashed into this patch set.

While I agree that getting that would be useful, it has nothing to do
with issuing aio from kernel space, and holding this patchset hostage
for something you'd like to see, but which was complicated enough that
no one even tried it for many years, seems entirely unreasonable.

If there are any other issues left that I have missed, it would be
nice to get a pointer to them, or a quick summary.

2013-10-15 17:14:51

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Tue, Oct 15, 2013 at 09:55:20AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 14, 2013 at 05:29:10PM -0400, Benjamin LaHaise wrote:
> > On Mon, Oct 14, 2013 at 08:07:01AM -0700, Christoph Hellwig wrote:
> > > Ben,
> > >
> > > are you fine with the series now? It's been in linux-next for a while
> > > and it would be really helpful to get it in for the various places
> > > trying to do in-kernel file aio without going through the page cache.
> >
> > No, I am not okay with it. The feedback I provided 2 months ago has yet to
> > be addressed.
>
> Maybe I'm missing something, but the only big discussion item was that
> you'd want something totally unrelated (notification for blocking)
> mashed into this patch set.

No, that is not what I was referring to.

> While I agree that getting that would be useful, it has nothing to do
> with issuing aio from kernel space, and holding this patchset hostage
> for something you'd like to see, but which was complicated enough that
> no one even tried it for many years, seems entirely unreasonable.
>
> If there are any other issues left that I have missed, it would be
> nice to get a pointer to them, or a quick summary.

The item I was referring to is removing the opcodes used for in-kernel
purposes from the range that the userland-accessible opcodes can reach.
That is, put them above the 16-bit limit for userspace opcodes.
There is absolutely no reason to expose kernel internal opcodes via the
userspace exported includes. It's a simple and reasonable change, and I
see no reason for Dave not to make that modification. Until that is
done, I will nak the changes.
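
To be clear about why that works: the io_submit() ABI carries the
opcode in a 16-bit field, so any value defined above that range is
unreachable from userspace regardless of what an application passes
in. From include/uapi/linux/aio_abi.h:

  struct iocb {
          ...
          /* common fields */
          __u16   aio_lio_opcode; /* see IOCB_CMD_ above */
          __s16   aio_reqprio;
          __u32   aio_fildes;
          ...
  };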

-ben
--
"Thought is the essence of where you are now."

2013-10-15 17:18:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Tue, Oct 15, 2013 at 01:14:47PM -0400, Benjamin LaHaise wrote:
> > While I agree that getting that would be useful, it has nothing to do
> > with issuing aio from kernel space, and holding this patchset hostage
> > for something you'd like to see, but which was complicated enough that
> > no one even tried it for many years, seems entirely unreasonable.
> >
> > If there are any other issues left that I have missed, it would be
> > nice to get a pointer to them, or a quick summary.
>
> The item I was referring to is removing the opcodes used for in-kernel
> purposes from the range that the userland-accessible opcodes can reach.
> That is, put them above the 16-bit limit for userspace opcodes.
> There is absolutely no reason to expose kernel internal opcodes via the
> userspace exported includes. It's a simple and reasonable change, and I
> see no reason for Dave not to make that modification. Until that is
> done, I will nak the changes.

Oh, missed that. I totally agree that it needs to be done.

Dave, will you have time to do it soon or should I look into it myself?

2013-10-15 17:54:33

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On 10/15/2013 12:18 PM, Christoph Hellwig wrote:
> On Tue, Oct 15, 2013 at 01:14:47PM -0400, Benjamin LaHaise wrote:
>>> While I agree that getting that would be useful, it has nothing to do
>>> with issuing aio from kernel space, and holding this patchset hostage
>>> for something you'd like to see, but which was complicated enough that
>>> no one even tried it for many years, seems entirely unreasonable.
>>>
>>> If there are any other issues left that I have missed, it would be
>>> nice to get a pointer to them, or a quick summary.
>>
>> The item I was referring to is removing the opcodes used for in-kernel
>> purposes from the range that the userland-accessible opcodes can reach.
>> That is, put them above the 16-bit limit for userspace opcodes.
>> There is absolutely no reason to expose kernel internal opcodes via the
>> userspace exported includes. It's a simple and reasonable change, and I
>> see no reason for Dave not to make that modification. Until that is
>> done, I will nak the changes.
>
> Oh, missed that. I totally agree that it needs to be done.
>
> Dave, will you have time to do it soon or should I look into it myself?

I'll take care of it. I actually made this change and somehow misplaced it.

Sorry about that.

Dave

2014-12-31 20:38:07

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Thu, Jul 25, 2013 at 7:50 PM, Dave Kleikamp <[email protected]> wrote:
> This patch series adds a kernel interface to fs/aio.c so that kernel code can
> issue concurrent asynchronous IO to file systems. It adds an aio command and
> file system methods which specify io memory with pages instead of userspace
> addresses.
>
> This series was written to reduce the current overhead loop imposes by
> performing synchronous buffered file system IO from a kernel thread. These
> patches turn loop into a lightweight layer that translates bios into iocbs.
>
> It introduces new file ops, read_iter() and write_iter(), that replace the
> aio_read() and aio_write() operations. The iov_iter structure can now contain
> either a user-space iovec or a kernel-space bio_vec. Since it would be
> overly complicated to replace every instance of aio_read() and aio_write(),
> the old operations are not removed, but file systems implementing the new
> ones need not keep the old ones.
>
> Version 8 is little changed from Version 7 that I sent out in March, just
> updated to the latest kernel. These patches apply to 3.11-rc2 and can
> also be found at:
>
> git://github.com/kleikamp/linux-shaggy.git aio_loop
>

What has happened to that aio_loop patchset?
Is it in Linux-next?
( /me started to play with "block: loop: convert to blk-mq (v3)", so I
recalled this other improvement. )

- Sedat -

> Asias He (1):
> block_dev: add support for read_iter, write_iter
>
> Dave Kleikamp (22):
> iov_iter: iov_iter_copy_from_user() should use non-atomic copy
> iov_iter: add __iovec_copy_to_user()
> fuse: convert fuse to use iov_iter_copy_[to|from]_user
> iov_iter: ii_iovec_copy_to_user should pre-fault user pages
> dio: Convert direct_IO to use iov_iter
> dio: add bio_vec support to __blockdev_direct_IO()
> aio: add aio_kernel_() interface
> aio: add aio support for iov_iter arguments
> fs: create file_readable() and file_writable() functions
> fs: use read_iter and write_iter rather than aio_read and aio_write
> fs: add read_iter and write_iter to several file systems
> ocfs2: add support for read_iter and write_iter
> ext4: add support for read_iter and write_iter
> nfs: add support for read_iter, write_iter
> nfs: simplify swap
> btrfs: add support for read_iter and write_iter
> xfs: add support for read_iter and write_iter
> gfs2: Convert aio_read/write ops to read/write_iter
> udf: convert file ops from aio_read/write to read/write_iter
> afs: add support for read_iter and write_iter
> ecrpytfs: Convert aio_read/write ops to read/write_iter
> ubifs: convert file ops from aio_read/write to read/write_iter
>
> Hugh Dickins (1):
> tmpfs: add support for read_iter and write_iter
>
> Zach Brown (9):
> iov_iter: move into its own file
> iov_iter: add copy_to_user support
> iov_iter: hide iovec details behind ops function pointers
> iov_iter: add bvec support
> iov_iter: add a shorten call
> iov_iter: let callers extract iovecs and bio_vecs
> fs: pull iov_iter use higher up the stack
> bio: add bvec_length(), like iov_length()
> loop: use aio to perform io on the underlying file
>
> Documentation/filesystems/Locking | 6 +-
> Documentation/filesystems/vfs.txt | 12 +-
> drivers/block/loop.c | 148 ++++++++----
> drivers/char/raw.c | 4 +-
> drivers/mtd/nand/nandsim.c | 4 +-
> drivers/usb/gadget/storage_common.c | 4 +-
> fs/9p/vfs_addr.c | 12 +-
> fs/9p/vfs_file.c | 8 +-
> fs/Makefile | 2 +-
> fs/adfs/file.c | 4 +-
> fs/affs/file.c | 4 +-
> fs/afs/file.c | 4 +-
> fs/afs/internal.h | 3 +-
> fs/afs/write.c | 9 +-
> fs/aio.c | 152 ++++++++++++-
> fs/bad_inode.c | 14 ++
> fs/bfs/file.c | 4 +-
> fs/block_dev.c | 27 ++-
> fs/btrfs/file.c | 42 ++--
> fs/btrfs/inode.c | 63 +++---
> fs/ceph/addr.c | 3 +-
> fs/cifs/file.c | 4 +-
> fs/direct-io.c | 223 +++++++++++++------
> fs/ecryptfs/file.c | 15 +-
> fs/exofs/file.c | 4 +-
> fs/ext2/file.c | 4 +-
> fs/ext2/inode.c | 8 +-
> fs/ext3/file.c | 4 +-
> fs/ext3/inode.c | 15 +-
> fs/ext4/ext4.h | 3 +-
> fs/ext4/file.c | 34 +--
> fs/ext4/indirect.c | 16 +-
> fs/ext4/inode.c | 23 +-
> fs/f2fs/data.c | 4 +-
> fs/f2fs/file.c | 4 +-
> fs/fat/file.c | 4 +-
> fs/fat/inode.c | 10 +-
> fs/fuse/cuse.c | 10 +-
> fs/fuse/file.c | 90 ++++----
> fs/fuse/fuse_i.h | 5 +-
> fs/gfs2/aops.c | 7 +-
> fs/gfs2/file.c | 21 +-
> fs/hfs/inode.c | 11 +-
> fs/hfsplus/inode.c | 10 +-
> fs/hostfs/hostfs_kern.c | 4 +-
> fs/hpfs/file.c | 4 +-
> fs/internal.h | 4 +
> fs/iov-iter.c | 411 ++++++++++++++++++++++++++++++++++
> fs/jffs2/file.c | 8 +-
> fs/jfs/file.c | 4 +-
> fs/jfs/inode.c | 7 +-
> fs/logfs/file.c | 4 +-
> fs/minix/file.c | 4 +-
> fs/nfs/direct.c | 302 ++++++++++++++++---------
> fs/nfs/file.c | 33 ++-
> fs/nfs/internal.h | 4 +-
> fs/nfs/nfs4file.c | 4 +-
> fs/nilfs2/file.c | 4 +-
> fs/nilfs2/inode.c | 8 +-
> fs/ocfs2/aops.c | 8 +-
> fs/ocfs2/aops.h | 2 +-
> fs/ocfs2/file.c | 55 ++---
> fs/ocfs2/ocfs2_trace.h | 6 +-
> fs/omfs/file.c | 4 +-
> fs/ramfs/file-mmu.c | 4 +-
> fs/ramfs/file-nommu.c | 4 +-
> fs/read_write.c | 78 +++++--
> fs/reiserfs/file.c | 4 +-
> fs/reiserfs/inode.c | 7 +-
> fs/romfs/mmap-nommu.c | 2 +-
> fs/sysv/file.c | 4 +-
> fs/ubifs/file.c | 12 +-
> fs/udf/file.c | 13 +-
> fs/udf/inode.c | 10 +-
> fs/ufs/file.c | 4 +-
> fs/xfs/xfs_aops.c | 13 +-
> fs/xfs/xfs_file.c | 51 ++---
> include/linux/aio.h | 20 +-
> include/linux/bio.h | 8 +
> include/linux/blk_types.h | 2 -
> include/linux/fs.h | 165 ++++++++++++--
> include/linux/nfs_fs.h | 13 +-
> include/uapi/linux/aio_abi.h | 2 +
> include/uapi/linux/loop.h | 1 +
> mm/filemap.c | 433 ++++++++++++++----------------------
> mm/page_io.c | 15 +-
> mm/shmem.c | 61 ++---
> 87 files changed, 1862 insertions(+), 1002 deletions(-)
> create mode 100644 fs/iov-iter.c
>
> --
> 1.8.3.4
>

2014-12-31 21:53:24

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On 12/31/2014 02:38 PM, Sedat Dilek wrote:
>
> What has happened to that aio_loop patchset?
> Is it in Linux-next?
> ( /me started to play with "block: loop: convert to blk-mq (v3)", so I
> recalled this other improvement. )

It met with some harsh resistance, so I backed off on it. Then Al Viro
got busy re-writing the iov_iter infrastructure and I put my patchset on
the shelf to look at later. Then Ming Lei submitted a more up-to-date
patchset: https://lkml.org/lkml/2014/8/6/175

It looks like Ming is currently only pushing the first half of that
patchset. I don't know what his plans are for the last three patches:

aio: add aio_kernel_() interface
fd/direct-io: introduce should_dirty for kernel aio
block: loop: support to submit I/O via kernel aio based

Dave

2014-12-31 22:35:13

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Wed, Dec 31, 2014 at 10:52 PM, Dave Kleikamp
<[email protected]> wrote:
> On 12/31/2014 02:38 PM, Sedat Dilek wrote:
>>
>> What has happened to that aio_loop patchset?
>> Is it in Linux-next?
>> ( /me started to play with "block: loop: convert to blk-mq (v3)", so I
>> recalled this other improvement. )
>
> It met with some harsh resistance, so I backed off on it. Then Al Viro
> got busy re-writing the iov_iter infrastructure and I put my patchset on
> the shelf to look at later. Then Ming Lei submitted a more up-to-date
> patchset: https://lkml.org/lkml/2014/8/6/175
>
> It looks like Ming is currently only pushing the first half of that
> patchset. I don't know what his plans are for the last three patches:
>
> aio: add aio_kernel_() interface
> fd/direct-io: introduce should_dirty for kernel aio
> block: loop: support to submit I/O via kernel aio based
>

I tested with block-mq-v3 (for next-20141231) [1] and this looks promising [2].

Maybe Ming can say what the plan is with the missing parts.

- Sedat -

[1] http://marc.info/?l=linux-kernel&m=142003226701471&w=2
[2] http://marc.info/?l=linux-kernel&m=142006516408381&w=2

2015-01-01 00:52:53

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V8 00/33] loop: Issue O_DIRECT aio using bio_vec

On Thu, Jan 1, 2015 at 6:35 AM, Sedat Dilek <[email protected]> wrote:
> On Wed, Dec 31, 2014 at 10:52 PM, Dave Kleikamp
> <[email protected]> wrote:
>> On 12/31/2014 02:38 PM, Sedat Dilek wrote:
>>>
>>> What has happened to that aio_loop patchset?
>>> Is it in Linux-next?
>>> ( /me started to play with "block: loop: convert to blk-mq (v3)", so I
>>> recalled this other improvement. )
>>
>> It met with some harsh resistance, so I backed off on it. Then Al Viro
>> got busy re-writing the iov_iter infrastructure and I put my patchset on
>> the shelf to look at later. Then Ming Lei submitted a more up-to-date
>> patchset: https://lkml.org/lkml/2014/8/6/175
>>
>> It looks like Ming is currently only pushing the first half of that
>> patchset. I don't know what his plans are for the last three patches:
>>
>> aio: add aio_kernel_() interface
>> fd/direct-io: introduce should_dirty for kernel aio
>> block: loop: support to submit I/O via kernel aio based
>>
>
> I tested with block-mq-v3 (for next-20141231) [1] and this looks promising [2].
>
> Maybe Ming can say what the plan is with the missing parts.

I have compared kernel aio based loop-mq (the other 3 aio patches
applied against loop-mq v2, [1]) with loop-mq v3, and the data doesn't
look better than loop-mq v3.

The kernel aio based approach requires direct I/O, and direct writes,
at least, shouldn't perform as well as page cache writes, IMO.

So I think we need to investigate the kernel aio based approach further
wrt. loop improvement.

[1] http://marc.info/?l=linux-kernel&m=140941494422520&w=2

Thanks,
Ming Lei
>
> - Sedat -
>
> [1] http://marc.info/?l=linux-kernel&m=142003226701471&w=2
> [2] http://marc.info/?l=linux-kernel&m=142006516408381&w=2