2014-12-15 05:27:27

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 0/8] clean up and generalize swap-over-NFS

Hi, everyone,

This patch series contains all of the non-BTRFS changes that I've made
as a part of implementing swap file support on BTRFS. The BTRFS parts of
that series (https://lkml.org/lkml/2014/12/9/718) are still undergoing
development, and the non-BTRFS changes now outnumber those within BTRFS,
so it'll probably work best to get these in separately.

Long story short, the generic swap file infrastructure introduced for
swap-over-NFS isn't quite ready for other clients without making some
changes.

Before I forget, this patch series was built against cbfe0de in Linus'
tree (to avoid conflicts with the recent iov_iter work).

Patches 1 and 2 fix an issue with NFS and the swap file infrastructure
not following the direct_IO locking conventions, leading to locking
issues for anyone else trying to use the interface (discussed here:
https://lkml.org/lkml/2014/12/12/677).

Patch 3 removes the ITER_BVEC flag from the rw argument passed to
direct_IO, as many, but not all, direct_IO implementations expect either
rw == READ or rw == WRITE. The lack of documentation about what's
correct here is probably going to break something at some point, but
that's another conversation.

Patch 4 adds iov_iter_bvec for swap_writepage, the upcoming
swap_readpage change, and splice.

Patches 5 and 6 are preparation for patch 7, teaching the VFS and NFS to
handle ITER_BVEC reads.

Patch 7 is the biggest change in the series: it changes swap_readpage to
proxy through ->direct_IO rather than ->readpage. Using readpage for a
swapcache page requires all sorts of messy workarounds (see here for
context: https://lkml.org/lkml/2014/11/19/46). Patch 8 updates the
documentation accordingly.

Thanks!

Omar Sandoval (8):
nfs: follow direct I/O write locking convention
swap: lock i_mutex for swap_writepage direct_IO
swap: don't add ITER_BVEC flag to direct_IO rw
iov_iter: add iov_iter_bvec and convert callers
direct-io: don't dirty ITER_BVEC pages on read
nfs: don't dirty ITER_BVEC pages read through direct I/O
swap: use direct I/O for SWP_FILE swap_readpage
vfs: update swap_{,de}activate documentation

Documentation/filesystems/Locking | 7 +++---
Documentation/filesystems/vfs.txt | 7 +++---
fs/direct-io.c | 8 ++++---
fs/nfs/direct.c | 17 ++++++++-------
fs/nfs/file.c | 8 +++++--
fs/splice.c | 7 ++----
include/linux/uio.h | 2 ++
mm/iov_iter.c | 12 +++++++++++
mm/page_io.c | 45 ++++++++++++++++++++++++++++-----------
9 files changed, 76 insertions(+), 37 deletions(-)

--
2.1.3


2014-12-15 05:27:57

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 5/8] direct-io: don't dirty ITER_BVEC pages on read

Reads through the iov_iter infrastructure for kernel pages shouldn't be
dirtied by the direct I/O code.

This is based on Dave Kleikamp's and Ming Lei's previously posted
patches.

Cc: Ming Lei <[email protected]>
Acked-by: Dave Kleikamp <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
---
fs/direct-io.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e542ce4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -120,6 +120,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
int page_errors; /* errno from get_user_pages() */
int is_async; /* is IO async ? */
+ int should_dirty; /* should we mark read pages dirty? */
bool defer_completion; /* defer AIO completion to workqueue? */
int io_error; /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);

- if (dio->is_async && dio->rw == READ)
+ if (dio->is_async && dio->rw == READ && dio->should_dirty)
bio_set_pages_dirty(bio);

if (sdio->submit_io)
@@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
if (!uptodate)
dio->io_error = -EIO;

- if (dio->is_async && dio->rw == READ) {
+ if (dio->is_async && dio->rw == READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;

- if (dio->rw == READ && !PageCompound(page))
+ if (dio->rw == READ && !PageCompound(page) && dio->should_dirty)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,

dio->inode = inode;
dio->rw = rw;
+ dio->should_dirty = !(iter->type & ITER_BVEC);

/*
* For AIO O_(D)SYNC writes we need to defer completions to a workqueue
--
2.1.3

2014-12-15 05:28:04

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 1/8] nfs: follow direct I/O write locking convention

The generic callers of direct_IO lock i_mutex before doing a write. NFS
doesn't use the generic write code, so it doesn't follow this
convention. This is now a problem because the interface introduced for
swap-over-NFS calls direct_IO for a write without holding i_mutex, but
other implementations of direct_IO will expect to have it locked.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/nfs/direct.c | 12 +++++-------
fs/nfs/file.c | 8 ++++++--
2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 10bf072..9402b96 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -906,17 +906,15 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
if (!count)
goto out;

- mutex_lock(&inode->i_mutex);
-
result = nfs_sync_mapping(mapping);
if (result)
- goto out_unlock;
+ goto out;

if (mapping->nrpages) {
result = invalidate_inode_pages2_range(mapping,
pos >> PAGE_CACHE_SHIFT, end);
if (result)
- goto out_unlock;
+ goto out;
}

task_io_account_write(count);
@@ -924,7 +922,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
result = -ENOMEM;
dreq = nfs_direct_req_alloc();
if (!dreq)
- goto out_unlock;
+ goto out;

dreq->inode = inode;
dreq->bytes_left = count;
@@ -960,12 +958,12 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter,
}
}
nfs_direct_req_release(dreq);
+
+ mutex_lock(&inode->i_mutex);
return result;

out_release:
nfs_direct_req_release(dreq);
-out_unlock:
- mutex_unlock(&inode->i_mutex);
out:
return result;
}
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2ab6f00..8b80276 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -675,8 +675,12 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
if (result)
return result;

- if (file->f_flags & O_DIRECT)
- return nfs_file_direct_write(iocb, from, pos);
+ if (file->f_flags & O_DIRECT) {
+ mutex_lock(&inode->i_mutex);
+ result = nfs_file_direct_write(iocb, from, pos);
+ mutex_unlock(&inode->i_mutex);
+ return result;
+ }

dprintk("NFS: write(%pD2, %zu@%Ld)\n",
file, count, (long long) pos);
--
2.1.3

2014-12-15 05:28:03

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 6/8] nfs: don't dirty ITER_BVEC pages read through direct I/O

As with the generic blockdev code, kernel pages shouldn't be dirtied by
the direct I/O path.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/nfs/direct.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 9402b96..a502b3f 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -88,6 +88,7 @@ struct nfs_direct_req {
struct pnfs_ds_commit_info ds_cinfo; /* Storage for cinfo */
struct work_struct work;
int flags;
+ int should_dirty; /* should we mark read pages dirty? */
#define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */
#define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
struct nfs_writeverf verf; /* unstable write verifier */
@@ -370,7 +371,8 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr)
struct nfs_page *req = nfs_list_entry(hdr->pages.next);
struct page *page = req->wb_page;

- if (!PageCompound(page) && bytes < hdr->good_bytes)
+ if (!PageCompound(page) && bytes < hdr->good_bytes &&
+ dreq->should_dirty)
set_page_dirty(page);
bytes += req->wb_bytes;
nfs_list_remove_request(req);
@@ -542,6 +544,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
dreq->inode = inode;
dreq->bytes_left = count;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+ dreq->should_dirty = !(iter->type & ITER_BVEC);
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
result = PTR_ERR(l_ctx);
--
2.1.3

2014-12-15 05:27:54

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 4/8] iov_iter: add iov_iter_bvec and convert callers

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/splice.c | 7 ++-----
include/linux/uio.h | 2 ++
mm/iov_iter.c | 12 ++++++++++++
mm/page_io.c | 14 +++++---------
4 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 75c6058..7c7176f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1006,11 +1006,8 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
}

/* ... iov_iter */
- from.type = ITER_BVEC | WRITE;
- from.bvec = array;
- from.nr_segs = n;
- from.count = sd.total_len - left;
- from.iov_offset = 0;
+ iov_iter_bvec(&from, ITER_BVEC | WRITE, array, n,
+ sd.total_len - left);

/* ... and iocb */
init_sync_kiocb(&kiocb, out);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index bd8569a..d1a34b4 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -90,6 +90,8 @@ void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
void iov_iter_kvec(struct iov_iter *i, int direction, const struct kvec *iov,
unsigned long nr_segs, size_t count);
+void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bv,
+ unsigned long nr_segs, size_t count);
ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index a1599ca..c975bc4 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -513,6 +513,18 @@ void iov_iter_kvec(struct iov_iter *i, int direction,
}
EXPORT_SYMBOL(iov_iter_kvec);

+void iov_iter_bvec(struct iov_iter *i, int direction, const struct bio_vec *bv,
+ unsigned long nr_segs, size_t count)
+{
+ BUG_ON(!(direction & ITER_BVEC));
+ i->type = direction;
+ i->bvec = bv;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count;
+}
+EXPORT_SYMBOL(iov_iter_bvec);
+
unsigned long iov_iter_alignment(const struct iov_iter *i)
{
unsigned long res = 0;
diff --git a/mm/page_io.c b/mm/page_io.c
index c229f88..4741248 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -265,18 +265,14 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
struct file *swap_file = sis->swap_file;
struct inode *inode = file_inode(swap_file);
struct address_space *mapping = swap_file->f_mapping;
+ struct iov_iter from;
struct bio_vec bv = {
.bv_page = page,
- .bv_len = PAGE_SIZE,
- .bv_offset = 0
+ .bv_len = PAGE_SIZE,
+ .bv_offset = 0,
};
- struct iov_iter from = {
- .type = ITER_BVEC | WRITE,
- .count = PAGE_SIZE,
- .iov_offset = 0,
- .nr_segs = 1,
- };
- from.bvec = &bv; /* older gcc versions are broken */
+
+ iov_iter_bvec(&from, ITER_BVEC | WRITE, &bv, 1, PAGE_SIZE);

init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
--
2.1.3

2014-12-15 05:27:50

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 3/8] swap: don't add ITER_BVEC flag to direct_IO rw

The rw argument to direct_IO has some ill-defined semantics. Some
filesystems (e.g., ext4, FAT) decide whether they're doing a write with
rw == WRITE, but others (e.g., XFS) check rw & WRITE. Let's set a good
example in the swap file code and say ITER_BVEC belongs in
iov_iter->flags but not in rw. This caters to the least common
denominator and avoids a sweeping change of every direct_IO
implementation for now.

Signed-off-by: Omar Sandoval <[email protected]>
---
mm/page_io.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 1630ac0..c229f88 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -285,8 +285,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
set_page_writeback(page);
unlock_page(page);
mutex_lock(&inode->i_mutex);
- ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
- &kiocb, &from,
+ ret = mapping->a_ops->direct_IO(WRITE, &kiocb, &from,
kiocb.ki_pos);
mutex_unlock(&inode->i_mutex);
if (ret == PAGE_SIZE) {
--
2.1.3

2014-12-15 05:27:45

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
be held, but most direct_IO implementations do.

Signed-off-by: Omar Sandoval <[email protected]>
---
mm/page_io.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..1630ac0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -263,6 +263,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
if (sis->flags & SWP_FILE) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
+ struct inode *inode = file_inode(swap_file);
struct address_space *mapping = swap_file->f_mapping;
struct bio_vec bv = {
.bv_page = page,
@@ -283,9 +284,11 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

set_page_writeback(page);
unlock_page(page);
+ mutex_lock(&inode->i_mutex);
ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
&kiocb, &from,
kiocb.ki_pos);
+ mutex_unlock(&inode->i_mutex);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
ret = 0;
--
2.1.3

2014-12-15 05:29:20

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 8/8] vfs: update swap_{,de}activate documentation

Parameters were added to swap_activate in the same patch series that
introduced it without updating the documentation. Additionally, the
documentation claims that non-existent address space operations
swap_{in,out} are used for swap I/O, but it's (now) direct_IO.

Signed-off-by: Omar Sandoval <[email protected]>
---
Documentation/filesystems/Locking | 7 ++++---
Documentation/filesystems/vfs.txt | 7 ++++---
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b30753c..e72b4c3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -205,7 +205,8 @@ prototypes:
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *, struct file *,
+ sector_t *);
int (*swap_deactivate)(struct file *);

locking rules:
@@ -230,8 +231,8 @@ migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
error_remove_page: yes
-swap_activate: no
-swap_deactivate: no
+swap_activate: yes
+swap_deactivate: no

->write_begin(), ->write_end(), ->sync_page() and ->readpage()
may be called from the request handler (/dev/loop).
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 43ce050..a7c7974 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -600,8 +600,9 @@ struct address_space_operations {
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
- int (*swap_deactivate)(struct file *);
+ int (*swap_activate)(struct swap_info_struct *, struct file *,
+ sector_t *);
+ void (*swap_deactivate)(struct file *);
};

writepage: called by the VM to write a dirty page to backing store.
@@ -788,7 +789,7 @@ struct address_space_operations {
memory. A return value of zero indicates success,
in which case this file can be used to back swapspace. The
swapspace operations will be proxied to this address space's
- ->swap_{out,in} methods.
+ ->direct_IO method for both reads and writes.

swap_deactivate: Called during swapoff on files where swap_activate
was successful.
--
2.1.3

2014-12-15 05:29:45

by Omar Sandoval

[permalink] [raw]
Subject: [PATCH 7/8] swap: use direct I/O for SWP_FILE swap_readpage

On Mon, Nov 17, 2014 at 07:48:17AM -0800, Christoph Hellwig wrote:
> With the new iov_iter infrastructure that supprots direct I/O to kernel
> pages please get rid of the ->readpage hack first. I'm still utterly
> disapoined that this crap ever got merged.

Signed-off-by: Omar Sandoval <[email protected]>
---
mm/page_io.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 4741248..956307c 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -346,12 +346,33 @@ int swap_readpage(struct page *page)
}

if (sis->flags & SWP_FILE) {
+ struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
+ struct iov_iter to;
+ struct bio_vec bv = {
+ .bv_page = page,
+ .bv_len = PAGE_SIZE,
+ .bv_offset = 0,
+ };
+
+ iov_iter_bvec(&to, ITER_BVEC | READ, &bv, 1, PAGE_SIZE);

- ret = mapping->a_ops->readpage(swap_file, page);
- if (!ret)
+ init_sync_kiocb(&kiocb, swap_file);
+ kiocb.ki_pos = page_file_offset(page);
+ kiocb.ki_nbytes = PAGE_SIZE;
+
+ ret = mapping->a_ops->direct_IO(READ, &kiocb, &to,
+ kiocb.ki_pos);
+ if (ret == PAGE_SIZE) {
+ SetPageUptodate(page);
count_vm_event(PSWPIN);
+ ret = 0;
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ unlock_page(page);
return ret;
}

--
2.1.3

2014-12-15 06:16:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 3/8] swap: don't add ITER_BVEC flag to direct_IO rw

On Sun, Dec 14, 2014 at 09:26:57PM -0800, Omar Sandoval wrote:
> The rw argument to direct_IO has some ill-defined semantics. Some
> filesystems (e.g., ext4, FAT) decide whether they're doing a write with
> rw == WRITE, but others (e.g., XFS) check rw & WRITE. Let's set a good
> example in the swap file code and say ITER_BVEC belongs in
> iov_iter->flags but not in rw. This caters to the least common
> denominator and avoids a sweeping change of every direct_IO
> implementation for now.

Frankly, this is bogus. If anything, let's just kill the first argument
completely - ->direct_IO() can always pick it from iter->type.

As for catering to the least common denominator... To hell with the lowest
common denominator. How many instances of ->direct_IO() do we have, anyway?
24 in the mainline (and we don't give a flying fuck for out-of-tree code, as
a matter of policy). Moreover, several are of "do nothing" variety.

FWIW, 'rw' is a mess. We used to have this:
READ: O_DIRECT read
WRITE: O_DIRECT write
KERNEL_WRITE: swapout

These days KERNEL_WRITE got replaced with ITER_BVEC | WRITE. The thing is,
we have a bunch of places where we explicitly checked for being _equal_ to
WRITE. I.e. the checks that gave a negative on swapouts. I suspect that most
of them are wrong and should trigger on all writes, including swapouts, but
I really didn't want to dig into that pile of fun back then. That's the
main reason why 'rw' argument has survived at all...

2014-12-15 06:17:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 6/8] nfs: don't dirty ITER_BVEC pages read through direct I/O

On Sun, Dec 14, 2014 at 09:27:00PM -0800, Omar Sandoval wrote:
> As with the generic blockdev code, kernel pages shouldn't be dirtied by
> the direct I/O path.

This really asks for an inlined helper (iter_is_bvec(iter) or something like
that)

2014-12-15 12:49:24

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 1/8] nfs: follow direct I/O write locking convention

On Mon, Dec 15, 2014 at 12:26 AM, Omar Sandoval <[email protected]> wrote:
> The generic callers of direct_IO lock i_mutex before doing a write. NFS
> doesn't use the generic write code, so it doesn't follow this
> convention. This is now a problem because the interface introduced for
> swap-over-NFS calls direct_IO for a write without holding i_mutex, but
> other implementations of direct_IO will expect to have it locked.

I really don't care much about swap-over-NFS performance; that's a
niche usage at best. I _do_ care about O_DIRECT performance, and the
ability to run multiple WRITE calls in parallel.

IOW: Patch NACKed... Please find another solution.

Trond

2014-12-15 15:42:10

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 1/8] nfs: follow direct I/O write locking convention

On Mon, Dec 15, 2014 at 07:49:20AM -0500, Trond Myklebust wrote:
> On Mon, Dec 15, 2014 at 12:26 AM, Omar Sandoval <[email protected]> wrote:
> > The generic callers of direct_IO lock i_mutex before doing a write. NFS
> > doesn't use the generic write code, so it doesn't follow this
> > convention. This is now a problem because the interface introduced for
> > swap-over-NFS calls direct_IO for a write without holding i_mutex, but
> > other implementations of direct_IO will expect to have it locked.
>
> I really don't care much about swap-over-NFS performance; that's a
> niche usage at best. I _do_ care about O_DIRECT performance, and the
> ability to run multiple WRITE calls in parallel.
>
> IOW: Patch NACKed... Please find another solution.
>
> Trond

So the patch formatting doesn't make it completely clear what's going on
here, but here's what the original nfs_file_direct_write code did:

- called with i_mutex unlocked
- collects stats and does some generic checks
- locks i_mutex
- syncs the mapping, schedules the write
- unlocks i_mutex
- waits for the write to complete if synchronous

After this patch, nfs_file_direct_write works like:

- called with i_mutex locked
- collects stats and does some generic checks
- syncs the mapping, schedules the write
- drops i_mutex
- waits for the write to complete if synchronous
- picks i_mutex back up

There's an extra lock and unlock as a result and a slightly longer
critical section, but we drop i_mutex to wait for the write, so multiple
writes still work in parallel.

--
Omar

2014-12-15 15:57:14

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 3/8] swap: don't add ITER_BVEC flag to direct_IO rw

On Mon, Dec 15, 2014 at 06:16:02AM +0000, Al Viro wrote:
> On Sun, Dec 14, 2014 at 09:26:57PM -0800, Omar Sandoval wrote:
> > The rw argument to direct_IO has some ill-defined semantics. Some
> > filesystems (e.g., ext4, FAT) decide whether they're doing a write with
> > rw == WRITE, but others (e.g., XFS) check rw & WRITE. Let's set a good
> > example in the swap file code and say ITER_BVEC belongs in
> > iov_iter->flags but not in rw. This caters to the least common
> > denominator and avoids a sweeping change of every direct_IO
> > implementation for now.
>
> Frankly, this is bogus. If anything, let's just kill the first argument
> completely - ->direct_IO() can always pick it from iter->type.
>
> As for catering to the least common denominator... To hell with the lowest
> common denominator. How many instances of ->direct_IO() do we have, anyway?
> 24 in the mainline (and we don't give a flying fuck for out-of-tree code, as
> a matter of policy). Moreover, several are of "do nothing" variety.
>
> FWIW, 'rw' is a mess. We used to have this:
> READ: O_DIRECT read
> WRITE: O_DIRECT write
> KERNEL_WRITE: swapout
>
> These days KERNEL_WRITE got replaced with ITER_BVEC | WRITE. The thing is,
> we have a bunch of places where we explicitly checked for being _equal_ to
> WRITE. I.e. the checks that gave a negative on swapouts. I suspect that most
> of them are wrong and should trigger on all writes, including swapouts, but
> I really didn't want to dig into that pile of fun back then. That's the
> main reason why 'rw' argument has survived at all...
>
In that case, I'll take a stab at nuking rw. I'm almost certain that
some of these are completely wrong (for example, of the form
if (rw == WRITE) do_write(); else do_read();). This isn't an immediate
problem for swap files on BTRFS, as __blockdev_direct_IO does a bitwise
test, so I think I'll split it out into its own series.

Thanks,
--
Omar

2014-12-15 16:27:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> be held, but most direct_IO implementations do.
I think you are speaking about direct IO writes only, aren't you? For DIO
reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
for direct IO writes. It uses it's internal rwlock for this (see
xfs_file_dio_aio_write()). So I think this is just wrong.

Honza

> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> mm/page_io.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 955db8b..1630ac0 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -263,6 +263,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
> if (sis->flags & SWP_FILE) {
> struct kiocb kiocb;
> struct file *swap_file = sis->swap_file;
> + struct inode *inode = file_inode(swap_file);
> struct address_space *mapping = swap_file->f_mapping;
> struct bio_vec bv = {
> .bv_page = page,
> @@ -283,9 +284,11 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>
> set_page_writeback(page);
> unlock_page(page);
> + mutex_lock(&inode->i_mutex);
> ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
> &kiocb, &from,
> kiocb.ki_pos);
> + mutex_unlock(&inode->i_mutex);
> if (ret == PAGE_SIZE) {
> count_vm_event(PSWPOUT);
> ret = 0;
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-12-15 16:56:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > be held, but most direct_IO implementations do.
> I think you are speaking about direct IO writes only, aren't you? For DIO
> reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> for direct IO writes. It uses it's internal rwlock for this (see
> xfs_file_dio_aio_write()). So I think this is just wrong.

The problem is that the use of ->direct_IO by the swap code is a gross
layering violation. ->direct_IO is a callback for the filesystem, and
the swap code need to call ->read_iter instead of ->readpage and
->write_tier instead of ->direct_IO, and leave the locking to the
filesystem.

2014-12-15 22:11:12

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > be held, but most direct_IO implementations do.
> > I think you are speaking about direct IO writes only, aren't you? For DIO
> > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > for direct IO writes. It uses it's internal rwlock for this (see
> > xfs_file_dio_aio_write()). So I think this is just wrong.
>
> The problem is that the use of ->direct_IO by the swap code is a gross
> layering violation. ->direct_IO is a callback for the filesystem, and
> the swap code need to call ->read_iter instead of ->readpage and
> ->write_tier instead of ->direct_IO, and leave the locking to the
> filesystem.
>
Ok, I got the swap code working with ->read_iter/->write_iter without
too much trouble. I wanted to double check before I submit if there's
any gotchas involved with adding the O_DIRECT flag to a file pointer
after it has been opened -- swapon opens the swapfile before we know if
we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
original open without excluding filesystems that support the old bmap
path but not direct I/O.

--
Omar

2014-12-16 08:35:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Mon, Dec 15, 2014 at 02:11:00PM -0800, Omar Sandoval wrote:
> Ok, I got the swap code working with ->read_iter/->write_iter without
> too much trouble. I wanted to double check before I submit if there's
> any gotchas involved with adding the O_DIRECT flag to a file pointer
> after it has been opened -- swapon opens the swapfile before we know if
> we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
> ->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
> original open without excluding filesystems that support the old bmap
> path but not direct I/O.

In general just adding O_DIRECT is a problem. However given that the
swap file is locked against any other access while in use it seems ok
in this particular case. Just make sure to clear it on swapoff, and
write a detailed comment explaining the situation.

2014-12-16 08:56:30

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Tue, Dec 16, 2014 at 12:35:43AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 02:11:00PM -0800, Omar Sandoval wrote:
> > Ok, I got the swap code working with ->read_iter/->write_iter without
> > too much trouble. I wanted to double check before I submit if there's
> > any gotchas involved with adding the O_DIRECT flag to a file pointer
> > after it has been opened -- swapon opens the swapfile before we know if
> > we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
> > ->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
> > original open without excluding filesystems that support the old bmap
> > path but not direct I/O.
>
> In general just adding O_DIRECT is a problem. However given that the
> swap file is locked against any other access while in use it seems ok
> in this particular case. Just make sure to clear it on swapoff, and
> write a detailed comment explaining the situation.

I'll admit that I'm a bit confused. I want to do this:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e..5145c09 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1728,6 +1728,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
}

if (mapping->a_ops->swap_activate) {
+ if (!mapping->a_ops->direct_IO)
+ return -EINVAL;
+ swap_file->f_flags |= O_DIRECT;
ret = mapping->a_ops->swap_activate(sis, swap_file, span);
if (!ret) {
sis->flags |= SWP_FILE;

This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
calls filp_close on swap_file, so I don't see why it's necessary to
clear the flag.

--
Omar

2014-12-17 08:06:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Tue, Dec 16, 2014 at 12:56:24AM -0800, Omar Sandoval wrote:
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1728,6 +1728,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
> }
>
> if (mapping->a_ops->swap_activate) {
> + if (!mapping->a_ops->direct_IO)
> + return -EINVAL;
> + swap_file->f_flags |= O_DIRECT;
> ret = mapping->a_ops->swap_activate(sis, swap_file, span);
> if (!ret) {
> sis->flags |= SWP_FILE;

This needs to hold swap_file->f_lock, but otherwise looks good.

> This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
> add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
> calls filp_close on swap_file, so I don't see why it's necessary to
> clear the flag.

filp_lose doesn't nessecarily destroy the file structure, there might be
other reference to it, e.g. from dup() or descriptor passing.

2014-12-17 08:20:31

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 12:06:10AM -0800, Christoph Hellwig wrote:

> > This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
> > add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
> > calls filp_close on swap_file, so I don't see why it's necessary to
> > clear the flag.
>
> filp_lose doesn't nessecarily destroy the file structure, there might be
> other reference to it, e.g. from dup() or descriptor passing.

Where the hell would those other references come from? We open the damn
thing in sys_swapon(), never put it into descriptor tables, etc. and
the only reason why we use filp_close() instead of fput() is that we
would miss ->flush() otherwise.

Said that, why not simply *open* it with O_DIRECT to start with and be done
with that? It's not as if those guys came preopened by caller - swapon(2)
gets a pathname and does opening itself.

2014-12-17 08:24:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 08:20:21AM +0000, Al Viro wrote:
> Where the hell would those other references come from? We open the damn
> thing in sys_swapon(), never put it into descriptor tables, etc. and
> the only reason why we use filp_close() instead of fput() is that we
> would miss ->flush() otherwise.
>
> Said that, why not simply *open* it with O_DIRECT to start with and be done
> with that? It's not as if those guys came preopened by caller - swapon(2)
> gets a pathname and does opening itself.

Oops, should have dug deeper into the code. For some reason I assumed
the fd is passed in from userspace.

The suggestion from Al is much better, given that we never do normal
I/O on the swapfile, just the bmap + direct bio submission which I hope
could go away in favor of the direct I/O variant in the long run.

2014-12-17 14:58:39

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 12:24:37AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 17, 2014 at 08:20:21AM +0000, Al Viro wrote:
> > Where the hell would those other references come from? We open the damn
> > thing in sys_swapon(), never put it into descriptor tables, etc. and
> > the only reason why we use filp_close() instead of fput() is that we
> > would miss ->flush() otherwise.
> >
> > Said that, why not simply *open* it with O_DIRECT to start with and be done
> > with that? It's not as if those guys came preopened by caller - swapon(2)
> > gets a pathname and does opening itself.
>
> Oops, should have dug deeper into the code. For some reason I assumed
> the fd is passed in from userspace.
>
> The suggestion from Al is much better, given that we never do normal
> I/O on the swapfile, just the bmap + direct bio submission which I hope
> could go away in favor of the direct I/O variant in the long run.

See my previous message. If we use O_DIRECT on the original open, then
filesystems that implement bmap but not direct_IO will no longer work.
These are the ones that I found in my tree:

adfs
befs
bfs
ecryptfs
efs
freevxfs
hpfs
isofs
minix
ntfs
omfs
qnx4
qnx6
sysv
ufs

Several of these are read only, and I can't imagine that anyone is using
a swapfile on any of the rest, but if someone is, this would be a
regression, wouldn't it?

--
Omar

2014-12-17 18:53:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 06:58:32AM -0800, Omar Sandoval wrote:
> See my previous message. If we use O_DIRECT on the original open, then
> filesystems that implement bmap but not direct_IO will no longer work.
> These are the ones that I found in my tree:

In the long run I don't think they are worth keeping. But to keep you
out of that discussion you can just try an open without O_DIRECT if the
open with the flag failed.

2014-12-17 22:03:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 10:52:56AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 17, 2014 at 06:58:32AM -0800, Omar Sandoval wrote:
> > See my previous message. If we use O_DIRECT on the original open, then
> > filesystems that implement bmap but not direct_IO will no longer work.
> > These are the ones that I found in my tree:
>
> In the long run I don't think they are worth keeping. But to keep you
> out of that discussion you can just try an open without O_DIRECT if the
> open with the flag failed.

Umm... That's one possibility, of course (and if swapon(2) is on someone's
hotpath, I really would like to see what the hell they are doing - it has
to be interesting in a sick way).

Said that, there's an interesting problem with O_DIRECT. It's irrelevant
in this case, but it *can* be changed halfway through e.g write(2) and
AFAICS we have at least some suspicious codepaths. Look at
ext4_file_write_iter(), for example. We check O_DIRECT, then grab some
locks, then proceed to look at the results of that check, do some work...
and call __generic_file_write_iter(), which checks O_DIRECT again. If
it has been cleared (or, probably worse, set) it looks like we'll get
an interesting bunch of holes.

Should we just labda-expand that call of __generic_file_write_iter() and
replace its
if (unlikely(file->f_flags & O_DIRECT)) {
with
if (odirect)
to be guaranteed that it'll match the things we'd done before the call?

I'm looking through the callchains right now in search of similar places
right now, will follow up when I'm done...

BTW, speaking of read/write vs. swap - what's the story with e.g. AFS
write() checking IS_SWAPFILE() and failing with -EBUSY? Note that
* it's done before acquiring i_mutex, so it isn't race-free
* it's dubious from the POSIX POV - EBUSY isn't in the error
list for write(2).
* other filesystems generally don't have anything of that sort.
NFS does, but local ones do not...
Besides, do we even allow swapfiles on AFS?

2014-12-19 06:24:12

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Wed, Dec 17, 2014 at 10:03:13PM +0000, Al Viro wrote:
> On Wed, Dec 17, 2014 at 10:52:56AM -0800, Christoph Hellwig wrote:
> > On Wed, Dec 17, 2014 at 06:58:32AM -0800, Omar Sandoval wrote:
> > > See my previous message. If we use O_DIRECT on the original open, then
> > > filesystems that implement bmap but not direct_IO will no longer work.
> > > These are the ones that I found in my tree:
> >
> > In the long run I don't think they are worth keeping. But to keep you
> > out of that discussion you can just try an open without O_DIRECT if the
> > open with the flag failed.
>
> Umm... That's one possibility, of course (and if swapon(2) is on someone's
> hotpath, I really would like to see what the hell they are doing - it has
> to be interesting in a sick way).

If this is the approach you'd prefer, I'll go ahead and do that for v2.
I personally think it looks pretty kludgey, but I'm fine either way:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55cc..c1b3073 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2379,7 +2379,16 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
name = NULL;
goto bad_swap;
}
- swap_file = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+ swap_file = file_open_name(name, O_RDWR | O_LARGEFILE | O_DIRECT, 0);
+ if (IS_ERR(swap_file) && PTR_ERR(swap_file) == -EINVAL)
+ swap_file = file_open_name(name, O_RDWR | O_LARGEFILE, 0);
if (IS_ERR(swap_file)) {
error = PTR_ERR(swap_file);
swap_file = NULL;

> BTW, speaking of read/write vs. swap - what's the story with e.g. AFS
> write() checking IS_SWAPFILE() and failing with -EBUSY? Note that
> * it's done before acquiring i_mutex, so it isn't race-free
> * it's dubious from the POSIX POV - EBUSY isn't in the error
> list for write(2).
> * other filesystems generally don't have anything of that sort.
> NFS does, but local ones do not...
> Besides, do we even allow swapfiles on AFS?

AFS doesn't implement ->bmap or ->swap_activate, so that code is dead,
probably cargo-culted from the NFS code. It seems pretty pointless, not
only because it's inconsistent with the local filesystems like you
mentioned, but also because it's trivial to bypass with O_DIRECT on NFS:

ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
unsigned long written = 0;
ssize_t result;
size_t count = iov_iter_count(from);
loff_t pos = iocb->ki_pos;

result = nfs_key_timeout_notify(file, inode);
if (result)
return result;

if (file->f_flags & O_DIRECT)
return nfs_file_direct_write(iocb, from, pos);

dprintk("NFS: write(%pD2, %zu@%Ld)\n",
file, count, (long long) pos);

result = -EBUSY;
if (IS_SWAPFILE(inode))
goto out_swapfile;

I think it's safe to scrap that code. However, this also led me to find that
NFS doesn't prevent truncates on an active swapfile. I'm submitting a patch for
that now.

--
Omar

2014-12-19 06:28:35

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Thu, Dec 18, 2014 at 10:24:05PM -0800, Omar Sandoval wrote:
> + swap_file = file_open_name(name, O_RDWR | O_LARGEFILE | O_DIRECT, 0);
> + if (IS_ERR(swap_file) && PTR_ERR(swap_file) == -EINVAL)

ITYM if (swap_file == ERR_PTR(-EINVAL)) here...

2014-12-20 06:51:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > be held, but most direct_IO implementations do.
> > I think you are speaking about direct IO writes only, aren't you? For DIO
> > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > for direct IO writes. It uses it's internal rwlock for this (see
> > xfs_file_dio_aio_write()). So I think this is just wrong.
>
> The problem is that the use of ->direct_IO by the swap code is a gross
> layering violation. ->direct_IO is a callback for the filesystem, and
> the swap code need to call ->read_iter instead of ->readpage and
> ->write_tier instead of ->direct_IO, and leave the locking to the
> filesystem.

The thing is, ->read_iter() and ->write_iter() might decide to fall back to
buffered IO path. XFS is unusual in that respect - there O_DIRECT ends up
with short write in such case. Other filesystems, OTOH...

2014-12-22 07:26:44

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Sat, Dec 20, 2014 at 06:51:33AM +0000, Al Viro wrote:
> On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > > be held, but most direct_IO implementations do.
> > > I think you are speaking about direct IO writes only, aren't you? For DIO
> > > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > > for direct IO writes. It uses it's internal rwlock for this (see
> > > xfs_file_dio_aio_write()). So I think this is just wrong.
> >
> > The problem is that the use of ->direct_IO by the swap code is a gross
> > layering violation. ->direct_IO is a callback for the filesystem, and
> > the swap code need to call ->read_iter instead of ->readpage and
> > ->write_tier instead of ->direct_IO, and leave the locking to the
> > filesystem.
>
> The thing is, ->read_iter() and ->write_iter() might decide to fall back to
> buffered IO path. XFS is unusual in that respect - there O_DIRECT ends up
> with short write in such case. Other filesystems, OTOH...

Alright, now what? Using ->direct_IO directly is pretty much a no go
because of the different locking conventions as was pointed out. Maybe
some "no, really, just direct I/O" iocb flag?

--
Omar

2014-12-23 09:37:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/8] swap: lock i_mutex for swap_writepage direct_IO

On Sat, Dec 20, 2014 at 06:51:33AM +0000, Al Viro wrote:
> > The problem is that the use of ->direct_IO by the swap code is a gross
> > layering violation. ->direct_IO is a callback for the filesystem, and
> > the swap code need to call ->read_iter instead of ->readpage and
> > ->write_tier instead of ->direct_IO, and leave the locking to the
> > filesystem.
>
> The thing is, ->read_iter() and ->write_iter() might decide to fall back to
> buffered IO path. XFS is unusual in that respect - there O_DIRECT ends up
> with short write in such case. Other filesystems, OTOH...

We'll just need a ->swap_activate method that makes sure we really do
direct I/O. For all filesystems currently suporting swap just checking
that all blocks are allocated (as the ->bmap path already does) should
be enough.