LinuxLists.cc - [PATCH v2 0/9] pnfs: layout pipelining

2016-05-11 10:21:24

Subject: [PATCH v2 0/9] pnfs: layout pipelining

v2:
- rework of LAYOUTGET retry handling.

This is v2 of the layout pipelining work. I've done some testing with
it, driving read and write I/Os against a file while forcing recalls to
occur. With this set, the client can successfully issue new LAYOUTGET
calls to handle new page I/Os while the still draining I/Os to the old
layout.

The main change from the last set is a rework of how we handle retryable
errors in LAYOUTGET. With the current code, once it has made a decision
to issue a LAYOUTGET, it will drive that RPC until completion. That's
not really ideal though. A usable layout could have shown up in the list
while the RPC was still in flight.

What we really want to do when we get back a retryable error is to
re-search for the layout from scratch. Thus, the last patch lifts the
code to retry the LAYOUTGET into pnfs_update_layout, so it can make an
intelligent decision about whether it's even necessary to reissue the
RPC at all.

Original cover letter from the RFC patchset follows:

--------------------------[snip]----------------------------

At Primary Data, one of the things we're most interested in is data
mobility. IOW, we want to be able to change the layout for an inode
seamlessly, with little interruption to I/O patterns.

The problem we have now is that CB_LAYOUTRECALLs interrupt I/O. When one
comes in, most pNFS servers refuse to hand out new layouts until the
recalled ones have been returned (or the client indicates that it no
longer knows about them). It doesn't have to be this way though. RFC5661
allows for concurrent LAYOUTGET and LAYOUTRETURN calls.

Furthermore, servers are expected to deal with old stateids in
LAYOUTRETURN. From RFC5661, section 18.44.3:

If the client returns the layout in response to a CB_LAYOUTRECALL
where the lor_recalltype field of the clora_recall field was
LAYOUTRECALL4_FILE, the client should use the lor_stateid value from
CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should
use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid
(from a previous LAYRETURN result). This is done to indicate the
point in time (in terms of layout stateid transitions) when the
recall was sent.

The way I'm interpreting this is that we can treat a LAYOUTRETURN with
an old stateid as returning all layouts that matched the given iomode,
at the time that that seqid was current.

With that, we can allow a LAYOUTGET on the same fh to proceed even when
there are still recalled layouts outstanding. This should allow the
client to pivot to a new layout while it's still draining I/Os
that are pinning the ones to be returned.

This patchset is a first draft of the client side piece that allows
this. Basically whenever we get a new layout segment, we'll tag it with
the seqid that was in the LAYOUTGET stateid that grants it.

When a CB_LAYOUTRECALL comes in, we tag the return seqid in the layout
header with the one that was in the request. When we do a LAYOUTRETURN
in response to a CB_LAYOUTRECALL, we craft the seqid such that we're
only returning the layouts that were recalled. Nothing that has been
granted since then will be returned.

I think I've done this in a way that the existing behavior is preserved
in the case where the server enforces the serialization of these
operations, but please do have a look and let me know if you see any
potential problems here. Testing this is still a WIP...

Jeff Layton (9):
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit
set
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
pnfs: fix bad error handling in send_layoutget
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: rework LAYOUTGET retry handling

fs/nfs/callback_proc.c | 3 +-
fs/nfs/flexfilelayout/flexfilelayout.c | 6 +-
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 26 +++-
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs4proc.c | 68 +++------
fs/nfs/pnfs.c | 243 +++++++++++++++++-------------
fs/nfs/pnfs.h | 12 +-
include/linux/nfs_xdr.h | 3 +-
8 files changed, 184 insertions(+), 179 deletions(-)

--
2.5.5

2016-05-11 10:21:25

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 1/9] pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set

Otherwise, we'll end up returning layouts that we've just received if
the client issues a new LAYOUTGET prior to the LAYOUTRETURN.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 60d690dbc947..08cd5f856321 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -298,6 +298,8 @@ ff_lseg_merge(struct pnfs_layout_segment *new,
{
u64 new_end, old_end;

+ if (test_bit(NFS_LSEG_LAYOUTRETURN, &old->pls_flags))
+ return false;
if (new->pls_range.iomode != old->pls_range.iomode)
return false;
old_end = pnfs_calc_offset_end(old->pls_range.offset,
@@ -318,8 +320,6 @@ ff_lseg_merge(struct pnfs_layout_segment *new,
new_end);
if (test_bit(NFS_LSEG_ROC, &old->pls_flags))
set_bit(NFS_LSEG_ROC, &new->pls_flags);
- if (test_bit(NFS_LSEG_LAYOUTRETURN, &old->pls_flags))
- set_bit(NFS_LSEG_LAYOUTRETURN, &new->pls_flags);
return true;
}

--
2.5.5

2016-05-11 10:21:27

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 3/9] pnfs: keep track of the return sequence number in pnfs_layout_hdr

When we want to selectively do a LAYOUTRETURN, we need to specify a
stateid that represents most recent layout acquisition that is to be
returned.

When we mark a layout stateid to be returned, we update the return
sequence number in the layout header with that value, if it's newer
than the existing one. Then, when we go to do a LAYOUTRETURN on
layout header put, we overwrite the seqid in the stateid with the
saved one, and then zero it out.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 11 ++++++++---
fs/nfs/pnfs.h | 1 +
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 18b6f950e300..39432a3705b4 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -899,6 +899,7 @@ pnfs_prepare_layoutreturn(struct pnfs_layout_hdr *lo)
if (test_and_set_bit(NFS_LAYOUT_RETURN, &lo->plh_flags))
return false;
lo->plh_return_iomode = 0;
+ lo->plh_return_seq = 0;
pnfs_get_layout_hdr(lo);
clear_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags);
return true;
@@ -969,6 +970,7 @@ static void pnfs_layoutreturn_before_put_layout_hdr(struct pnfs_layout_hdr *lo)
bool send;

nfs4_stateid_copy(&stateid, &lo->plh_stateid);
+ stateid.seqid = cpu_to_be32(lo->plh_return_seq);
iomode = lo->plh_return_iomode;
send = pnfs_prepare_layoutreturn(lo);
spin_unlock(&inode->i_lock);
@@ -1747,7 +1749,8 @@ out_forget_reply:
}

static void
-pnfs_set_plh_return_iomode(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
+pnfs_set_plh_return_info(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode,
+ u32 seq)
{
if (lo->plh_return_iomode == iomode)
return;
@@ -1755,6 +1758,8 @@ pnfs_set_plh_return_iomode(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
iomode = IOMODE_ANY;
lo->plh_return_iomode = iomode;
set_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags);
+ if (!lo->plh_return_seq || pnfs_seqid_is_newer(seq, lo->plh_return_seq))
+ lo->plh_return_seq = seq;
}

/**
@@ -1793,7 +1798,7 @@ pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
continue;
remaining++;
set_bit(NFS_LSEG_LAYOUTRETURN, &lseg->pls_flags);
- pnfs_set_plh_return_iomode(lo, return_range->iomode);
+ pnfs_set_plh_return_info(lo, return_range->iomode, lseg->pls_seq);
}
return remaining;
}
@@ -1811,7 +1816,7 @@ void pnfs_error_mark_layout_for_return(struct inode *inode,
bool return_now = false;

spin_lock(&inode->i_lock);
- pnfs_set_plh_return_iomode(lo, range.iomode);
+ pnfs_set_plh_return_info(lo, range.iomode, lseg->pls_seq);
/*
* mark all matching lsegs so that we are sure to have no live
* segments at hand when sending layoutreturn. See pnfs_put_lseg()
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 361fa5494aa5..3476c9850678 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -195,6 +195,7 @@ struct pnfs_layout_hdr {
unsigned long plh_flags;
nfs4_stateid plh_stateid;
u32 plh_barrier; /* ignore lower seqids */
+ u32 plh_return_seq;
enum pnfs_iomode plh_return_iomode;
loff_t plh_lwb; /* last write byte for layoutcommit */
struct rpc_cred *plh_lc_cred; /* layoutcommit cred */
--
2.5.5

2016-05-11 10:21:26

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 2/9] pnfs: record sequence in pnfs_layout_segment when it's created

In later patches, we're going to teach the client to be more selective
about how it returns layouts. This means keeping a record of what the
stateid's seqid was at the time that the server handed out a layout
segment.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 1 +
fs/nfs/pnfs.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 5b404d926e08..18b6f950e300 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1697,6 +1697,7 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)

init_lseg(lo, lseg);
lseg->pls_range = res->range;
+ lseg->pls_seq = be32_to_cpu(res->stateid.seqid);

spin_lock(&ino->i_lock);
if (pnfs_layoutgets_blocked(lo)) {
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 7222d3a35439..361fa5494aa5 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -64,6 +64,7 @@ struct pnfs_layout_segment {
struct list_head pls_lc_list;
struct pnfs_layout_range pls_range;
atomic_t pls_refcount;
+ u32 pls_seq;
unsigned long pls_flags;
struct pnfs_layout_hdr *pls_layout;
struct work_struct pls_work;
--
2.5.5

2016-05-11 10:21:30

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 6/9] flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
index 5eff05e9eb4d..61094d59ee3a 100644
--- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
+++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
@@ -342,7 +342,23 @@ out:
return fh;
}

-/* Upon return, either ds is connected, or ds is NULL */
+/**
+ * nfs4_ff_layout_prepare_ds - prepare a DS connection for an RPC call
+ * @lseg: the layout segment we're operating on
+ * @ds_idx: index of the DS to use
+ * @fail_return: return layout on connect failure?
+ *
+ * Try to prepare a DS connection to accept an RPC call. This involves
+ * selecting a mirror to use and connecting the client to it if it's not
+ * already connected.
+ *
+ * Since we only need single functioning mirror to satisfy a read, we don't
+ * want to return the layout if there is one. For writes though, any down
+ * mirror should result in a LAYOUTRETURN. @fail_return is how we distinguish
+ * between the two cases.
+ *
+ * Returns a pointer to a connected DS object on success or NULL on failure.
+ */
struct nfs4_pnfs_ds *
nfs4_ff_layout_prepare_ds(struct pnfs_layout_segment *lseg, u32 ds_idx,
bool fail_return)
--
2.5.5

2016-05-11 10:21:31

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 8/9] pnfs: lift retry logic from send_layoutget to pnfs_update_layout

If we get back something like NFS4ERR_OLD_STATEID, that will be
translated into -EAGAIN, and the do/while loop in send_layoutget
will drive the call again.

This is not quite what we want, I think. An error like that is a
sign that something has changed. That something could have been a
concurrent LAYOUTGET that would give us a usable lseg.

Lift the retry logic into pnfs_update_layout instead. That allows
us to redo the layout search, and may spare us from having to issue
an RPC.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 67 ++++++++++++++++++++++++++++++-----------------------------
1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 5f6ed295acb5..ed3ab3e81f38 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -839,7 +839,6 @@ send_layoutget(struct pnfs_layout_hdr *lo,
struct inode *ino = lo->plh_inode;
struct nfs_server *server = NFS_SERVER(ino);
struct nfs4_layoutget *lgp;
- struct pnfs_layout_segment *lseg;
loff_t i_size;

dprintk("--> %s\n", __func__);
@@ -849,42 +848,30 @@ send_layoutget(struct pnfs_layout_hdr *lo,
* store in lseg. If we race with a concurrent seqid morphing
* op, then re-send the LAYOUTGET.
*/
- do {
- lgp = kzalloc(sizeof(*lgp), gfp_flags);
- if (lgp == NULL)
- return NULL;
-
- i_size = i_size_read(ino);
-
- lgp->args.minlength = PAGE_SIZE;
- if (lgp->args.minlength > range->length)
- lgp->args.minlength = range->length;
- if (range->iomode == IOMODE_READ) {
- if (range->offset >= i_size)
- lgp->args.minlength = 0;
- else if (i_size - range->offset < lgp->args.minlength)
- lgp->args.minlength = i_size - range->offset;
- }
- lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
- pnfs_copy_range(&lgp->args.range, range);
- lgp->args.type = server->pnfs_curr_ld->id;
- lgp->args.inode = ino;
- lgp->args.ctx = get_nfs_open_context(ctx);
- lgp->gfp_flags = gfp_flags;
- lgp->cred = lo->plh_lc_cred;
+ lgp = kzalloc(sizeof(*lgp), gfp_flags);
+ if (lgp == NULL)
+ return NULL;

- lseg = nfs4_proc_layoutget(lgp, gfp_flags);
- } while (lseg == ERR_PTR(-EAGAIN));
+ i_size = i_size_read(ino);

- if (IS_ERR(lseg)) {
- if (!nfs_error_is_fatal(PTR_ERR(lseg)))
- lseg = NULL;
- } else {
- pnfs_layout_clear_fail_bit(lo,
- pnfs_iomode_to_fail_bit(range->iomode));
+ lgp->args.minlength = PAGE_SIZE;
+ if (lgp->args.minlength > range->length)
+ lgp->args.minlength = range->length;
+ if (range->iomode == IOMODE_READ) {
+ if (range->offset >= i_size)
+ lgp->args.minlength = 0;
+ else if (i_size - range->offset < lgp->args.minlength)
+ lgp->args.minlength = i_size - range->offset;
}
+ lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
+ pnfs_copy_range(&lgp->args.range, range);
+ lgp->args.type = server->pnfs_curr_ld->id;
+ lgp->args.inode = ino;
+ lgp->args.ctx = get_nfs_open_context(ctx);
+ lgp->gfp_flags = gfp_flags;
+ lgp->cred = lo->plh_lc_cred;

- return lseg;
+ return nfs4_proc_layoutget(lgp, gfp_flags);
}

static void pnfs_clear_layoutcommit(struct inode *inode,
@@ -1646,6 +1633,20 @@ lookup_again:
arg.length = PAGE_ALIGN(arg.length);

lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
+ if (IS_ERR(lseg)) {
+ if (lseg == ERR_PTR(-EAGAIN)) {
+ if (first)
+ pnfs_clear_first_layoutget(lo);
+ pnfs_put_layout_hdr(lo);
+ goto lookup_again;
+ }
+
+ if (!nfs_error_is_fatal(PTR_ERR(lseg)))
+ lseg = NULL;
+ } else {
+ pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
+ }
+
atomic_dec(&lo->plh_outstanding);
trace_pnfs_update_layout(ino, pos, count, iomode, lo,
PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
--
2.5.5

2016-05-11 10:21:29

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 5/9] flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED

Setting just the NFS_LAYOUT_RETURN_REQUESTED flag doesn't do anything,
unless there are lsegs that are also being marked for return. At the
point where that happens this flag is also set, so these set_bit calls
don't do anything useful.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 2 --
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 8 +-------
2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 08cd5f856321..7955cb7862d4 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1249,8 +1249,6 @@ static int ff_layout_read_done_cb(struct rpc_task *task,
hdr->pgio_mirror_idx + 1,
&hdr->pgio_mirror_idx))
goto out_eagain;
- set_bit(NFS_LAYOUT_RETURN_REQUESTED,
- &hdr->lseg->pls_layout->plh_flags);
pnfs_read_resend_pnfs(hdr);
return task->tk_status;
case -NFS4ERR_RESET_TO_MDS:
diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
index 56296f3df19c..5eff05e9eb4d 100644
--- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
+++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
@@ -393,13 +393,7 @@ nfs4_ff_layout_prepare_ds(struct pnfs_layout_segment *lseg, u32 ds_idx,
mirror, lseg->pls_range.offset,
lseg->pls_range.length, NFS4ERR_NXIO,
OP_ILLEGAL, GFP_NOIO);
- if (!fail_return) {
- if (ff_layout_has_available_ds(lseg))
- set_bit(NFS_LAYOUT_RETURN_REQUESTED,
- &lseg->pls_layout->plh_flags);
- else
- pnfs_error_mark_layout_for_return(ino, lseg);
- } else
+ if (fail_return || !ff_layout_has_available_ds(lseg))
pnfs_error_mark_layout_for_return(ino, lseg);
ds = NULL;
}
--
2.5.5

2016-05-11 10:21:28

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 4/9] pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args

LAYOUTRETURN is "special" in that servers and clients are expected to
work with old stateids. When the client sends a LAYOUTRETURN with an old
stateid in it then the server is expected to only tear down layout
segments that were present when that seqid was current. Ensure that the
client handles its accounting accordingly.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/callback_proc.c | 3 ++-
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs4proc.c | 5 ++--
fs/nfs/pnfs.c | 64 +++++++++++++++++++++++++++++++++-----------------
fs/nfs/pnfs.h | 6 +++--
5 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 618ced381a14..755838df9996 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -217,7 +217,8 @@ static u32 initiate_file_draining(struct nfs_client *clp,
}

if (pnfs_mark_matching_lsegs_return(lo, &free_me_list,
- &args->cbl_range)) {
+ &args->cbl_range,
+ be32_to_cpu(args->cbl_stateid.seqid))) {
rv = NFS4_OK;
goto unlock;
}
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index dff83460e5a6..198bcc3e103d 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -232,7 +232,7 @@ nfs42_layoutstat_done(struct rpc_task *task, void *calldata)
* with the current stateid.
*/
set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- pnfs_mark_matching_lsegs_invalid(lo, &head, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, NULL, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
} else
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index bc2676c95e1b..c0d75be8cb69 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7930,7 +7930,7 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
* with the current stateid.
*/
set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- pnfs_mark_matching_lsegs_invalid(lo, &head, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, NULL, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
} else
@@ -8122,7 +8122,8 @@ static void nfs4_layoutreturn_release(void *calldata)

dprintk("--> %s\n", __func__);
spin_lock(&lo->plh_inode->i_lock);
- pnfs_mark_matching_lsegs_invalid(lo, &freeme, &lrp->args.range);
+ pnfs_mark_matching_lsegs_invalid(lo, &freeme, &lrp->args.range,
+ be32_to_cpu(lrp->args.stateid.seqid));
pnfs_mark_layout_returned_if_empty(lo);
if (lrp->res.lrs_present)
pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 39432a3705b4..e6cad5ee5d29 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -270,7 +270,7 @@ pnfs_mark_layout_stateid_invalid(struct pnfs_layout_hdr *lo,
};

set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- return pnfs_mark_matching_lsegs_invalid(lo, lseg_list, &range);
+ return pnfs_mark_matching_lsegs_invalid(lo, lseg_list, &range, 0);
}

static int
@@ -308,7 +308,7 @@ pnfs_layout_io_set_failed(struct pnfs_layout_hdr *lo, u32 iomode)

spin_lock(&inode->i_lock);
pnfs_layout_set_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
- pnfs_mark_matching_lsegs_invalid(lo, &head, &range);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, &range, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
dprintk("%s Setting layout IOMODE_%s fail bit\n", __func__,
@@ -522,13 +522,35 @@ static int mark_lseg_invalid(struct pnfs_layout_segment *lseg,
return rv;
}

-/* Returns count of number of matching invalid lsegs remaining in list
- * after call.
+/*
+ * Compare 2 layout stateid sequence ids, to see which is newer,
+ * taking into account wraparound issues.
+ */
+static bool pnfs_seqid_is_newer(u32 s1, u32 s2)
+{
+ return (s32)(s1 - s2) > 0;
+}
+
+/**
+ * pnfs_mark_matching_lsegs_invalid - tear down lsegs or mark them for later
+ * @lo: layout header containing the lsegs
+ * @tmp_list: list head where doomed lsegs should go
+ * @recall_range: optional recall range argument to match (may be NULL)
+ * @seq: only invalidate lsegs obtained prior to this sequence (may be 0)
+ *
+ * Walk the list of lsegs in the layout header, and tear down any that should
+ * be destroyed. If "recall_range" is specified then the segment must match
+ * that range. If "seq" is non-zero, then only match segments that were handed
+ * out at or before that sequence.
+ *
+ * Returns number of matching invalid lsegs remaining in list after scanning
+ * it and purging them.
*/
int
pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range)
+ const struct pnfs_layout_range *recall_range,
+ u32 seq)
{
struct pnfs_layout_segment *lseg, *next;
int remaining = 0;
@@ -540,10 +562,12 @@ pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
list_for_each_entry_safe(lseg, next, &lo->plh_segs, pls_list)
if (!recall_range ||
should_free_lseg(&lseg->pls_range, recall_range)) {
- dprintk("%s: freeing lseg %p iomode %d "
+ if (seq && pnfs_seqid_is_newer(lseg->pls_seq, seq))
+ continue;
+ dprintk("%s: freeing lseg %p iomode %d seq %u"
"offset %llu length %llu\n", __func__,
- lseg, lseg->pls_range.iomode, lseg->pls_range.offset,
- lseg->pls_range.length);
+ lseg, lseg->pls_range.iomode, lseg->pls_seq,
+ lseg->pls_range.offset, lseg->pls_range.length);
if (!mark_lseg_invalid(lseg, tmp_list))
remaining++;
}
@@ -730,15 +754,6 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
pnfs_destroy_layouts_byclid(clp, false);
}

-/*
- * Compare 2 layout stateid sequence ids, to see which is newer,
- * taking into account wraparound issues.
- */
-static bool pnfs_seqid_is_newer(u32 s1, u32 s2)
-{
- return (s32)(s1 - s2) > 0;
-}
-
/* update lo->plh_stateid with new if is more recent */
void
pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo, const nfs4_stateid *new,
@@ -1014,7 +1029,7 @@ _pnfs_return_layout(struct inode *ino)
pnfs_get_layout_hdr(lo);
empty = list_empty(&lo->plh_segs);
pnfs_clear_layoutcommit(ino, &tmp_list);
- pnfs_mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &tmp_list, NULL, 0);

if (NFS_SERVER(ino)->pnfs_curr_ld->return_range) {
struct pnfs_layout_range range = {
@@ -1721,7 +1736,7 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
* inode invalid, and don't bother validating the stateid
* sequence number.
*/
- pnfs_mark_matching_lsegs_invalid(lo, &free_me, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &free_me, NULL, 0);

nfs4_stateid_copy(&lo->plh_stateid, &res->stateid);
lo->plh_barrier = be32_to_cpu(res->stateid.seqid);
@@ -1775,7 +1790,8 @@ pnfs_set_plh_return_info(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode,
int
pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *return_range)
+ const struct pnfs_layout_range *return_range,
+ u32 seq)
{
struct pnfs_layout_segment *lseg, *next;
int remaining = 0;
@@ -1798,8 +1814,11 @@ pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
continue;
remaining++;
set_bit(NFS_LSEG_LAYOUTRETURN, &lseg->pls_flags);
- pnfs_set_plh_return_info(lo, return_range->iomode, lseg->pls_seq);
}
+
+ if (remaining)
+ pnfs_set_plh_return_info(lo, return_range->iomode, seq);
+
return remaining;
}

@@ -1822,7 +1841,8 @@ void pnfs_error_mark_layout_for_return(struct inode *inode,
* segments at hand when sending layoutreturn. See pnfs_put_lseg()
* for how it works.
*/
- if (!pnfs_mark_matching_lsegs_return(lo, &free_me, &range)) {
+ if (!pnfs_mark_matching_lsegs_return(lo, &free_me,
+ &range, lseg->pls_seq)) {
nfs4_stateid stateid;
enum pnfs_iomode iomode = lo->plh_return_iomode;

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 3476c9850678..971068b58647 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -266,10 +266,12 @@ int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
struct nfs4_state *open_state);
int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range);
+ const struct pnfs_layout_range *recall_range,
+ u32 seq);
int pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range);
+ const struct pnfs_layout_range *recall_range,
+ u32 seq);
bool pnfs_roc(struct inode *ino);
void pnfs_roc_release(struct inode *ino);
void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
--
2.5.5

2016-05-11 10:21:32

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 9/9] pnfs: rework LAYOUTGET retry handling

There are several problems in the way a stateid is selected for a
LAYOUTGET operation:

First, the list_empty test in pnfs_update_layout is not reliable. We may
have lsegs on the list that are in the process of being returned.

Also, we pick a stateid to use in the rpc_prepare op, but that makes
it difficult to serialize LAYOUTGETs that use the open stateid. That
serialization is done in pnfs_update_layout, which occurs well before
rpc_prepare.

Between those two events, the i_lock is dropped and reacquired.
pnfs_update_layout can find that the list has lsegs in it and not do any
serialization, but then later pnfs_choose_layoutget_stateid ends up
choosing the open stateid.

Fix this by selecting the stateid to use in the LAYOUTGET earlier,
when we're searching for a usable layout segment. This way we can
do it all while holding the i_lock the first time, and ensure that
we serialize any LAYOUTGET call that uses a non-layout stateid.

This also means a rework of how retryable errors are handled as we must
update the stateid if we need to retransmit.

Currently, when we need to retry a LAYOUTGET because of an error, we
drive that retry via the rpc state machine. That's not really ideal
though. Most of those errors boil down to the fact that the layout state
has changed in some fashion. Thus, what we really want to do is to
re-search for a layout and take it from there.

Fix nfs4_layoutget_done not to requeue the rpc_task on a retryable
error, but rather to just set the tk_status to -EAGAIN. This tells the
upper layers to retry the whole thing from scratch, searching for an
lseg, and then re-driving the LAYOUTGET.

In order to handle errors like NFS4ERR_DELAY properly, we must also pass
a pointer to the timeout field, that is now moved to the stack in
pnfs_update_layout.

The complicating factor is -NFS4ERR_RECALLCONFLICT, as that involves a
timeout after which we give up and return NULL back to the caller. So,
take special care not to clobber that error with -EAGAIN, so the upper
layers can distinguish between the two.

Finally, when send_layoutget gets back an -EAGAIN, and the timeout is
set, have it just sleep that long via schedule_timeout.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/nfs4proc.c | 63 +++++++-----------------------
fs/nfs/pnfs.c | 102 ++++++++++++++++++++++++------------------------
fs/nfs/pnfs.h | 4 --
include/linux/nfs_xdr.h | 3 +-
4 files changed, 67 insertions(+), 105 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index c0d75be8cb69..4359217ba813 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -416,6 +416,7 @@ static int nfs4_do_handle_exception(struct nfs_server *server,
case -NFS4ERR_DELAY:
nfs_inc_server_stats(server, NFSIOS_DELAY);
case -NFS4ERR_GRACE:
+ case -NFS4ERR_RECALLCONFLICT:
exception->delay = 1;
return 0;

@@ -7824,23 +7825,11 @@ nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
struct nfs4_layoutget *lgp = calldata;
struct nfs_server *server = NFS_SERVER(lgp->args.inode);
struct nfs4_session *session = nfs4_get_session(server);
- int ret;

dprintk("--> %s\n", __func__);
- /* Note the is a race here, where a CB_LAYOUTRECALL can come in
- * right now covering the LAYOUTGET we are about to send.
- * However, that is not so catastrophic, and there seems
- * to be no way to prevent it completely.
- */
- if (nfs41_setup_sequence(session, &lgp->args.seq_args,
- &lgp->res.seq_res, task))
- return;
- ret = pnfs_choose_layoutget_stateid(&lgp->args.stateid,
- NFS_I(lgp->args.inode)->layout,
- &lgp->args.range,
- lgp->args.ctx->state);
- if (ret < 0)
- rpc_exit(task, ret);
+ nfs41_setup_sequence(session, &lgp->args.seq_args,
+ &lgp->res.seq_res, task);
+ dprintk("<-- %s\n", __func__);
}

static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
@@ -7850,7 +7839,6 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
struct nfs_server *server = NFS_SERVER(inode);
struct pnfs_layout_hdr *lo;
struct nfs4_state *state = NULL;
- unsigned long timeo, now, giveup;

dprintk("--> %s tk_status => %d\n", __func__, -task->tk_status);

@@ -7883,35 +7871,9 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
case -NFS4ERR_LAYOUTTRYLATER:
if (lgp->args.minlength == 0)
goto out_overflow;
- /*
- * NFS4ERR_RECALLCONFLICT is when conflict with self (must recall
- * existing layout before getting a new one).
- */
- case -NFS4ERR_RECALLCONFLICT:
- timeo = rpc_get_timeout(task->tk_client);
- giveup = lgp->args.timestamp + timeo;
- now = jiffies;
- if (time_after(giveup, now)) {
- unsigned long delay;
-
- /* Delay for:
- * - Not less then NFS4_POLL_RETRY_MIN.
- * - One last time a jiffie before we give up
- * - exponential backoff (time_now minus start_attempt)
- */
- delay = max_t(unsigned long, NFS4_POLL_RETRY_MIN,
- min((giveup - now - 1),
- now - lgp->args.timestamp));
-
- dprintk("%s: NFS4ERR_RECALLCONFLICT waiting %lu\n",
- __func__, delay);
- rpc_delay(task, delay);
- /* Do not call nfs4_async_handle_error() */
- goto out_restart;
- }
- break;
case -NFS4ERR_EXPIRED:
case -NFS4ERR_BAD_STATEID:
+ *lgp->timeout = 0;
spin_lock(&inode->i_lock);
if (nfs4_stateid_match(&lgp->args.stateid,
&lgp->args.ctx->state->stateid)) {
@@ -7937,15 +7899,21 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
spin_unlock(&inode->i_lock);
goto out_restart;
}
- if (nfs4_async_handle_error(task, server, state, &lgp->timeout) == -EAGAIN)
+
+ /*
+ * Don't clobber -NFS4ERR_RECALLCONFLICT, as we want the upper layers
+ * to be able to distinguish between that and -EAGAIN so that we can
+ * properly give up in the RECALLCONFLICT case.
+ */
+ if (nfs4_async_handle_error(task, server, state, lgp->timeout) == -EAGAIN &&
+ task->tk_status != -NFS4ERR_RECALLCONFLICT)
goto out_restart;
out:
dprintk("<-- %s\n", __func__);
return;
out_restart:
- task->tk_status = 0;
- rpc_restart_call_prepare(task);
- return;
+ task->tk_status = -EAGAIN;
+ goto out;
out_overflow:
task->tk_status = -EOVERFLOW;
goto out;
@@ -8050,7 +8018,6 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
return ERR_PTR(-ENOMEM);
}
lgp->args.layout.pglen = max_pages * PAGE_SIZE;
- lgp->args.timestamp = jiffies;

lgp->res.layoutp = &lgp->args.layout;
lgp->res.seq_res.sr_slot = NULL;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index ed3ab3e81f38..80259bfefec4 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -796,45 +796,12 @@ pnfs_layoutgets_blocked(const struct pnfs_layout_hdr *lo)
test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags);
}

-int
-pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,
- const struct pnfs_layout_range *range,
- struct nfs4_state *open_state)
-{
- int status = 0;
-
- dprintk("--> %s\n", __func__);
- spin_lock(&lo->plh_inode->i_lock);
- if (pnfs_layoutgets_blocked(lo)) {
- status = -EAGAIN;
- } else if (!nfs4_valid_open_stateid(open_state)) {
- status = -EBADF;
- } else if (list_empty(&lo->plh_segs) ||
- test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
- int seq;
-
- do {
- seq = read_seqbegin(&open_state->seqlock);
- nfs4_stateid_copy(dst, &open_state->stateid);
- } while (read_seqretry(&open_state->seqlock, seq));
- } else
- nfs4_stateid_copy(dst, &lo->plh_stateid);
- spin_unlock(&lo->plh_inode->i_lock);
- dprintk("<-- %s\n", __func__);
- return status;
-}
-
-/*
-* Get layout from server.
-* for now, assume that whole file layouts are requested.
-* arg->offset: 0
-* arg->length: all ones
-*/
static struct pnfs_layout_segment *
send_layoutget(struct pnfs_layout_hdr *lo,
struct nfs_open_context *ctx,
+ nfs4_stateid *stateid,
const struct pnfs_layout_range *range,
- gfp_t gfp_flags)
+ long *timeout, gfp_t gfp_flags)
{
struct inode *ino = lo->plh_inode;
struct nfs_server *server = NFS_SERVER(ino);
@@ -868,8 +835,10 @@ send_layoutget(struct pnfs_layout_hdr *lo,
lgp->args.type = server->pnfs_curr_ld->id;
lgp->args.inode = ino;
lgp->args.ctx = get_nfs_open_context(ctx);
+ nfs4_stateid_copy(&lgp->args.stateid, stateid);
lgp->gfp_flags = gfp_flags;
lgp->cred = lo->plh_lc_cred;
+ lgp->timeout = timeout;

return nfs4_proc_layoutget(lgp, gfp_flags);
}
@@ -1511,11 +1480,14 @@ pnfs_update_layout(struct inode *ino,
.offset = pos,
.length = count,
};
- unsigned pg_offset;
+ unsigned pg_offset, seq;
struct nfs_server *server = NFS_SERVER(ino);
struct nfs_client *clp = server->nfs_client;
struct pnfs_layout_hdr *lo;
struct pnfs_layout_segment *lseg = NULL;
+ nfs4_stateid stateid;
+ long timeout = 0;
+ unsigned long giveup = jiffies + rpc_get_timeout(server->client);
bool first;

if (!pnfs_enabled_sb(NFS_SERVER(ino))) {
@@ -1562,9 +1534,28 @@ lookup_again:
goto out_unlock;
}

- first = list_empty(&lo->plh_segs);
- if (first) {
- /* The first layoutget for the file. Need to serialize per
+ lseg = pnfs_find_lseg(lo, &arg);
+ if (lseg) {
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ PNFS_UPDATE_LAYOUT_FOUND_CACHED);
+ goto out_unlock;
+ }
+
+ if (!nfs4_valid_open_stateid(ctx->state)) {
+ lseg = ERR_PTR(-EBADF);
+ goto out_unlock;
+ }
+
+ /*
+ * Choose a stateid for the LAYOUTGET. If we don't have a layout
+ * stateid, or it has been invalidated, then we must use the open
+ * stateid.
+ */
+ if (lo->plh_stateid.seqid == 0 ||
+ test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
+
+ /*
+ * The first layoutget for the file. Need to serialize per
* RFC 5661 Errata 3208.
*/
if (test_and_set_bit(NFS_LAYOUT_FIRST_LAYOUTGET,
@@ -1575,16 +1566,14 @@ lookup_again:
pnfs_put_layout_hdr(lo);
goto lookup_again;
}
+
+ first = true;
+ do {
+ seq = read_seqbegin(&ctx->state->seqlock);
+ nfs4_stateid_copy(&stateid, &ctx->state->stateid);
+ } while (read_seqretry(&ctx->state->seqlock, seq));
} else {
- /* Check to see if the layout for the given range
- * already exists
- */
- lseg = pnfs_find_lseg(lo, &arg);
- if (lseg) {
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
- PNFS_UPDATE_LAYOUT_FOUND_CACHED);
- goto out_unlock;
- }
+ nfs4_stateid_copy(&stateid, &lo->plh_stateid);
}

/*
@@ -1632,13 +1621,24 @@ lookup_again:
if (arg.length != NFS4_MAX_UINT64)
arg.length = PAGE_ALIGN(arg.length);

- lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
+ lseg = send_layoutget(lo, ctx, &stateid, &arg, &timeout, gfp_flags);
if (IS_ERR(lseg)) {
- if (lseg == ERR_PTR(-EAGAIN)) {
+ if (lseg == ERR_PTR(-EAGAIN) ||
+ lseg == ERR_PTR(-NFS4ERR_RECALLCONFLICT)) {
if (first)
pnfs_clear_first_layoutget(lo);
pnfs_put_layout_hdr(lo);
- goto lookup_again;
+ if (timeout) {
+ if (lseg == ERR_PTR(-NFS4ERR_RECALLCONFLICT) &&
+ time_after(jiffies + timeout, giveup)) {
+ lseg = NULL;
+ } else {
+ schedule_timeout(timeout);
+ goto lookup_again;
+ }
+ } else {
+ goto lookup_again;
+ }
}

if (!nfs_error_is_fatal(PTR_ERR(lseg)))
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 971068b58647..02d27cb90dd2 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -260,10 +260,6 @@ void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo);
void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
const nfs4_stateid *new,
bool update_barrier);
-int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
- struct pnfs_layout_hdr *lo,
- const struct pnfs_layout_range *range,
- struct nfs4_state *open_state);
int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
const struct pnfs_layout_range *recall_range,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index cb9982d8f38f..bdcf0262efc3 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -233,7 +233,6 @@ struct nfs4_layoutget_args {
struct inode *inode;
struct nfs_open_context *ctx;
nfs4_stateid stateid;
- unsigned long timestamp;
struct nfs4_layoutdriver_data layout;
};

@@ -251,7 +250,7 @@ struct nfs4_layoutget {
struct nfs4_layoutget_res res;
struct rpc_cred *cred;
gfp_t gfp_flags;
- long timeout;
+ long *timeout;
};

struct nfs4_getdeviceinfo_args {
--
2.5.5

2016-05-11 10:21:35

by Jeff Layton

[permalink] [raw]

Subject: [PATCH v2 7/9] pnfs: fix bad error handling in send_layoutget

Currently, the code will clear the fail bit if we get back a fatal
error. I don't think that's correct -- we only want to clear that
bit if the layoutget succeeds.

Fixes: 0bcbf039f6 (nfs: handle request add failure properly)
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e6cad5ee5d29..5f6ed295acb5 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -876,11 +876,13 @@ send_layoutget(struct pnfs_layout_hdr *lo,
lseg = nfs4_proc_layoutget(lgp, gfp_flags);
} while (lseg == ERR_PTR(-EAGAIN));

- if (IS_ERR(lseg) && !nfs_error_is_fatal(PTR_ERR(lseg)))
- lseg = NULL;
- else
+ if (IS_ERR(lseg)) {
+ if (!nfs_error_is_fatal(PTR_ERR(lseg)))
+ lseg = NULL;
+ } else {
pnfs_layout_clear_fail_bit(lo,
pnfs_iomode_to_fail_bit(range->iomode));
+ }

return lseg;
}
--
2.5.5

2016-05-11 15:29:47

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH v2 8/9] pnfs: lift retry logic from send_layoutget to pnfs_update_layout

On Wed, 2016-05-11 at 06:21 -0400, Jeff Layton wrote:
> If we get back something like NFS4ERR_OLD_STATEID, that will be
> translated into -EAGAIN, and the do/while loop in send_layoutget
> will drive the call again.
>
> This is not quite what we want, I think. An error like that is a
> sign that something has changed. That something could have been a
> concurrent LAYOUTGET that would give us a usable lseg.
>
> Lift the retry logic into pnfs_update_layout instead. That allows
> us to redo the layout search, and may spare us from having to issue
> an RPC.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/pnfs.c | 67 ++++++++++++++++++++++++++++++-----------------------------
> 1 file changed, 34 insertions(+), 33 deletions(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 5f6ed295acb5..ed3ab3e81f38 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -839,7 +839,6 @@ send_layoutget(struct pnfs_layout_hdr *lo,
> struct inode *ino = lo->plh_inode;
> struct nfs_server *server = NFS_SERVER(ino);
> struct nfs4_layoutget *lgp;
> - struct pnfs_layout_segment *lseg;
> loff_t i_size;
>
> dprintk("--> %s\n", __func__);
> @@ -849,42 +848,30 @@ send_layoutget(struct pnfs_layout_hdr *lo,
> * store in lseg. If we race with a concurrent seqid morphing
> * op, then re-send the LAYOUTGET.
> */
> - do {
> - lgp = kzalloc(sizeof(*lgp), gfp_flags);
> - if (lgp == NULL)
> - return NULL;
> -

Just spotted this bug. The above should return ERR_PTR(-ENOMEM). Fixed
in my nfs-4.7 branch.

> - i_size = i_size_read(ino);
> -
> - lgp->args.minlength = PAGE_SIZE;
> - if (lgp->args.minlength > range->length)
> - lgp->args.minlength = range->length;
> - if (range->iomode == IOMODE_READ) {
> - if (range->offset >= i_size)
> - lgp->args.minlength = 0;
> - else if (i_size - range->offset < lgp->args.minlength)
> - lgp->args.minlength = i_size - range->offset;
> - }
> - lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
> - pnfs_copy_range(&lgp->args.range, range);
> - lgp->args.type = server->pnfs_curr_ld->id;
> - lgp->args.inode = ino;
> - lgp->args.ctx = get_nfs_open_context(ctx);
> - lgp->gfp_flags = gfp_flags;
> - lgp->cred = lo->plh_lc_cred;
> + lgp = kzalloc(sizeof(*lgp), gfp_flags);
> + if (lgp == NULL)
> + return NULL;
>
> - lseg = nfs4_proc_layoutget(lgp, gfp_flags);
> - } while (lseg == ERR_PTR(-EAGAIN));
> + i_size = i_size_read(ino);
>
> - if (IS_ERR(lseg)) {
> - if (!nfs_error_is_fatal(PTR_ERR(lseg)))
> - lseg = NULL;
> - } else {
> - pnfs_layout_clear_fail_bit(lo,
> - pnfs_iomode_to_fail_bit(range->iomode));
> + lgp->args.minlength = PAGE_SIZE;
> + if (lgp->args.minlength > range->length)
> + lgp->args.minlength = range->length;
> + if (range->iomode == IOMODE_READ) {
> + if (range->offset >= i_size)
> + lgp->args.minlength = 0;
> + else if (i_size - range->offset < lgp->args.minlength)
> + lgp->args.minlength = i_size - range->offset;
> }
> + lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
> + pnfs_copy_range(&lgp->args.range, range);
> + lgp->args.type = server->pnfs_curr_ld->id;
> + lgp->args.inode = ino;
> + lgp->args.ctx = get_nfs_open_context(ctx);
> + lgp->gfp_flags = gfp_flags;
> + lgp->cred = lo->plh_lc_cred;
>
> - return lseg;
> + return nfs4_proc_layoutget(lgp, gfp_flags);
> }
>
> static void pnfs_clear_layoutcommit(struct inode *inode,
> @@ -1646,6 +1633,20 @@ lookup_again:
> arg.length = PAGE_ALIGN(arg.length);
>
> lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
> + if (IS_ERR(lseg)) {
> + if (lseg == ERR_PTR(-EAGAIN)) {
> + if (first)
> + pnfs_clear_first_layoutget(lo);
> + pnfs_put_layout_hdr(lo);
> + goto lookup_again;
> + }
> +
> + if (!nfs_error_is_fatal(PTR_ERR(lseg)))
> + lseg = NULL;
> + } else {
> + pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> + }
> +
> atomic_dec(&lo->plh_outstanding);
> trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
--
Jeff Layton <[email protected]>