v4:
- fix error handling in send_layoutget/pnfs_update_layout
v3:
- move more of the LAYOUTGET error handling out of the state machine
- better tracepoints
- cleanup and fixes in pnfs_layout_process
- patches to prevent client from flipping to in-band I/O
- bugfixes
v2:
- rework of LAYOUTGET retry handling.
v4 for the set, a fix for the error handling bug that Trond spotted in
send_layoutget. Also fix a minor comment typo that Tom spotted.
Original cover letter from the RFC patchset follows:
--------------------------[snip]----------------------------
At Primary Data, one of the things we're most interested in is data
mobility. IOW, we want to be able to change the layout for an inode
seamlessly, with little interruption to I/O patterns.
The problem we have now is that CB_LAYOUTRECALLs interrupt I/O. When one
comes in, most pNFS servers refuse to hand out new layouts until the
recalled ones have been returned (or the client indicates that it no
longer knows about them). It doesn't have to be this way though. RFC5661
allows for concurrent LAYOUTGET and LAYOUTRETURN calls.
Furthermore, servers are expected to deal with old stateids in
LAYOUTRETURN. From RFC5661, section 18.44.3:
If the client returns the layout in response to a CB_LAYOUTRECALL
where the lor_recalltype field of the clora_recall field was
LAYOUTRECALL4_FILE, the client should use the lor_stateid value from
CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should
use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid
(from a previous LAYRETURN result). This is done to indicate the
point in time (in terms of layout stateid transitions) when the
recall was sent.
The way I'm interpreting this is that we can treat a LAYOUTRETURN with
an old stateid as returning all layouts that matched the given iomode,
at the time that that seqid was current.
With that, we can allow a LAYOUTGET on the same fh to proceed even when
there are still recalled layouts outstanding. This should allow the
client to pivot to a new layout while it's still draining I/Os
that are pinning the ones to be returned.
This patchset is a first draft of the client side piece that allows
this. Basically whenever we get a new layout segment, we'll tag it with
the seqid that was in the LAYOUTGET stateid that grants it.
When a CB_LAYOUTRECALL comes in, we tag the return seqid in the layout
header with the one that was in the request. When we do a LAYOUTRETURN
in response to a CB_LAYOUTRECALL, we craft the seqid such that we're
only returning the layouts that were recalled. Nothing that has been
granted since then will be returned.
I think I've done this in a way that the existing behavior is preserved
in the case where the server enforces the serialization of these
operations, but please do have a look and let me know if you see any
potential problems here. Testing this is still a WIP...
Jeff Layton (10):
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit
set
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
pnfs: fix bad error handling in send_layoutget
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: rework LAYOUTGET retry handling
pnfs: make pnfs_layout_process more robust
Tom Haynes (2):
pNFS/flexfiles: When checking for available DSes, conditionally check
for MDS io
pNFS/flexfiles: When initing reads or writes, we might have to retry
connecting to DSes
Trond Myklebust (1):
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
fs/nfs/callback_proc.c | 3 +-
fs/nfs/flexfilelayout/flexfilelayout.c | 63 ++++---
fs/nfs/flexfilelayout/flexfilelayout.h | 1 +
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 32 +++-
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs4proc.c | 120 +++++-------
fs/nfs/nfs4trace.h | 10 +-
fs/nfs/pnfs.c | 300 +++++++++++++++++-------------
fs/nfs/pnfs.h | 14 +-
include/linux/errno.h | 1 +
include/linux/nfs4.h | 2 +
include/linux/nfs_xdr.h | 2 -
12 files changed, 299 insertions(+), 251 deletions(-)
--
2.5.5
From: Trond Myklebust <[email protected]>
This patch fixes a problem whereby the pNFS client falls back to doing
reads and writes through the metadata server even when the layout flag
FF_FLAGS_NO_IO_THRU_MDS is set.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 60d690dbc947..51f6660a2247 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1294,7 +1294,7 @@ ff_layout_set_layoutcommit(struct nfs_pgio_header *hdr)
}
static bool
-ff_layout_reset_to_mds(struct pnfs_layout_segment *lseg, int idx)
+ff_layout_device_unavailable(struct pnfs_layout_segment *lseg, int idx)
{
/* No mirroring for now */
struct nfs4_deviceid_node *node = FF_LAYOUT_DEVID_NODE(lseg, idx);
@@ -1331,16 +1331,10 @@ static int ff_layout_read_prepare_common(struct rpc_task *task,
rpc_exit(task, -EIO);
return -EIO;
}
- if (ff_layout_reset_to_mds(hdr->lseg, hdr->pgio_mirror_idx)) {
- dprintk("%s task %u reset io to MDS\n", __func__, task->tk_pid);
- if (ff_layout_has_available_ds(hdr->lseg))
- pnfs_read_resend_pnfs(hdr);
- else
- ff_layout_reset_read(hdr);
- rpc_exit(task, 0);
+ if (ff_layout_device_unavailable(hdr->lseg, hdr->pgio_mirror_idx)) {
+ rpc_exit(task, -EHOSTDOWN);
return -EAGAIN;
}
- hdr->pgio_done_cb = ff_layout_read_done_cb;
ff_layout_read_record_layoutstats_start(task, hdr);
return 0;
@@ -1530,14 +1524,8 @@ static int ff_layout_write_prepare_common(struct rpc_task *task,
return -EIO;
}
- if (ff_layout_reset_to_mds(hdr->lseg, hdr->pgio_mirror_idx)) {
- bool retry_pnfs;
-
- retry_pnfs = ff_layout_has_available_ds(hdr->lseg);
- dprintk("%s task %u reset io to %s\n", __func__,
- task->tk_pid, retry_pnfs ? "pNFS" : "MDS");
- ff_layout_reset_write(hdr, retry_pnfs);
- rpc_exit(task, 0);
+ if (ff_layout_device_unavailable(hdr->lseg, hdr->pgio_mirror_idx)) {
+ rpc_exit(task, -EHOSTDOWN);
return -EAGAIN;
}
@@ -1754,6 +1742,7 @@ ff_layout_read_pagelist(struct nfs_pgio_header *hdr)
dprintk("%s USE DS: %s cl_count %d vers %d\n", __func__,
ds->ds_remotestr, atomic_read(&ds->ds_clp->cl_count), vers);
+ hdr->pgio_done_cb = ff_layout_read_done_cb;
atomic_inc(&ds->ds_clp->cl_count);
hdr->ds_clp = ds->ds_clp;
fh = nfs4_ff_layout_select_ds_fh(lseg, idx);
--
2.5.5
Otherwise, we'll end up returning layouts that we've just received if
the client issues a new LAYOUTGET prior to the LAYOUTRETURN.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index f58cd2a14d5e..127a65fa639a 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -298,6 +298,8 @@ ff_lseg_merge(struct pnfs_layout_segment *new,
{
u64 new_end, old_end;
+ if (test_bit(NFS_LSEG_LAYOUTRETURN, &old->pls_flags))
+ return false;
if (new->pls_range.iomode != old->pls_range.iomode)
return false;
old_end = pnfs_calc_offset_end(old->pls_range.offset,
@@ -318,8 +320,6 @@ ff_lseg_merge(struct pnfs_layout_segment *new,
new_end);
if (test_bit(NFS_LSEG_ROC, &old->pls_flags))
set_bit(NFS_LSEG_ROC, &new->pls_flags);
- if (test_bit(NFS_LSEG_LAYOUTRETURN, &old->pls_flags))
- set_bit(NFS_LSEG_LAYOUTRETURN, &new->pls_flags);
return true;
}
--
2.5.5
From: Tom Haynes <[email protected]>
Whenever we check to see if we have the needed number of DSes for the
action, we may also have to check to see whether IO is allowed to go to
the MDS or not.
[jlayton: fix merge conflict due to lack of localio patches here]
Signed-off-by: Tom Haynes <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 5 ++---
fs/nfs/flexfilelayout/flexfilelayout.h | 1 +
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 6 ++++++
3 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 51f6660a2247..f538ca6bbe81 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1101,8 +1101,7 @@ static int ff_layout_async_handle_error_v4(struct rpc_task *task,
rpc_wake_up(&tbl->slot_tbl_waitq);
/* fall through */
default:
- if (ff_layout_no_fallback_to_mds(lseg) ||
- ff_layout_has_available_ds(lseg))
+ if (ff_layout_avoid_mds_available_ds(lseg))
return -NFS4ERR_RESET_TO_PNFS;
reset:
dprintk("%s Retry through MDS. Error %d\n", __func__,
@@ -1764,7 +1763,7 @@ ff_layout_read_pagelist(struct nfs_pgio_header *hdr)
return PNFS_ATTEMPTED;
out_failed:
- if (ff_layout_has_available_ds(lseg))
+ if (ff_layout_avoid_mds_available_ds(lseg))
return PNFS_TRY_AGAIN;
return PNFS_NOT_ATTEMPTED;
}
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.h b/fs/nfs/flexfilelayout/flexfilelayout.h
index 1318c77aeb35..b54058122647 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.h
+++ b/fs/nfs/flexfilelayout/flexfilelayout.h
@@ -191,4 +191,5 @@ nfs4_ff_find_or_create_ds_client(struct pnfs_layout_segment *lseg,
struct rpc_cred *ff_layout_get_ds_cred(struct pnfs_layout_segment *lseg,
u32 ds_idx, struct rpc_cred *mdscred);
bool ff_layout_has_available_ds(struct pnfs_layout_segment *lseg);
+bool ff_layout_avoid_mds_available_ds(struct pnfs_layout_segment *lseg);
#endif /* FS_NFS_NFS4FLEXFILELAYOUT_H */
diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
index 56296f3df19c..c52ca75081a8 100644
--- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
+++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
@@ -540,6 +540,12 @@ bool ff_layout_has_available_ds(struct pnfs_layout_segment *lseg)
return ff_rw_layout_has_available_ds(lseg);
}
+bool ff_layout_avoid_mds_available_ds(struct pnfs_layout_segment *lseg)
+{
+ return ff_layout_no_fallback_to_mds(lseg) ||
+ ff_layout_has_available_ds(lseg);
+}
+
module_param(dataserver_retrans, uint, 0644);
MODULE_PARM_DESC(dataserver_retrans, "The number of times the NFSv4.1 client "
"retries a request before it attempts further "
--
2.5.5
From: Tom Haynes <[email protected]>
If we are initializing reads or writes and can not connect to a DS, then
check whether or not IO is allowed through the MDS. If it is allowed,
reset to the MDS. Else, fail the layout segment and force a retry
of a new layout segment.
Signed-off-by: Tom Haynes <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index f538ca6bbe81..f58cd2a14d5e 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -847,8 +847,13 @@ ff_layout_pg_init_read(struct nfs_pageio_descriptor *pgio,
goto out_mds;
ds = ff_layout_choose_best_ds_for_read(pgio->pg_lseg, 0, &ds_idx);
- if (!ds)
- goto out_mds;
+ if (!ds) {
+ if (ff_layout_no_fallback_to_mds(pgio->pg_lseg))
+ goto out_pnfs;
+ else
+ goto out_mds;
+ }
+
mirror = FF_LAYOUT_COMP(pgio->pg_lseg, ds_idx);
pgio->pg_mirror_idx = ds_idx;
@@ -862,6 +867,12 @@ out_mds:
pnfs_put_lseg(pgio->pg_lseg);
pgio->pg_lseg = NULL;
nfs_pageio_reset_read_mds(pgio);
+ return;
+
+out_pnfs:
+ pnfs_set_lo_fail(pgio->pg_lseg);
+ pnfs_put_lseg(pgio->pg_lseg);
+ pgio->pg_lseg = NULL;
}
static void
@@ -904,8 +915,12 @@ ff_layout_pg_init_write(struct nfs_pageio_descriptor *pgio,
for (i = 0; i < pgio->pg_mirror_count; i++) {
ds = nfs4_ff_layout_prepare_ds(pgio->pg_lseg, i, true);
- if (!ds)
- goto out_mds;
+ if (!ds) {
+ if (ff_layout_no_fallback_to_mds(pgio->pg_lseg))
+ goto out_pnfs;
+ else
+ goto out_mds;
+ }
pgm = &pgio->pg_mirrors[i];
mirror = FF_LAYOUT_COMP(pgio->pg_lseg, i);
pgm->pg_bsize = mirror->mirror_ds->ds_versions[0].wsize;
@@ -917,6 +932,12 @@ out_mds:
pnfs_put_lseg(pgio->pg_lseg);
pgio->pg_lseg = NULL;
nfs_pageio_reset_write_mds(pgio);
+ return;
+
+out_pnfs:
+ pnfs_set_lo_fail(pgio->pg_lseg);
+ pnfs_put_lseg(pgio->pg_lseg);
+ pgio->pg_lseg = NULL;
}
static unsigned int
--
2.5.5
In later patches, we're going to teach the client to be more selective
about how it returns layouts. This means keeping a record of what the
stateid's seqid was at the time that the server handed out a layout
segment.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 1 +
fs/nfs/pnfs.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 5b404d926e08..18b6f950e300 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1697,6 +1697,7 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
init_lseg(lo, lseg);
lseg->pls_range = res->range;
+ lseg->pls_seq = be32_to_cpu(res->stateid.seqid);
spin_lock(&ino->i_lock);
if (pnfs_layoutgets_blocked(lo)) {
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 7222d3a35439..361fa5494aa5 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -64,6 +64,7 @@ struct pnfs_layout_segment {
struct list_head pls_lc_list;
struct pnfs_layout_range pls_range;
atomic_t pls_refcount;
+ u32 pls_seq;
unsigned long pls_flags;
struct pnfs_layout_hdr *pls_layout;
struct work_struct pls_work;
--
2.5.5
When we want to selectively do a LAYOUTRETURN, we need to specify a
stateid that represents most recent layout acquisition that is to be
returned.
When we mark a layout stateid to be returned, we update the return
sequence number in the layout header with that value, if it's newer
than the existing one. Then, when we go to do a LAYOUTRETURN on
layout header put, we overwrite the seqid in the stateid with the
saved one, and then zero it out.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 11 ++++++++---
fs/nfs/pnfs.h | 1 +
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 18b6f950e300..39432a3705b4 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -899,6 +899,7 @@ pnfs_prepare_layoutreturn(struct pnfs_layout_hdr *lo)
if (test_and_set_bit(NFS_LAYOUT_RETURN, &lo->plh_flags))
return false;
lo->plh_return_iomode = 0;
+ lo->plh_return_seq = 0;
pnfs_get_layout_hdr(lo);
clear_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags);
return true;
@@ -969,6 +970,7 @@ static void pnfs_layoutreturn_before_put_layout_hdr(struct pnfs_layout_hdr *lo)
bool send;
nfs4_stateid_copy(&stateid, &lo->plh_stateid);
+ stateid.seqid = cpu_to_be32(lo->plh_return_seq);
iomode = lo->plh_return_iomode;
send = pnfs_prepare_layoutreturn(lo);
spin_unlock(&inode->i_lock);
@@ -1747,7 +1749,8 @@ out_forget_reply:
}
static void
-pnfs_set_plh_return_iomode(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
+pnfs_set_plh_return_info(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode,
+ u32 seq)
{
if (lo->plh_return_iomode == iomode)
return;
@@ -1755,6 +1758,8 @@ pnfs_set_plh_return_iomode(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode)
iomode = IOMODE_ANY;
lo->plh_return_iomode = iomode;
set_bit(NFS_LAYOUT_RETURN_REQUESTED, &lo->plh_flags);
+ if (!lo->plh_return_seq || pnfs_seqid_is_newer(seq, lo->plh_return_seq))
+ lo->plh_return_seq = seq;
}
/**
@@ -1793,7 +1798,7 @@ pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
continue;
remaining++;
set_bit(NFS_LSEG_LAYOUTRETURN, &lseg->pls_flags);
- pnfs_set_plh_return_iomode(lo, return_range->iomode);
+ pnfs_set_plh_return_info(lo, return_range->iomode, lseg->pls_seq);
}
return remaining;
}
@@ -1811,7 +1816,7 @@ void pnfs_error_mark_layout_for_return(struct inode *inode,
bool return_now = false;
spin_lock(&inode->i_lock);
- pnfs_set_plh_return_iomode(lo, range.iomode);
+ pnfs_set_plh_return_info(lo, range.iomode, lseg->pls_seq);
/*
* mark all matching lsegs so that we are sure to have no live
* segments at hand when sending layoutreturn. See pnfs_put_lseg()
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 361fa5494aa5..3476c9850678 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -195,6 +195,7 @@ struct pnfs_layout_hdr {
unsigned long plh_flags;
nfs4_stateid plh_stateid;
u32 plh_barrier; /* ignore lower seqids */
+ u32 plh_return_seq;
enum pnfs_iomode plh_return_iomode;
loff_t plh_lwb; /* last write byte for layoutcommit */
struct rpc_cred *plh_lc_cred; /* layoutcommit cred */
--
2.5.5
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
index 5532cb14e800..b70133de9505 100644
--- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
+++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
@@ -342,7 +342,23 @@ out:
return fh;
}
-/* Upon return, either ds is connected, or ds is NULL */
+/**
+ * nfs4_ff_layout_prepare_ds - prepare a DS connection for an RPC call
+ * @lseg: the layout segment we're operating on
+ * @ds_idx: index of the DS to use
+ * @fail_return: return layout on connect failure?
+ *
+ * Try to prepare a DS connection to accept an RPC call. This involves
+ * selecting a mirror to use and connecting the client to it if it's not
+ * already connected.
+ *
+ * Since we only need a single functioning mirror to satisfy a read, we don't
+ * want to return the layout if there is one. For writes though, any down
+ * mirror should result in a LAYOUTRETURN. @fail_return is how we distinguish
+ * between the two cases.
+ *
+ * Returns a pointer to a connected DS object on success or NULL on failure.
+ */
struct nfs4_pnfs_ds *
nfs4_ff_layout_prepare_ds(struct pnfs_layout_segment *lseg, u32 ds_idx,
bool fail_return)
--
2.5.5
Setting just the NFS_LAYOUT_RETURN_REQUESTED flag doesn't do anything,
unless there are lsegs that are also being marked for return. At the
point where that happens this flag is also set, so these set_bit calls
don't do anything useful.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/flexfilelayout/flexfilelayout.c | 2 --
fs/nfs/flexfilelayout/flexfilelayoutdev.c | 8 +-------
2 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c
index 127a65fa639a..1c902b01db9e 100644
--- a/fs/nfs/flexfilelayout/flexfilelayout.c
+++ b/fs/nfs/flexfilelayout/flexfilelayout.c
@@ -1269,8 +1269,6 @@ static int ff_layout_read_done_cb(struct rpc_task *task,
hdr->pgio_mirror_idx + 1,
&hdr->pgio_mirror_idx))
goto out_eagain;
- set_bit(NFS_LAYOUT_RETURN_REQUESTED,
- &hdr->lseg->pls_layout->plh_flags);
pnfs_read_resend_pnfs(hdr);
return task->tk_status;
case -NFS4ERR_RESET_TO_MDS:
diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
index c52ca75081a8..5532cb14e800 100644
--- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c
+++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c
@@ -393,13 +393,7 @@ nfs4_ff_layout_prepare_ds(struct pnfs_layout_segment *lseg, u32 ds_idx,
mirror, lseg->pls_range.offset,
lseg->pls_range.length, NFS4ERR_NXIO,
OP_ILLEGAL, GFP_NOIO);
- if (!fail_return) {
- if (ff_layout_has_available_ds(lseg))
- set_bit(NFS_LAYOUT_RETURN_REQUESTED,
- &lseg->pls_layout->plh_flags);
- else
- pnfs_error_mark_layout_for_return(ino, lseg);
- } else
+ if (fail_return || !ff_layout_has_available_ds(lseg))
pnfs_error_mark_layout_for_return(ino, lseg);
ds = NULL;
}
--
2.5.5
LAYOUTRETURN is "special" in that servers and clients are expected to
work with old stateids. When the client sends a LAYOUTRETURN with an old
stateid in it then the server is expected to only tear down layout
segments that were present when that seqid was current. Ensure that the
client handles its accounting accordingly.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/callback_proc.c | 3 ++-
fs/nfs/nfs42proc.c | 2 +-
fs/nfs/nfs4proc.c | 5 ++--
fs/nfs/pnfs.c | 64 +++++++++++++++++++++++++++++++++-----------------
fs/nfs/pnfs.h | 6 +++--
5 files changed, 52 insertions(+), 28 deletions(-)
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 618ced381a14..755838df9996 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -217,7 +217,8 @@ static u32 initiate_file_draining(struct nfs_client *clp,
}
if (pnfs_mark_matching_lsegs_return(lo, &free_me_list,
- &args->cbl_range)) {
+ &args->cbl_range,
+ be32_to_cpu(args->cbl_stateid.seqid))) {
rv = NFS4_OK;
goto unlock;
}
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index dff83460e5a6..198bcc3e103d 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -232,7 +232,7 @@ nfs42_layoutstat_done(struct rpc_task *task, void *calldata)
* with the current stateid.
*/
set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- pnfs_mark_matching_lsegs_invalid(lo, &head, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, NULL, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
} else
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index bc2676c95e1b..c0d75be8cb69 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7930,7 +7930,7 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
* with the current stateid.
*/
set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- pnfs_mark_matching_lsegs_invalid(lo, &head, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, NULL, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
} else
@@ -8122,7 +8122,8 @@ static void nfs4_layoutreturn_release(void *calldata)
dprintk("--> %s\n", __func__);
spin_lock(&lo->plh_inode->i_lock);
- pnfs_mark_matching_lsegs_invalid(lo, &freeme, &lrp->args.range);
+ pnfs_mark_matching_lsegs_invalid(lo, &freeme, &lrp->args.range,
+ be32_to_cpu(lrp->args.stateid.seqid));
pnfs_mark_layout_returned_if_empty(lo);
if (lrp->res.lrs_present)
pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 39432a3705b4..e6cad5ee5d29 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -270,7 +270,7 @@ pnfs_mark_layout_stateid_invalid(struct pnfs_layout_hdr *lo,
};
set_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags);
- return pnfs_mark_matching_lsegs_invalid(lo, lseg_list, &range);
+ return pnfs_mark_matching_lsegs_invalid(lo, lseg_list, &range, 0);
}
static int
@@ -308,7 +308,7 @@ pnfs_layout_io_set_failed(struct pnfs_layout_hdr *lo, u32 iomode)
spin_lock(&inode->i_lock);
pnfs_layout_set_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
- pnfs_mark_matching_lsegs_invalid(lo, &head, &range);
+ pnfs_mark_matching_lsegs_invalid(lo, &head, &range, 0);
spin_unlock(&inode->i_lock);
pnfs_free_lseg_list(&head);
dprintk("%s Setting layout IOMODE_%s fail bit\n", __func__,
@@ -522,13 +522,35 @@ static int mark_lseg_invalid(struct pnfs_layout_segment *lseg,
return rv;
}
-/* Returns count of number of matching invalid lsegs remaining in list
- * after call.
+/*
+ * Compare 2 layout stateid sequence ids, to see which is newer,
+ * taking into account wraparound issues.
+ */
+static bool pnfs_seqid_is_newer(u32 s1, u32 s2)
+{
+ return (s32)(s1 - s2) > 0;
+}
+
+/**
+ * pnfs_mark_matching_lsegs_invalid - tear down lsegs or mark them for later
+ * @lo: layout header containing the lsegs
+ * @tmp_list: list head where doomed lsegs should go
+ * @recall_range: optional recall range argument to match (may be NULL)
+ * @seq: only invalidate lsegs obtained prior to this sequence (may be 0)
+ *
+ * Walk the list of lsegs in the layout header, and tear down any that should
+ * be destroyed. If "recall_range" is specified then the segment must match
+ * that range. If "seq" is non-zero, then only match segments that were handed
+ * out at or before that sequence.
+ *
+ * Returns number of matching invalid lsegs remaining in list after scanning
+ * it and purging them.
*/
int
pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range)
+ const struct pnfs_layout_range *recall_range,
+ u32 seq)
{
struct pnfs_layout_segment *lseg, *next;
int remaining = 0;
@@ -540,10 +562,12 @@ pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
list_for_each_entry_safe(lseg, next, &lo->plh_segs, pls_list)
if (!recall_range ||
should_free_lseg(&lseg->pls_range, recall_range)) {
- dprintk("%s: freeing lseg %p iomode %d "
+ if (seq && pnfs_seqid_is_newer(lseg->pls_seq, seq))
+ continue;
+ dprintk("%s: freeing lseg %p iomode %d seq %u"
"offset %llu length %llu\n", __func__,
- lseg, lseg->pls_range.iomode, lseg->pls_range.offset,
- lseg->pls_range.length);
+ lseg, lseg->pls_range.iomode, lseg->pls_seq,
+ lseg->pls_range.offset, lseg->pls_range.length);
if (!mark_lseg_invalid(lseg, tmp_list))
remaining++;
}
@@ -730,15 +754,6 @@ pnfs_destroy_all_layouts(struct nfs_client *clp)
pnfs_destroy_layouts_byclid(clp, false);
}
-/*
- * Compare 2 layout stateid sequence ids, to see which is newer,
- * taking into account wraparound issues.
- */
-static bool pnfs_seqid_is_newer(u32 s1, u32 s2)
-{
- return (s32)(s1 - s2) > 0;
-}
-
/* update lo->plh_stateid with new if is more recent */
void
pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo, const nfs4_stateid *new,
@@ -1014,7 +1029,7 @@ _pnfs_return_layout(struct inode *ino)
pnfs_get_layout_hdr(lo);
empty = list_empty(&lo->plh_segs);
pnfs_clear_layoutcommit(ino, &tmp_list);
- pnfs_mark_matching_lsegs_invalid(lo, &tmp_list, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &tmp_list, NULL, 0);
if (NFS_SERVER(ino)->pnfs_curr_ld->return_range) {
struct pnfs_layout_range range = {
@@ -1721,7 +1736,7 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
* inode invalid, and don't bother validating the stateid
* sequence number.
*/
- pnfs_mark_matching_lsegs_invalid(lo, &free_me, NULL);
+ pnfs_mark_matching_lsegs_invalid(lo, &free_me, NULL, 0);
nfs4_stateid_copy(&lo->plh_stateid, &res->stateid);
lo->plh_barrier = be32_to_cpu(res->stateid.seqid);
@@ -1775,7 +1790,8 @@ pnfs_set_plh_return_info(struct pnfs_layout_hdr *lo, enum pnfs_iomode iomode,
int
pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *return_range)
+ const struct pnfs_layout_range *return_range,
+ u32 seq)
{
struct pnfs_layout_segment *lseg, *next;
int remaining = 0;
@@ -1798,8 +1814,11 @@ pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
continue;
remaining++;
set_bit(NFS_LSEG_LAYOUTRETURN, &lseg->pls_flags);
- pnfs_set_plh_return_info(lo, return_range->iomode, lseg->pls_seq);
}
+
+ if (remaining)
+ pnfs_set_plh_return_info(lo, return_range->iomode, seq);
+
return remaining;
}
@@ -1822,7 +1841,8 @@ void pnfs_error_mark_layout_for_return(struct inode *inode,
* segments at hand when sending layoutreturn. See pnfs_put_lseg()
* for how it works.
*/
- if (!pnfs_mark_matching_lsegs_return(lo, &free_me, &range)) {
+ if (!pnfs_mark_matching_lsegs_return(lo, &free_me,
+ &range, lseg->pls_seq)) {
nfs4_stateid stateid;
enum pnfs_iomode iomode = lo->plh_return_iomode;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 3476c9850678..971068b58647 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -266,10 +266,12 @@ int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
struct nfs4_state *open_state);
int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range);
+ const struct pnfs_layout_range *recall_range,
+ u32 seq);
int pnfs_mark_matching_lsegs_return(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
- const struct pnfs_layout_range *recall_range);
+ const struct pnfs_layout_range *recall_range,
+ u32 seq);
bool pnfs_roc(struct inode *ino);
void pnfs_roc_release(struct inode *ino);
void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
--
2.5.5
Currently, the code will clear the fail bit if we get back a fatal
error. I don't think that's correct -- we want to clear that bit
if we do not get a fatal error.
Fixes: 0bcbf039f6 (nfs: handle request add failure properly)
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e6cad5ee5d29..eb90b3af3446 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -876,11 +876,16 @@ send_layoutget(struct pnfs_layout_hdr *lo,
lseg = nfs4_proc_layoutget(lgp, gfp_flags);
} while (lseg == ERR_PTR(-EAGAIN));
- if (IS_ERR(lseg) && !nfs_error_is_fatal(PTR_ERR(lseg)))
- lseg = NULL;
- else
+ if (IS_ERR(lseg)) {
+ if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
+ pnfs_layout_clear_fail_bit(lo,
+ pnfs_iomode_to_fail_bit(range->iomode));
+ lseg = NULL;
+ }
+ } else {
pnfs_layout_clear_fail_bit(lo,
pnfs_iomode_to_fail_bit(range->iomode));
+ }
return lseg;
}
--
2.5.5
If we get back something like NFS4ERR_OLD_STATEID, that will be
translated into -EAGAIN, and the do/while loop in send_layoutget
will drive the call again.
This is not quite what we want, I think. An error like that is a
sign that something has changed. That something could have been a
concurrent LAYOUTGET that would give us a usable lseg.
Lift the retry logic into pnfs_update_layout instead. That allows
us to redo the layout search, and may spare us from having to issue
an RPC.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 72 +++++++++++++++++++++++++++++------------------------------
1 file changed, 36 insertions(+), 36 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index eb90b3af3446..bc2c3ec98d32 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -839,7 +839,6 @@ send_layoutget(struct pnfs_layout_hdr *lo,
struct inode *ino = lo->plh_inode;
struct nfs_server *server = NFS_SERVER(ino);
struct nfs4_layoutget *lgp;
- struct pnfs_layout_segment *lseg;
loff_t i_size;
dprintk("--> %s\n", __func__);
@@ -849,45 +848,30 @@ send_layoutget(struct pnfs_layout_hdr *lo,
* store in lseg. If we race with a concurrent seqid morphing
* op, then re-send the LAYOUTGET.
*/
- do {
- lgp = kzalloc(sizeof(*lgp), gfp_flags);
- if (lgp == NULL)
- return NULL;
-
- i_size = i_size_read(ino);
-
- lgp->args.minlength = PAGE_SIZE;
- if (lgp->args.minlength > range->length)
- lgp->args.minlength = range->length;
- if (range->iomode == IOMODE_READ) {
- if (range->offset >= i_size)
- lgp->args.minlength = 0;
- else if (i_size - range->offset < lgp->args.minlength)
- lgp->args.minlength = i_size - range->offset;
- }
- lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
- pnfs_copy_range(&lgp->args.range, range);
- lgp->args.type = server->pnfs_curr_ld->id;
- lgp->args.inode = ino;
- lgp->args.ctx = get_nfs_open_context(ctx);
- lgp->gfp_flags = gfp_flags;
- lgp->cred = lo->plh_lc_cred;
+ lgp = kzalloc(sizeof(*lgp), gfp_flags);
+ if (lgp == NULL)
+ return ERR_PTR(-ENOMEM);
- lseg = nfs4_proc_layoutget(lgp, gfp_flags);
- } while (lseg == ERR_PTR(-EAGAIN));
+ i_size = i_size_read(ino);
- if (IS_ERR(lseg)) {
- if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
- pnfs_layout_clear_fail_bit(lo,
- pnfs_iomode_to_fail_bit(range->iomode));
- lseg = NULL;
- }
- } else {
- pnfs_layout_clear_fail_bit(lo,
- pnfs_iomode_to_fail_bit(range->iomode));
+ lgp->args.minlength = PAGE_SIZE;
+ if (lgp->args.minlength > range->length)
+ lgp->args.minlength = range->length;
+ if (range->iomode == IOMODE_READ) {
+ if (range->offset >= i_size)
+ lgp->args.minlength = 0;
+ else if (i_size - range->offset < lgp->args.minlength)
+ lgp->args.minlength = i_size - range->offset;
}
+ lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
+ pnfs_copy_range(&lgp->args.range, range);
+ lgp->args.type = server->pnfs_curr_ld->id;
+ lgp->args.inode = ino;
+ lgp->args.ctx = get_nfs_open_context(ctx);
+ lgp->gfp_flags = gfp_flags;
+ lgp->cred = lo->plh_lc_cred;
- return lseg;
+ return nfs4_proc_layoutget(lgp, gfp_flags);
}
static void pnfs_clear_layoutcommit(struct inode *inode,
@@ -1649,6 +1633,22 @@ lookup_again:
arg.length = PAGE_ALIGN(arg.length);
lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
+ if (IS_ERR(lseg)) {
+ if (lseg == ERR_PTR(-EAGAIN)) {
+ if (first)
+ pnfs_clear_first_layoutget(lo);
+ pnfs_put_layout_hdr(lo);
+ goto lookup_again;
+ }
+
+ if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
+ pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
+ lseg = NULL;
+ }
+ } else {
+ pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
+ }
+
atomic_dec(&lo->plh_outstanding);
trace_pnfs_update_layout(ino, pos, count, iomode, lo,
PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
--
2.5.5
It can return NULL if layoutgets are blocked currently. Fix it to return
-EAGAIN in that case, so we can properly handle it in pnfs_update_layout.
Also, clean up and simplify the error handling -- eliminate "status" and
just use "lseg".
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/pnfs.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e97895b427c0..deb609c9cd8a 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1708,21 +1708,19 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
struct pnfs_layout_segment *lseg;
struct inode *ino = lo->plh_inode;
LIST_HEAD(free_me);
- int status = -EINVAL;
if (!pnfs_sanity_check_layout_range(&res->range))
- goto out;
+ return ERR_PTR(-EINVAL);
/* Inject layout blob into I/O device driver */
lseg = NFS_SERVER(ino)->pnfs_curr_ld->alloc_lseg(lo, res, lgp->gfp_flags);
- if (!lseg || IS_ERR(lseg)) {
+ if (IS_ERR_OR_NULL(lseg)) {
if (!lseg)
- status = -ENOMEM;
- else
- status = PTR_ERR(lseg);
- dprintk("%s: Could not allocate layout: error %d\n",
- __func__, status);
- goto out;
+ lseg = ERR_PTR(-ENOMEM);
+
+ dprintk("%s: Could not allocate layout: error %ld\n",
+ __func__, PTR_ERR(lseg));
+ return lseg;
}
init_lseg(lo, lseg);
@@ -1732,15 +1730,14 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
spin_lock(&ino->i_lock);
if (pnfs_layoutgets_blocked(lo)) {
dprintk("%s forget reply due to state\n", __func__);
- goto out_forget_reply;
+ goto out_forget;
}
if (nfs4_stateid_match_other(&lo->plh_stateid, &res->stateid)) {
/* existing state ID, make sure the sequence number matches. */
if (pnfs_layout_stateid_blocked(lo, &res->stateid)) {
dprintk("%s forget reply due to sequence\n", __func__);
- status = -EAGAIN;
- goto out_forget_reply;
+ goto out_forget;
}
pnfs_set_layout_stateid(lo, &res->stateid, false);
} else {
@@ -1766,14 +1763,12 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
spin_unlock(&ino->i_lock);
pnfs_free_lseg_list(&free_me);
return lseg;
-out:
- return ERR_PTR(status);
-out_forget_reply:
+out_forget:
spin_unlock(&ino->i_lock);
lseg->pls_layout = lo;
NFS_SERVER(ino)->pnfs_curr_ld->free_lseg(lseg);
- goto out;
+ return ERR_PTR(-EAGAIN);
}
static void
--
2.5.5
There are several problems in the way a stateid is selected for a
LAYOUTGET operation:
We pick a stateid to use in the RPC prepare op, but that makes
it difficult to serialize LAYOUTGETs that use the open stateid. That
serialization is done in pnfs_update_layout, which occurs well before
the rpc_prepare operation.
Between those two events, the i_lock is dropped and reacquired.
pnfs_update_layout can find that the list has lsegs in it and not do any
serialization, but then later pnfs_choose_layoutget_stateid ends up
choosing the open stateid.
This patch changes the client to select the stateid to use in the
LAYOUTGET earlier, when we're searching for a usable layout segment.
This way we can do it all while holding the i_lock the first time, and
ensure that we serialize any LAYOUTGET call that uses a non-layout
stateid.
This also means a rework of how LAYOUTGET replies are handled, as we
must now get the latest stateid if we want to retransmit in response
to a retryable error.
Most of those errors boil down to the fact that the layout state has
changed in some fashion. Thus, what we really want to do is to re-search
for a layout when it fails with a retryable error, so that we can avoid
reissuing the RPC at all if possible.
While the LAYOUTGET RPC is async, the initiating thread always waits for
it to complete, so it's effectively synchronous anyway. Currently, when
we need to retry a LAYOUTGET because of an error, we drive that retry
via the rpc state machine.
This means that once the call has been submitted, it runs until it
completes. So, we must move the error handling for this RPC out of the
rpc_call_done operation and into the caller.
In order to handle errors like NFS4ERR_DELAY properly, we must also
pass a pointer to the sliding timeout, which is now moved to the stack
in pnfs_update_layout.
The complicating errors are -NFS4ERR_RECALLCONFLICT and
-NFS4ERR_LAYOUTTRYLATER, as those involve a timeout after which we give
up and return NULL back to the caller. So, there is some special
handling for those errors to ensure that the layers driving the retries
can handle that appropriately.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/nfs4proc.c | 115 ++++++++++++++++----------------------
fs/nfs/nfs4trace.h | 10 +++-
fs/nfs/pnfs.c | 144 +++++++++++++++++++++++++-----------------------
fs/nfs/pnfs.h | 6 +-
include/linux/errno.h | 1 +
include/linux/nfs4.h | 2 +
include/linux/nfs_xdr.h | 2 -
7 files changed, 136 insertions(+), 144 deletions(-)
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index c0d75be8cb69..c2583ca6c8b6 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -416,6 +416,7 @@ static int nfs4_do_handle_exception(struct nfs_server *server,
case -NFS4ERR_DELAY:
nfs_inc_server_stats(server, NFSIOS_DELAY);
case -NFS4ERR_GRACE:
+ case -NFS4ERR_RECALLCONFLICT:
exception->delay = 1;
return 0;
@@ -7824,40 +7825,34 @@ nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
struct nfs4_layoutget *lgp = calldata;
struct nfs_server *server = NFS_SERVER(lgp->args.inode);
struct nfs4_session *session = nfs4_get_session(server);
- int ret;
dprintk("--> %s\n", __func__);
- /* Note the is a race here, where a CB_LAYOUTRECALL can come in
- * right now covering the LAYOUTGET we are about to send.
- * However, that is not so catastrophic, and there seems
- * to be no way to prevent it completely.
- */
- if (nfs41_setup_sequence(session, &lgp->args.seq_args,
- &lgp->res.seq_res, task))
- return;
- ret = pnfs_choose_layoutget_stateid(&lgp->args.stateid,
- NFS_I(lgp->args.inode)->layout,
- &lgp->args.range,
- lgp->args.ctx->state);
- if (ret < 0)
- rpc_exit(task, ret);
+ nfs41_setup_sequence(session, &lgp->args.seq_args,
+ &lgp->res.seq_res, task);
+ dprintk("<-- %s\n", __func__);
}
static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
{
struct nfs4_layoutget *lgp = calldata;
+
+ dprintk("--> %s\n", __func__);
+ nfs41_sequence_done(task, &lgp->res.seq_res);
+ dprintk("<-- %s\n", __func__);
+}
+
+static int
+nfs4_layoutget_handle_exception(struct rpc_task *task,
+ struct nfs4_layoutget *lgp, struct nfs4_exception *exception)
+{
struct inode *inode = lgp->args.inode;
struct nfs_server *server = NFS_SERVER(inode);
struct pnfs_layout_hdr *lo;
- struct nfs4_state *state = NULL;
- unsigned long timeo, now, giveup;
+ int status = task->tk_status;
dprintk("--> %s tk_status => %d\n", __func__, -task->tk_status);
- if (!nfs41_sequence_done(task, &lgp->res.seq_res))
- goto out;
-
- switch (task->tk_status) {
+ switch (status) {
case 0:
goto out;
@@ -7867,57 +7862,43 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
* retry go inband.
*/
case -NFS4ERR_LAYOUTUNAVAILABLE:
- task->tk_status = -ENODATA;
+ status = -ENODATA;
goto out;
/*
* NFS4ERR_BADLAYOUT means the MDS cannot return a layout of
* length lgp->args.minlength != 0 (see RFC5661 section 18.43.3).
*/
case -NFS4ERR_BADLAYOUT:
- goto out_overflow;
+ status = -EOVERFLOW;
+ goto out;
/*
* NFS4ERR_LAYOUTTRYLATER is a conflict with another client
* (or clients) writing to the same RAID stripe except when
* the minlength argument is 0 (see RFC5661 section 18.43.3).
+ *
+ * Treat it like we would RECALLCONFLICT -- we retry for a little
+ * while, and then eventually give up.
*/
case -NFS4ERR_LAYOUTTRYLATER:
- if (lgp->args.minlength == 0)
- goto out_overflow;
- /*
- * NFS4ERR_RECALLCONFLICT is when conflict with self (must recall
- * existing layout before getting a new one).
- */
- case -NFS4ERR_RECALLCONFLICT:
- timeo = rpc_get_timeout(task->tk_client);
- giveup = lgp->args.timestamp + timeo;
- now = jiffies;
- if (time_after(giveup, now)) {
- unsigned long delay;
-
- /* Delay for:
- * - Not less then NFS4_POLL_RETRY_MIN.
- * - One last time a jiffie before we give up
- * - exponential backoff (time_now minus start_attempt)
- */
- delay = max_t(unsigned long, NFS4_POLL_RETRY_MIN,
- min((giveup - now - 1),
- now - lgp->args.timestamp));
-
- dprintk("%s: NFS4ERR_RECALLCONFLICT waiting %lu\n",
- __func__, delay);
- rpc_delay(task, delay);
- /* Do not call nfs4_async_handle_error() */
- goto out_restart;
+ if (lgp->args.minlength == 0) {
+ status = -EOVERFLOW;
+ goto out;
}
- break;
+ /* Fallthrough */
+ case -NFS4ERR_RECALLCONFLICT:
+ nfs4_handle_exception(server, -NFS4ERR_RECALLCONFLICT,
+ exception);
+ status = -ERECALLCONFLICT;
+ goto out;
case -NFS4ERR_EXPIRED:
case -NFS4ERR_BAD_STATEID:
+ exception->timeout = 0;
spin_lock(&inode->i_lock);
if (nfs4_stateid_match(&lgp->args.stateid,
&lgp->args.ctx->state->stateid)) {
spin_unlock(&inode->i_lock);
/* If the open stateid was bad, then recover it. */
- state = lgp->args.ctx->state;
+ exception->state = lgp->args.ctx->state;
break;
}
lo = NFS_I(inode)->layout;
@@ -7935,20 +7916,16 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
pnfs_free_lseg_list(&head);
} else
spin_unlock(&inode->i_lock);
- goto out_restart;
+ status = -EAGAIN;
+ goto out;
}
- if (nfs4_async_handle_error(task, server, state, &lgp->timeout) == -EAGAIN)
- goto out_restart;
+
+ status = nfs4_handle_exception(server, status, exception);
+ if (exception->retry)
+ status = -EAGAIN;
out:
dprintk("<-- %s\n", __func__);
- return;
-out_restart:
- task->tk_status = 0;
- rpc_restart_call_prepare(task);
- return;
-out_overflow:
- task->tk_status = -EOVERFLOW;
- goto out;
+ return status;
}
static size_t max_response_pages(struct nfs_server *server)
@@ -8017,7 +7994,7 @@ static const struct rpc_call_ops nfs4_layoutget_call_ops = {
};
struct pnfs_layout_segment *
-nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
+nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags)
{
struct inode *inode = lgp->args.inode;
struct nfs_server *server = NFS_SERVER(inode);
@@ -8037,6 +8014,7 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
.flags = RPC_TASK_ASYNC,
};
struct pnfs_layout_segment *lseg = NULL;
+ struct nfs4_exception exception = { .timeout = *timeout };
int status = 0;
dprintk("--> %s\n", __func__);
@@ -8050,7 +8028,6 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
return ERR_PTR(-ENOMEM);
}
lgp->args.layout.pglen = max_pages * PAGE_SIZE;
- lgp->args.timestamp = jiffies;
lgp->res.layoutp = &lgp->args.layout;
lgp->res.seq_res.sr_slot = NULL;
@@ -8060,13 +8037,17 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
if (IS_ERR(task))
return ERR_CAST(task);
status = nfs4_wait_for_completion_rpc_task(task);
- if (status == 0)
- status = task->tk_status;
+ if (status == 0) {
+ status = nfs4_layoutget_handle_exception(task, lgp, &exception);
+ *timeout = exception.timeout;
+ }
+
trace_nfs4_layoutget(lgp->args.ctx,
&lgp->args.range,
&lgp->res.range,
&lgp->res.stateid,
status);
+
/* if layoutp->len is 0, nfs4_layoutget_prepare called rpc_exit */
if (status == 0 && lgp->res.layoutp->len)
lseg = pnfs_layout_process(lgp);
diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
index 2c8d05dae5b1..9c150b153782 100644
--- a/fs/nfs/nfs4trace.h
+++ b/fs/nfs/nfs4trace.h
@@ -1520,6 +1520,8 @@ DEFINE_NFS4_INODE_EVENT(nfs4_layoutreturn_on_close);
{ PNFS_UPDATE_LAYOUT_FOUND_CACHED, "found cached" }, \
{ PNFS_UPDATE_LAYOUT_RETURN, "layoutreturn" }, \
{ PNFS_UPDATE_LAYOUT_BLOCKED, "layouts blocked" }, \
+ { PNFS_UPDATE_LAYOUT_INVALID_OPEN, "invalid open" }, \
+ { PNFS_UPDATE_LAYOUT_RETRY, "retrying" }, \
{ PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, "sent layoutget" })
TRACE_EVENT(pnfs_update_layout,
@@ -1528,9 +1530,10 @@ TRACE_EVENT(pnfs_update_layout,
u64 count,
enum pnfs_iomode iomode,
struct pnfs_layout_hdr *lo,
+ struct pnfs_layout_segment *lseg,
enum pnfs_update_layout_reason reason
),
- TP_ARGS(inode, pos, count, iomode, lo, reason),
+ TP_ARGS(inode, pos, count, iomode, lo, lseg, reason),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(u64, fileid)
@@ -1540,6 +1543,7 @@ TRACE_EVENT(pnfs_update_layout,
__field(enum pnfs_iomode, iomode)
__field(int, layoutstateid_seq)
__field(u32, layoutstateid_hash)
+ __field(long, lseg)
__field(enum pnfs_update_layout_reason, reason)
),
TP_fast_assign(
@@ -1559,11 +1563,12 @@ TRACE_EVENT(pnfs_update_layout,
__entry->layoutstateid_seq = 0;
__entry->layoutstateid_hash = 0;
}
+ __entry->lseg = (long)lseg;
),
TP_printk(
"fileid=%02x:%02x:%llu fhandle=0x%08x "
"iomode=%s pos=%llu count=%llu "
- "layoutstateid=%d:0x%08x (%s)",
+ "layoutstateid=%d:0x%08x lseg=0x%lx (%s)",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long long)__entry->fileid,
__entry->fhandle,
@@ -1571,6 +1576,7 @@ TRACE_EVENT(pnfs_update_layout,
(unsigned long long)__entry->pos,
(unsigned long long)__entry->count,
__entry->layoutstateid_seq, __entry->layoutstateid_hash,
+ __entry->lseg,
show_pnfs_update_layout_reason(__entry->reason)
)
);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index bc2c3ec98d32..e97895b427c0 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -796,45 +796,18 @@ pnfs_layoutgets_blocked(const struct pnfs_layout_hdr *lo)
test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags);
}
-int
-pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,
- const struct pnfs_layout_range *range,
- struct nfs4_state *open_state)
-{
- int status = 0;
-
- dprintk("--> %s\n", __func__);
- spin_lock(&lo->plh_inode->i_lock);
- if (pnfs_layoutgets_blocked(lo)) {
- status = -EAGAIN;
- } else if (!nfs4_valid_open_stateid(open_state)) {
- status = -EBADF;
- } else if (list_empty(&lo->plh_segs) ||
- test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
- int seq;
-
- do {
- seq = read_seqbegin(&open_state->seqlock);
- nfs4_stateid_copy(dst, &open_state->stateid);
- } while (read_seqretry(&open_state->seqlock, seq));
- } else
- nfs4_stateid_copy(dst, &lo->plh_stateid);
- spin_unlock(&lo->plh_inode->i_lock);
- dprintk("<-- %s\n", __func__);
- return status;
-}
-
/*
-* Get layout from server.
-* for now, assume that whole file layouts are requested.
-* arg->offset: 0
-* arg->length: all ones
-*/
+ * Get layout from server.
+ * for now, assume that whole file layouts are requested.
+ * arg->offset: 0
+ * arg->length: all ones
+ */
static struct pnfs_layout_segment *
send_layoutget(struct pnfs_layout_hdr *lo,
struct nfs_open_context *ctx,
+ nfs4_stateid *stateid,
const struct pnfs_layout_range *range,
- gfp_t gfp_flags)
+ long *timeout, gfp_t gfp_flags)
{
struct inode *ino = lo->plh_inode;
struct nfs_server *server = NFS_SERVER(ino);
@@ -868,10 +841,11 @@ send_layoutget(struct pnfs_layout_hdr *lo,
lgp->args.type = server->pnfs_curr_ld->id;
lgp->args.inode = ino;
lgp->args.ctx = get_nfs_open_context(ctx);
+ nfs4_stateid_copy(&lgp->args.stateid, stateid);
lgp->gfp_flags = gfp_flags;
lgp->cred = lo->plh_lc_cred;
- return nfs4_proc_layoutget(lgp, gfp_flags);
+ return nfs4_proc_layoutget(lgp, timeout, gfp_flags);
}
static void pnfs_clear_layoutcommit(struct inode *inode,
@@ -1511,27 +1485,30 @@ pnfs_update_layout(struct inode *ino,
.offset = pos,
.length = count,
};
- unsigned pg_offset;
+ unsigned pg_offset, seq;
struct nfs_server *server = NFS_SERVER(ino);
struct nfs_client *clp = server->nfs_client;
- struct pnfs_layout_hdr *lo;
+ struct pnfs_layout_hdr *lo = NULL;
struct pnfs_layout_segment *lseg = NULL;
+ nfs4_stateid stateid;
+ long timeout = 0;
+ unsigned long giveup = jiffies + rpc_get_timeout(server->client);
bool first;
if (!pnfs_enabled_sb(NFS_SERVER(ino))) {
- trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_NO_PNFS);
goto out;
}
if (iomode == IOMODE_READ && i_size_read(ino) == 0) {
- trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_RD_ZEROLEN);
goto out;
}
if (pnfs_within_mdsthreshold(ctx, ino, iomode)) {
- trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_MDSTHRESH);
goto out;
}
@@ -1542,14 +1519,14 @@ lookup_again:
lo = pnfs_find_alloc_layout(ino, ctx, gfp_flags);
if (lo == NULL) {
spin_unlock(&ino->i_lock);
- trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_NOMEM);
goto out;
}
/* Do we even need to bother with this? */
if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_BULK_RECALL);
dprintk("%s matches recall, use MDS\n", __func__);
goto out_unlock;
@@ -1557,14 +1534,34 @@ lookup_again:
/* if LAYOUTGET already failed once we don't try again */
if (pnfs_layout_io_test_failed(lo, iomode)) {
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_IO_TEST_FAIL);
goto out_unlock;
}
- first = list_empty(&lo->plh_segs);
- if (first) {
- /* The first layoutget for the file. Need to serialize per
+ lseg = pnfs_find_lseg(lo, &arg);
+ if (lseg) {
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
+ PNFS_UPDATE_LAYOUT_FOUND_CACHED);
+ goto out_unlock;
+ }
+
+ if (!nfs4_valid_open_stateid(ctx->state)) {
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
+ PNFS_UPDATE_LAYOUT_INVALID_OPEN);
+ goto out_unlock;
+ }
+
+ /*
+ * Choose a stateid for the LAYOUTGET. If we don't have a layout
+ * stateid, or it has been invalidated, then we must use the open
+ * stateid.
+ */
+ if (lo->plh_stateid.seqid == 0 ||
+ test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
+
+ /*
+ * The first layoutget for the file. Need to serialize per
* RFC 5661 Errata 3208.
*/
if (test_and_set_bit(NFS_LAYOUT_FIRST_LAYOUTGET,
@@ -1573,18 +1570,17 @@ lookup_again:
wait_on_bit(&lo->plh_flags, NFS_LAYOUT_FIRST_LAYOUTGET,
TASK_UNINTERRUPTIBLE);
pnfs_put_layout_hdr(lo);
+ dprintk("%s retrying\n", __func__);
goto lookup_again;
}
+
+ first = true;
+ do {
+ seq = read_seqbegin(&ctx->state->seqlock);
+ nfs4_stateid_copy(&stateid, &ctx->state->stateid);
+ } while (read_seqretry(&ctx->state->seqlock, seq));
} else {
- /* Check to see if the layout for the given range
- * already exists
- */
- lseg = pnfs_find_lseg(lo, &arg);
- if (lseg) {
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
- PNFS_UPDATE_LAYOUT_FOUND_CACHED);
- goto out_unlock;
- }
+ nfs4_stateid_copy(&stateid, &lo->plh_stateid);
}
/*
@@ -1599,15 +1595,17 @@ lookup_again:
pnfs_clear_first_layoutget(lo);
pnfs_put_layout_hdr(lo);
dprintk("%s retrying\n", __func__);
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ lseg, PNFS_UPDATE_LAYOUT_RETRY);
goto lookup_again;
}
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_RETURN);
goto out_put_layout_hdr;
}
if (pnfs_layoutgets_blocked(lo)) {
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
PNFS_UPDATE_LAYOUT_BLOCKED);
goto out_unlock;
}
@@ -1632,26 +1630,36 @@ lookup_again:
if (arg.length != NFS4_MAX_UINT64)
arg.length = PAGE_ALIGN(arg.length);
- lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
+ lseg = send_layoutget(lo, ctx, &stateid, &arg, &timeout, gfp_flags);
+ trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
+ PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
if (IS_ERR(lseg)) {
- if (lseg == ERR_PTR(-EAGAIN)) {
+ switch(PTR_ERR(lseg)) {
+ case -ERECALLCONFLICT:
+ if (time_after(jiffies, giveup))
+ lseg = NULL;
+ /* Fallthrough */
+ case -EAGAIN:
+ pnfs_put_layout_hdr(lo);
if (first)
pnfs_clear_first_layoutget(lo);
- pnfs_put_layout_hdr(lo);
- goto lookup_again;
- }
-
- if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
- pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
- lseg = NULL;
+ if (lseg) {
+ trace_pnfs_update_layout(ino, pos, count,
+ iomode, lo, lseg, PNFS_UPDATE_LAYOUT_RETRY);
+ goto lookup_again;
+ }
+ /* Fallthrough */
+ default:
+ if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
+ pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
+ lseg = NULL;
+ }
}
} else {
pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
}
atomic_dec(&lo->plh_outstanding);
- trace_pnfs_update_layout(ino, pos, count, iomode, lo,
- PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
out_put_layout_hdr:
if (first)
pnfs_clear_first_layoutget(lo);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 971068b58647..f9f3331bef49 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -228,7 +228,7 @@ extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
struct pnfs_device *dev,
struct rpc_cred *cred);
-extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags);
+extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags);
extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp, bool sync);
/* pnfs.c */
@@ -260,10 +260,6 @@ void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo);
void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
const nfs4_stateid *new,
bool update_barrier);
-int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
- struct pnfs_layout_hdr *lo,
- const struct pnfs_layout_range *range,
- struct nfs4_state *open_state);
int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
struct list_head *tmp_list,
const struct pnfs_layout_range *recall_range,
diff --git a/include/linux/errno.h b/include/linux/errno.h
index 89627b9187f9..7ce9fb1b7d28 100644
--- a/include/linux/errno.h
+++ b/include/linux/errno.h
@@ -28,5 +28,6 @@
#define EBADTYPE 527 /* Type not supported by server */
#define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
#define EIOCBQUEUED 529 /* iocb queued, will get completion event */
+#define ERECALLCONFLICT 530 /* conflict with recalled state */
#endif
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 011433478a14..f4870a330290 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -621,7 +621,9 @@ enum pnfs_update_layout_reason {
PNFS_UPDATE_LAYOUT_IO_TEST_FAIL,
PNFS_UPDATE_LAYOUT_FOUND_CACHED,
PNFS_UPDATE_LAYOUT_RETURN,
+ PNFS_UPDATE_LAYOUT_RETRY,
PNFS_UPDATE_LAYOUT_BLOCKED,
+ PNFS_UPDATE_LAYOUT_INVALID_OPEN,
PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET,
};
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index cb9982d8f38f..a4cb8a33ae2c 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -233,7 +233,6 @@ struct nfs4_layoutget_args {
struct inode *inode;
struct nfs_open_context *ctx;
nfs4_stateid stateid;
- unsigned long timestamp;
struct nfs4_layoutdriver_data layout;
};
@@ -251,7 +250,6 @@ struct nfs4_layoutget {
struct nfs4_layoutget_res res;
struct rpc_cred *cred;
gfp_t gfp_flags;
- long timeout;
};
struct nfs4_getdeviceinfo_args {
--
2.5.5
Jeff Layton <[email protected]> writes:
> There are several problems in the way a stateid is selected for a
> LAYOUTGET operation:
>
> We pick a stateid to use in the RPC prepare op, but that makes
> it difficult to serialize LAYOUTGETs that use the open stateid. That
> serialization is done in pnfs_update_layout, which occurs well before
> the rpc_prepare operation.
>
> Between those two events, the i_lock is dropped and reacquired.
> pnfs_update_layout can find that the list has lsegs in it and not do any
> serialization, but then later pnfs_choose_layoutget_stateid ends up
> choosing the open stateid.
>
> This patch changes the client to select the stateid to use in the
> LAYOUTGET earlier, when we're searching for a usable layout segment.
> This way we can do it all while holding the i_lock the first time, and
> ensure that we serialize any LAYOUTGET call that uses a non-layout
> stateid.
>
> This also means a rework of how LAYOUTGET replies are handled, as we
> must now get the latest stateid if we want to retransmit in response
> to a retryable error.
>
> Most of those errors boil down to the fact that the layout state has
> changed in some fashion. Thus, what we really want to do is to re-search
> for a layout when it fails with a retryable error, so that we can avoid
> reissuing the RPC at all if possible.
>
> While the LAYOUTGET RPC is async, the initiating thread always waits for
> it to complete, so it's effectively synchronous anyway. Currently, when
> we need to retry a LAYOUTGET because of an error, we drive that retry
> via the rpc state machine.
>
> This means that once the call has been submitted, it runs until it
> completes. So, we must move the error handling for this RPC out of the
> rpc_call_done operation and into the caller.
>
> In order to handle errors like NFS4ERR_DELAY properly, we must also
> pass a pointer to the sliding timeout, which is now moved to the stack
> in pnfs_update_layout.
>
> The complicating errors are -NFS4ERR_RECALLCONFLICT and
> -NFS4ERR_LAYOUTTRYLATER, as those involve a timeout after which we give
> up and return NULL back to the caller. So, there is some special
> handling for those errors to ensure that the layers driving the retries
> can handle that appropriately.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/nfs4proc.c | 115 ++++++++++++++++----------------------
> fs/nfs/nfs4trace.h | 10 +++-
> fs/nfs/pnfs.c | 144 +++++++++++++++++++++++++-----------------------
> fs/nfs/pnfs.h | 6 +-
> include/linux/errno.h | 1 +
> include/linux/nfs4.h | 2 +
> include/linux/nfs_xdr.h | 2 -
> 7 files changed, 136 insertions(+), 144 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index c0d75be8cb69..c2583ca6c8b6 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -416,6 +416,7 @@ static int nfs4_do_handle_exception(struct nfs_server *server,
> case -NFS4ERR_DELAY:
> nfs_inc_server_stats(server, NFSIOS_DELAY);
> case -NFS4ERR_GRACE:
> + case -NFS4ERR_RECALLCONFLICT:
> exception->delay = 1;
> return 0;
>
> @@ -7824,40 +7825,34 @@ nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
> struct nfs4_layoutget *lgp = calldata;
> struct nfs_server *server = NFS_SERVER(lgp->args.inode);
> struct nfs4_session *session = nfs4_get_session(server);
> - int ret;
>
> dprintk("--> %s\n", __func__);
> - /* Note the is a race here, where a CB_LAYOUTRECALL can come in
> - * right now covering the LAYOUTGET we are about to send.
> - * However, that is not so catastrophic, and there seems
> - * to be no way to prevent it completely.
> - */
> - if (nfs41_setup_sequence(session, &lgp->args.seq_args,
> - &lgp->res.seq_res, task))
> - return;
> - ret = pnfs_choose_layoutget_stateid(&lgp->args.stateid,
> - NFS_I(lgp->args.inode)->layout,
> - &lgp->args.range,
> - lgp->args.ctx->state);
> - if (ret < 0)
> - rpc_exit(task, ret);
> + nfs41_setup_sequence(session, &lgp->args.seq_args,
> + &lgp->res.seq_res, task);
> + dprintk("<-- %s\n", __func__);
> }
>
> static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> {
> struct nfs4_layoutget *lgp = calldata;
> +
> + dprintk("--> %s\n", __func__);
> + nfs41_sequence_done(task, &lgp->res.seq_res);
> + dprintk("<-- %s\n", __func__);
> +}
> +
> +static int
> +nfs4_layoutget_handle_exception(struct rpc_task *task,
> + struct nfs4_layoutget *lgp, struct nfs4_exception *exception)
> +{
> struct inode *inode = lgp->args.inode;
> struct nfs_server *server = NFS_SERVER(inode);
> struct pnfs_layout_hdr *lo;
> - struct nfs4_state *state = NULL;
> - unsigned long timeo, now, giveup;
> + int status = task->tk_status;
>
> dprintk("--> %s tk_status => %d\n", __func__, -task->tk_status);
>
> - if (!nfs41_sequence_done(task, &lgp->res.seq_res))
> - goto out;
> -
> - switch (task->tk_status) {
> + switch (status) {
> case 0:
> goto out;
>
> @@ -7867,57 +7862,43 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> * retry go inband.
> */
> case -NFS4ERR_LAYOUTUNAVAILABLE:
> - task->tk_status = -ENODATA;
> + status = -ENODATA;
> goto out;
> /*
> * NFS4ERR_BADLAYOUT means the MDS cannot return a layout of
> * length lgp->args.minlength != 0 (see RFC5661 section 18.43.3).
> */
> case -NFS4ERR_BADLAYOUT:
> - goto out_overflow;
> + status = -EOVERFLOW;
> + goto out;
> /*
> * NFS4ERR_LAYOUTTRYLATER is a conflict with another client
> * (or clients) writing to the same RAID stripe except when
> * the minlength argument is 0 (see RFC5661 section 18.43.3).
> + *
> + * Treat it like we would RECALLCONFLICT -- we retry for a little
> + * while, and then eventually give up.
> */
> case -NFS4ERR_LAYOUTTRYLATER:
> - if (lgp->args.minlength == 0)
> - goto out_overflow;
> - /*
> - * NFS4ERR_RECALLCONFLICT is when conflict with self (must recall
> - * existing layout before getting a new one).
> - */
> - case -NFS4ERR_RECALLCONFLICT:
> - timeo = rpc_get_timeout(task->tk_client);
> - giveup = lgp->args.timestamp + timeo;
> - now = jiffies;
> - if (time_after(giveup, now)) {
> - unsigned long delay;
> -
> - /* Delay for:
> - * - Not less then NFS4_POLL_RETRY_MIN.
> - * - One last time a jiffie before we give up
> - * - exponential backoff (time_now minus start_attempt)
> - */
> - delay = max_t(unsigned long, NFS4_POLL_RETRY_MIN,
> - min((giveup - now - 1),
> - now - lgp->args.timestamp));
> -
> - dprintk("%s: NFS4ERR_RECALLCONFLICT waiting %lu\n",
> - __func__, delay);
> - rpc_delay(task, delay);
> - /* Do not call nfs4_async_handle_error() */
> - goto out_restart;
> + if (lgp->args.minlength == 0) {
> + status = -EOVERFLOW;
> + goto out;
> }
> - break;
> + /* Fallthrough */
> + case -NFS4ERR_RECALLCONFLICT:
> + nfs4_handle_exception(server, -NFS4ERR_RECALLCONFLICT,
> + exception);
> + status = -ERECALLCONFLICT;
> + goto out;
> case -NFS4ERR_EXPIRED:
> case -NFS4ERR_BAD_STATEID:
> + exception->timeout = 0;
> spin_lock(&inode->i_lock);
> if (nfs4_stateid_match(&lgp->args.stateid,
> &lgp->args.ctx->state->stateid)) {
> spin_unlock(&inode->i_lock);
> /* If the open stateid was bad, then recover it. */
> - state = lgp->args.ctx->state;
> + exception->state = lgp->args.ctx->state;
> break;
> }
> lo = NFS_I(inode)->layout;
> @@ -7935,20 +7916,16 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> pnfs_free_lseg_list(&head);
> } else
> spin_unlock(&inode->i_lock);
> - goto out_restart;
> + status = -EAGAIN;
> + goto out;
> }
> - if (nfs4_async_handle_error(task, server, state, &lgp->timeout) == -EAGAIN)
> - goto out_restart;
> +
> + status = nfs4_handle_exception(server, status, exception);
> + if (exception->retry)
> + status = -EAGAIN;
> out:
> dprintk("<-- %s\n", __func__);
> - return;
> -out_restart:
> - task->tk_status = 0;
> - rpc_restart_call_prepare(task);
> - return;
> -out_overflow:
> - task->tk_status = -EOVERFLOW;
> - goto out;
> + return status;
> }
>
> static size_t max_response_pages(struct nfs_server *server)
> @@ -8017,7 +7994,7 @@ static const struct rpc_call_ops nfs4_layoutget_call_ops = {
> };
>
> struct pnfs_layout_segment *
> -nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> +nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags)
> {
> struct inode *inode = lgp->args.inode;
> struct nfs_server *server = NFS_SERVER(inode);
> @@ -8037,6 +8014,7 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> .flags = RPC_TASK_ASYNC,
> };
> struct pnfs_layout_segment *lseg = NULL;
> + struct nfs4_exception exception = { .timeout = *timeout };
> int status = 0;
>
> dprintk("--> %s\n", __func__);
> @@ -8050,7 +8028,6 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> return ERR_PTR(-ENOMEM);
> }
> lgp->args.layout.pglen = max_pages * PAGE_SIZE;
> - lgp->args.timestamp = jiffies;
>
> lgp->res.layoutp = &lgp->args.layout;
> lgp->res.seq_res.sr_slot = NULL;
> @@ -8060,13 +8037,17 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> if (IS_ERR(task))
> return ERR_CAST(task);
> status = nfs4_wait_for_completion_rpc_task(task);
> - if (status == 0)
> - status = task->tk_status;
> + if (status == 0) {
> + status = nfs4_layoutget_handle_exception(task, lgp, &exception);
> + *timeout = exception.timeout;
> + }
> +
> trace_nfs4_layoutget(lgp->args.ctx,
> &lgp->args.range,
> &lgp->res.range,
> &lgp->res.stateid,
> status);
> +
> /* if layoutp->len is 0, nfs4_layoutget_prepare called rpc_exit */
> if (status == 0 && lgp->res.layoutp->len)
> lseg = pnfs_layout_process(lgp);
> diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
> index 2c8d05dae5b1..9c150b153782 100644
> --- a/fs/nfs/nfs4trace.h
> +++ b/fs/nfs/nfs4trace.h
> @@ -1520,6 +1520,8 @@ DEFINE_NFS4_INODE_EVENT(nfs4_layoutreturn_on_close);
> { PNFS_UPDATE_LAYOUT_FOUND_CACHED, "found cached" }, \
> { PNFS_UPDATE_LAYOUT_RETURN, "layoutreturn" }, \
> { PNFS_UPDATE_LAYOUT_BLOCKED, "layouts blocked" }, \
> + { PNFS_UPDATE_LAYOUT_INVALID_OPEN, "invalid open" }, \
> + { PNFS_UPDATE_LAYOUT_RETRY, "retrying" }, \
> { PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, "sent layoutget" })
>
> TRACE_EVENT(pnfs_update_layout,
> @@ -1528,9 +1530,10 @@ TRACE_EVENT(pnfs_update_layout,
> u64 count,
> enum pnfs_iomode iomode,
> struct pnfs_layout_hdr *lo,
> + struct pnfs_layout_segment *lseg,
> enum pnfs_update_layout_reason reason
> ),
> - TP_ARGS(inode, pos, count, iomode, lo, reason),
> + TP_ARGS(inode, pos, count, iomode, lo, lseg, reason),
> TP_STRUCT__entry(
> __field(dev_t, dev)
> __field(u64, fileid)
> @@ -1540,6 +1543,7 @@ TRACE_EVENT(pnfs_update_layout,
> __field(enum pnfs_iomode, iomode)
> __field(int, layoutstateid_seq)
> __field(u32, layoutstateid_hash)
> + __field(long, lseg)
> __field(enum pnfs_update_layout_reason, reason)
> ),
> TP_fast_assign(
> @@ -1559,11 +1563,12 @@ TRACE_EVENT(pnfs_update_layout,
> __entry->layoutstateid_seq = 0;
> __entry->layoutstateid_hash = 0;
> }
> + __entry->lseg = (long)lseg;
> ),
> TP_printk(
> "fileid=%02x:%02x:%llu fhandle=0x%08x "
> "iomode=%s pos=%llu count=%llu "
> - "layoutstateid=%d:0x%08x (%s)",
> + "layoutstateid=%d:0x%08x lseg=0x%lx (%s)",
> MAJOR(__entry->dev), MINOR(__entry->dev),
> (unsigned long long)__entry->fileid,
> __entry->fhandle,
> @@ -1571,6 +1576,7 @@ TRACE_EVENT(pnfs_update_layout,
> (unsigned long long)__entry->pos,
> (unsigned long long)__entry->count,
> __entry->layoutstateid_seq, __entry->layoutstateid_hash,
> + __entry->lseg,
> show_pnfs_update_layout_reason(__entry->reason)
> )
> );
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index bc2c3ec98d32..e97895b427c0 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -796,45 +796,18 @@ pnfs_layoutgets_blocked(const struct pnfs_layout_hdr *lo)
> test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags);
> }
>
> -int
> -pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,
> - const struct pnfs_layout_range *range,
> - struct nfs4_state *open_state)
> -{
> - int status = 0;
> -
> - dprintk("--> %s\n", __func__);
> - spin_lock(&lo->plh_inode->i_lock);
> - if (pnfs_layoutgets_blocked(lo)) {
> - status = -EAGAIN;
> - } else if (!nfs4_valid_open_stateid(open_state)) {
> - status = -EBADF;
> - } else if (list_empty(&lo->plh_segs) ||
> - test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
> - int seq;
> -
> - do {
> - seq = read_seqbegin(&open_state->seqlock);
> - nfs4_stateid_copy(dst, &open_state->stateid);
> - } while (read_seqretry(&open_state->seqlock, seq));
> - } else
> - nfs4_stateid_copy(dst, &lo->plh_stateid);
> - spin_unlock(&lo->plh_inode->i_lock);
> - dprintk("<-- %s\n", __func__);
> - return status;
> -}
> -
> /*
> -* Get layout from server.
> -* for now, assume that whole file layouts are requested.
> -* arg->offset: 0
> -* arg->length: all ones
> -*/
> + * Get layout from server.
> + * for now, assume that whole file layouts are requested.
> + * arg->offset: 0
> + * arg->length: all ones
> + */
> static struct pnfs_layout_segment *
> send_layoutget(struct pnfs_layout_hdr *lo,
> struct nfs_open_context *ctx,
> + nfs4_stateid *stateid,
> const struct pnfs_layout_range *range,
> - gfp_t gfp_flags)
> + long *timeout, gfp_t gfp_flags)
> {
> struct inode *ino = lo->plh_inode;
> struct nfs_server *server = NFS_SERVER(ino);
> @@ -868,10 +841,11 @@ send_layoutget(struct pnfs_layout_hdr *lo,
> lgp->args.type = server->pnfs_curr_ld->id;
> lgp->args.inode = ino;
> lgp->args.ctx = get_nfs_open_context(ctx);
> + nfs4_stateid_copy(&lgp->args.stateid, stateid);
> lgp->gfp_flags = gfp_flags;
> lgp->cred = lo->plh_lc_cred;
>
> - return nfs4_proc_layoutget(lgp, gfp_flags);
> + return nfs4_proc_layoutget(lgp, timeout, gfp_flags);
> }
>
> static void pnfs_clear_layoutcommit(struct inode *inode,
> @@ -1511,27 +1485,30 @@ pnfs_update_layout(struct inode *ino,
> .offset = pos,
> .length = count,
> };
> - unsigned pg_offset;
> + unsigned pg_offset, seq;
> struct nfs_server *server = NFS_SERVER(ino);
> struct nfs_client *clp = server->nfs_client;
> - struct pnfs_layout_hdr *lo;
> + struct pnfs_layout_hdr *lo = NULL;
> struct pnfs_layout_segment *lseg = NULL;
> + nfs4_stateid stateid;
> + long timeout = 0;
> + unsigned long giveup = jiffies + rpc_get_timeout(server->client);
> bool first;
>
> if (!pnfs_enabled_sb(NFS_SERVER(ino))) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_NO_PNFS);
> goto out;
> }
>
> if (iomode == IOMODE_READ && i_size_read(ino) == 0) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_RD_ZEROLEN);
> goto out;
> }
>
> if (pnfs_within_mdsthreshold(ctx, ino, iomode)) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_MDSTHRESH);
> goto out;
> }
> @@ -1542,14 +1519,14 @@ lookup_again:
> lo = pnfs_find_alloc_layout(ino, ctx, gfp_flags);
> if (lo == NULL) {
> spin_unlock(&ino->i_lock);
> - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_NOMEM);
> goto out;
> }
>
> /* Do we even need to bother with this? */
> if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_BULK_RECALL);
> dprintk("%s matches recall, use MDS\n", __func__);
> goto out_unlock;
> @@ -1557,14 +1534,34 @@ lookup_again:
>
> /* if LAYOUTGET already failed once we don't try again */
> if (pnfs_layout_io_test_failed(lo, iomode)) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_IO_TEST_FAIL);
> goto out_unlock;
> }
>
> - first = list_empty(&lo->plh_segs);
> - if (first) {
> - /* The first layoutget for the file. Need to serialize per
> + lseg = pnfs_find_lseg(lo, &arg);
> + if (lseg) {
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> + PNFS_UPDATE_LAYOUT_FOUND_CACHED);
> + goto out_unlock;
> + }
> +
> + if (!nfs4_valid_open_stateid(ctx->state)) {
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> + PNFS_UPDATE_LAYOUT_INVALID_OPEN);
> + goto out_unlock;
> + }
> +
> + /*
> + * Choose a stateid for the LAYOUTGET. If we don't have a layout
> + * stateid, or it has been invalidated, then we must use the open
> + * stateid.
> + */
> + if (lo->plh_stateid.seqid == 0 ||
> + test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
> +
> + /*
> + * The first layoutget for the file. Need to serialize per
> * RFC 5661 Errata 3208.
> */
> if (test_and_set_bit(NFS_LAYOUT_FIRST_LAYOUTGET,
> @@ -1573,18 +1570,17 @@ lookup_again:
> wait_on_bit(&lo->plh_flags, NFS_LAYOUT_FIRST_LAYOUTGET,
> TASK_UNINTERRUPTIBLE);
> pnfs_put_layout_hdr(lo);
> + dprintk("%s retrying\n", __func__);
> goto lookup_again;
> }
> +
> + first = true;
> + do {
> + seq = read_seqbegin(&ctx->state->seqlock);
> + nfs4_stateid_copy(&stateid, &ctx->state->stateid);
> + } while (read_seqretry(&ctx->state->seqlock, seq));
> } else {
> - /* Check to see if the layout for the given range
> - * already exists
> - */
> - lseg = pnfs_find_lseg(lo, &arg);
> - if (lseg) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> - PNFS_UPDATE_LAYOUT_FOUND_CACHED);
> - goto out_unlock;
> - }
> + nfs4_stateid_copy(&stateid, &lo->plh_stateid);
> }
>
> /*
> @@ -1599,15 +1595,17 @@ lookup_again:
> pnfs_clear_first_layoutget(lo);
> pnfs_put_layout_hdr(lo);
> dprintk("%s retrying\n", __func__);
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> + lseg, PNFS_UPDATE_LAYOUT_RETRY);
> goto lookup_again;
> }
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_RETURN);
> goto out_put_layout_hdr;
> }
>
> if (pnfs_layoutgets_blocked(lo)) {
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> PNFS_UPDATE_LAYOUT_BLOCKED);
> goto out_unlock;
> }
> @@ -1632,26 +1630,36 @@ lookup_again:
> if (arg.length != NFS4_MAX_UINT64)
> arg.length = PAGE_ALIGN(arg.length);
>
> - lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
> + lseg = send_layoutget(lo, ctx, &stateid, &arg, &timeout, gfp_flags);
> + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> + PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
> if (IS_ERR(lseg)) {
> - if (lseg == ERR_PTR(-EAGAIN)) {
> + switch(PTR_ERR(lseg)) {
> + case -ERECALLCONFLICT:
> + if (time_after(jiffies, giveup))
> + lseg = NULL;
> + /* Fallthrough */
> + case -EAGAIN:
> + pnfs_put_layout_hdr(lo);
> if (first)
> pnfs_clear_first_layoutget(lo);
> - pnfs_put_layout_hdr(lo);
> - goto lookup_again;
> - }
> -
> - if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> - pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> - lseg = NULL;
> + if (lseg) {
> + trace_pnfs_update_layout(ino, pos, count,
> + iomode, lo, lseg, PNFS_UPDATE_LAYOUT_RETRY);
> + goto lookup_again;
> + }
> + /* Fallthrough */
> + default:
> + if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> + pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> + lseg = NULL;
> + }
Seems like in the past, a non-fatal-error used to trigger the opposite
behavior, where this would set the fail_bit? Shouldn't that be the
behavior for -NFS4ERR_LAYOUTUNAVAILABLE (which is mapped to -ENODATA)
etc...?
> }
> } else {
> pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> }
>
> atomic_dec(&lo->plh_outstanding);
> - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> - PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
> out_put_layout_hdr:
> if (first)
> pnfs_clear_first_layoutget(lo);
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 971068b58647..f9f3331bef49 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -228,7 +228,7 @@ extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
> extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
> struct pnfs_device *dev,
> struct rpc_cred *cred);
> -extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags);
> +extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags);
> extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp, bool sync);
>
> /* pnfs.c */
> @@ -260,10 +260,6 @@ void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo);
> void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
> const nfs4_stateid *new,
> bool update_barrier);
> -int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
> - struct pnfs_layout_hdr *lo,
> - const struct pnfs_layout_range *range,
> - struct nfs4_state *open_state);
> int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
> struct list_head *tmp_list,
> const struct pnfs_layout_range *recall_range,
> diff --git a/include/linux/errno.h b/include/linux/errno.h
> index 89627b9187f9..7ce9fb1b7d28 100644
> --- a/include/linux/errno.h
> +++ b/include/linux/errno.h
> @@ -28,5 +28,6 @@
> #define EBADTYPE 527 /* Type not supported by server */
> #define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
> #define EIOCBQUEUED 529 /* iocb queued, will get completion event */
> +#define ERECALLCONFLICT 530 /* conflict with recalled state */
>
> #endif
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 011433478a14..f4870a330290 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -621,7 +621,9 @@ enum pnfs_update_layout_reason {
> PNFS_UPDATE_LAYOUT_IO_TEST_FAIL,
> PNFS_UPDATE_LAYOUT_FOUND_CACHED,
> PNFS_UPDATE_LAYOUT_RETURN,
> + PNFS_UPDATE_LAYOUT_RETRY,
> PNFS_UPDATE_LAYOUT_BLOCKED,
> + PNFS_UPDATE_LAYOUT_INVALID_OPEN,
> PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET,
> };
>
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index cb9982d8f38f..a4cb8a33ae2c 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -233,7 +233,6 @@ struct nfs4_layoutget_args {
> struct inode *inode;
> struct nfs_open_context *ctx;
> nfs4_stateid stateid;
> - unsigned long timestamp;
> struct nfs4_layoutdriver_data layout;
> };
>
> @@ -251,7 +250,6 @@ struct nfs4_layoutget {
> struct nfs4_layoutget_res res;
> struct rpc_cred *cred;
> gfp_t gfp_flags;
> - long timeout;
> };
>
> struct nfs4_getdeviceinfo_args {
--
Andrew W. Elble
[email protected]
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912
On Tue, 2016-06-28 at 08:10 -0400, Andrew W Elble wrote:
> Jeff Layton <[email protected]> writes:
>
> >
> > There are several problems in the way a stateid is selected for a
> > LAYOUTGET operation:
> >
> > We pick a stateid to use in the RPC prepare op, but that makes
> > it difficult to serialize LAYOUTGETs that use the open stateid. That
> > serialization is done in pnfs_update_layout, which occurs well before
> > the rpc_prepare operation.
> >
> > Between those two events, the i_lock is dropped and reacquired.
> > pnfs_update_layout can find that the list has lsegs in it and not do any
> > serialization, but then later pnfs_choose_layoutget_stateid ends up
> > choosing the open stateid.
> >
> > This patch changes the client to select the stateid to use in the
> > LAYOUTGET earlier, when we're searching for a usable layout segment.
> > This way we can do it all while holding the i_lock the first time, and
> > ensure that we serialize any LAYOUTGET call that uses a non-layout
> > stateid.
> >
> > This also means a rework of how LAYOUTGET replies are handled, as we
> > must now get the latest stateid if we want to retransmit in response
> > to a retryable error.
> >
> > Most of those errors boil down to the fact that the layout state has
> > changed in some fashion. Thus, what we really want to do is to re-search
> > for a layout when it fails with a retryable error, so that we can avoid
> > reissuing the RPC at all if possible.
> >
> > While the LAYOUTGET RPC is async, the initiating thread always waits for
> > it to complete, so it's effectively synchronous anyway. Currently, when
> > we need to retry a LAYOUTGET because of an error, we drive that retry
> > via the rpc state machine.
> >
> > This means that once the call has been submitted, it runs until it
> > completes. So, we must move the error handling for this RPC out of the
> > rpc_call_done operation and into the caller.
> >
> > In order to handle errors like NFS4ERR_DELAY properly, we must also
> > pass a pointer to the sliding timeout, which is now moved to the stack
> > in pnfs_update_layout.
> >
> > The complicating errors are -NFS4ERR_RECALLCONFLICT and
> > -NFS4ERR_LAYOUTTRYLATER, as those involve a timeout after which we give
> > up and return NULL back to the caller. So, there is some special
> > handling for those errors to ensure that the layers driving the retries
> > can handle that appropriately.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfs/nfs4proc.c | 115 ++++++++++++++++----------------------
> > fs/nfs/nfs4trace.h | 10 +++-
> > fs/nfs/pnfs.c | 144 +++++++++++++++++++++++++-----------------------
> > fs/nfs/pnfs.h | 6 +-
> > include/linux/errno.h | 1 +
> > include/linux/nfs4.h | 2 +
> > include/linux/nfs_xdr.h | 2 -
> > 7 files changed, 136 insertions(+), 144 deletions(-)
> >
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index c0d75be8cb69..c2583ca6c8b6 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -416,6 +416,7 @@ static int nfs4_do_handle_exception(struct nfs_server *server,
> > case -NFS4ERR_DELAY:
> > nfs_inc_server_stats(server, NFSIOS_DELAY);
> > case -NFS4ERR_GRACE:
> > + case -NFS4ERR_RECALLCONFLICT:
> > exception->delay = 1;
> > return 0;
> >
> > @@ -7824,40 +7825,34 @@ nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
> > struct nfs4_layoutget *lgp = calldata;
> > struct nfs_server *server = NFS_SERVER(lgp->args.inode);
> > struct nfs4_session *session = nfs4_get_session(server);
> > - int ret;
> >
> > dprintk("--> %s\n", __func__);
> > - /* Note the is a race here, where a CB_LAYOUTRECALL can come in
> > - * right now covering the LAYOUTGET we are about to send.
> > - * However, that is not so catastrophic, and there seems
> > - * to be no way to prevent it completely.
> > - */
> > - if (nfs41_setup_sequence(session, &lgp->args.seq_args,
> > - &lgp->res.seq_res, task))
> > - return;
> > - ret = pnfs_choose_layoutget_stateid(&lgp->args.stateid,
> > - NFS_I(lgp->args.inode)->layout,
> > - &lgp->args.range,
> > - lgp->args.ctx->state);
> > - if (ret < 0)
> > - rpc_exit(task, ret);
> > + nfs41_setup_sequence(session, &lgp->args.seq_args,
> > + &lgp->res.seq_res, task);
> > + dprintk("<-- %s\n", __func__);
> > }
> >
> > static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> > {
> > struct nfs4_layoutget *lgp = calldata;
> > +
> > + dprintk("--> %s\n", __func__);
> > + nfs41_sequence_done(task, &lgp->res.seq_res);
> > + dprintk("<-- %s\n", __func__);
> > +}
> > +
> > +static int
> > +nfs4_layoutget_handle_exception(struct rpc_task *task,
> > + struct nfs4_layoutget *lgp, struct nfs4_exception *exception)
> > +{
> > struct inode *inode = lgp->args.inode;
> > struct nfs_server *server = NFS_SERVER(inode);
> > struct pnfs_layout_hdr *lo;
> > - struct nfs4_state *state = NULL;
> > - unsigned long timeo, now, giveup;
> > + int status = task->tk_status;
> >
> > dprintk("--> %s tk_status => %d\n", __func__, -task->tk_status);
> >
> > - if (!nfs41_sequence_done(task, &lgp->res.seq_res))
> > - goto out;
> > -
> > - switch (task->tk_status) {
> > + switch (status) {
> > case 0:
> > goto out;
> >
> > @@ -7867,57 +7862,43 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> > * retry go inband.
> > */
> > case -NFS4ERR_LAYOUTUNAVAILABLE:
> > - task->tk_status = -ENODATA;
> > + status = -ENODATA;
> > goto out;
> > /*
> > * NFS4ERR_BADLAYOUT means the MDS cannot return a layout of
> > * length lgp->args.minlength != 0 (see RFC5661 section 18.43.3).
> > */
> > case -NFS4ERR_BADLAYOUT:
> > - goto out_overflow;
> > + status = -EOVERFLOW;
> > + goto out;
> > /*
> > * NFS4ERR_LAYOUTTRYLATER is a conflict with another client
> > * (or clients) writing to the same RAID stripe except when
> > * the minlength argument is 0 (see RFC5661 section 18.43.3).
> > + *
> > + * Treat it like we would RECALLCONFLICT -- we retry for a little
> > + * while, and then eventually give up.
> > */
> > case -NFS4ERR_LAYOUTTRYLATER:
> > - if (lgp->args.minlength == 0)
> > - goto out_overflow;
> > - /*
> > - * NFS4ERR_RECALLCONFLICT is when conflict with self (must recall
> > - * existing layout before getting a new one).
> > - */
> > - case -NFS4ERR_RECALLCONFLICT:
> > - timeo = rpc_get_timeout(task->tk_client);
> > - giveup = lgp->args.timestamp + timeo;
> > - now = jiffies;
> > - if (time_after(giveup, now)) {
> > - unsigned long delay;
> > -
> > - /* Delay for:
> > - * - Not less then NFS4_POLL_RETRY_MIN.
> > - * - One last time a jiffie before we give up
> > - * - exponential backoff (time_now minus start_attempt)
> > - */
> > - delay = max_t(unsigned long, NFS4_POLL_RETRY_MIN,
> > - min((giveup - now - 1),
> > - now - lgp->args.timestamp));
> > -
> > - dprintk("%s: NFS4ERR_RECALLCONFLICT waiting %lu\n",
> > - __func__, delay);
> > - rpc_delay(task, delay);
> > - /* Do not call nfs4_async_handle_error() */
> > - goto out_restart;
> > + if (lgp->args.minlength == 0) {
> > + status = -EOVERFLOW;
> > + goto out;
> > }
> > - break;
> > + /* Fallthrough */
> > + case -NFS4ERR_RECALLCONFLICT:
> > + nfs4_handle_exception(server, -NFS4ERR_RECALLCONFLICT,
> > + exception);
> > + status = -ERECALLCONFLICT;
> > + goto out;
> > case -NFS4ERR_EXPIRED:
> > case -NFS4ERR_BAD_STATEID:
> > + exception->timeout = 0;
> > spin_lock(&inode->i_lock);
> > if (nfs4_stateid_match(&lgp->args.stateid,
> > &lgp->args.ctx->state->stateid)) {
> > spin_unlock(&inode->i_lock);
> > /* If the open stateid was bad, then recover it. */
> > - state = lgp->args.ctx->state;
> > + exception->state = lgp->args.ctx->state;
> > break;
> > }
> > lo = NFS_I(inode)->layout;
> > @@ -7935,20 +7916,16 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
> > pnfs_free_lseg_list(&head);
> > } else
> > spin_unlock(&inode->i_lock);
> > - goto out_restart;
> > + status = -EAGAIN;
> > + goto out;
> > }
> > - if (nfs4_async_handle_error(task, server, state, &lgp->timeout) == -EAGAIN)
> > - goto out_restart;
> > +
> > + status = nfs4_handle_exception(server, status, exception);
> > + if (exception->retry)
> > + status = -EAGAIN;
> > out:
> > dprintk("<-- %s\n", __func__);
> > - return;
> > -out_restart:
> > - task->tk_status = 0;
> > - rpc_restart_call_prepare(task);
> > - return;
> > -out_overflow:
> > - task->tk_status = -EOVERFLOW;
> > - goto out;
> > + return status;
> > }
> >
> > static size_t max_response_pages(struct nfs_server *server)
> > @@ -8017,7 +7994,7 @@ static const struct rpc_call_ops nfs4_layoutget_call_ops = {
> > };
> >
> > struct pnfs_layout_segment *
> > -nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> > +nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags)
> > {
> > struct inode *inode = lgp->args.inode;
> > struct nfs_server *server = NFS_SERVER(inode);
> > @@ -8037,6 +8014,7 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> > .flags = RPC_TASK_ASYNC,
> > };
> > struct pnfs_layout_segment *lseg = NULL;
> > + struct nfs4_exception exception = { .timeout = *timeout };
> > int status = 0;
> >
> > dprintk("--> %s\n", __func__);
> > @@ -8050,7 +8028,6 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> > return ERR_PTR(-ENOMEM);
> > }
> > lgp->args.layout.pglen = max_pages * PAGE_SIZE;
> > - lgp->args.timestamp = jiffies;
> >
> > lgp->res.layoutp = &lgp->args.layout;
> > lgp->res.seq_res.sr_slot = NULL;
> > @@ -8060,13 +8037,17 @@ nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags)
> > if (IS_ERR(task))
> > return ERR_CAST(task);
> > status = nfs4_wait_for_completion_rpc_task(task);
> > - if (status == 0)
> > - status = task->tk_status;
> > + if (status == 0) {
> > + status = nfs4_layoutget_handle_exception(task, lgp, &exception);
> > + *timeout = exception.timeout;
> > + }
> > +
> > trace_nfs4_layoutget(lgp->args.ctx,
> > &lgp->args.range,
> > &lgp->res.range,
> > &lgp->res.stateid,
> > status);
> > +
> > /* if layoutp->len is 0, nfs4_layoutget_prepare called rpc_exit */
> > if (status == 0 && lgp->res.layoutp->len)
> > lseg = pnfs_layout_process(lgp);
> > diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
> > index 2c8d05dae5b1..9c150b153782 100644
> > --- a/fs/nfs/nfs4trace.h
> > +++ b/fs/nfs/nfs4trace.h
> > @@ -1520,6 +1520,8 @@ DEFINE_NFS4_INODE_EVENT(nfs4_layoutreturn_on_close);
> > { PNFS_UPDATE_LAYOUT_FOUND_CACHED, "found cached" }, \
> > { PNFS_UPDATE_LAYOUT_RETURN, "layoutreturn" }, \
> > { PNFS_UPDATE_LAYOUT_BLOCKED, "layouts blocked" }, \
> > + { PNFS_UPDATE_LAYOUT_INVALID_OPEN, "invalid open" }, \
> > + { PNFS_UPDATE_LAYOUT_RETRY, "retrying" }, \
> > { PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, "sent layoutget" })
> >
> > TRACE_EVENT(pnfs_update_layout,
> > @@ -1528,9 +1530,10 @@ TRACE_EVENT(pnfs_update_layout,
> > u64 count,
> > enum pnfs_iomode iomode,
> > struct pnfs_layout_hdr *lo,
> > + struct pnfs_layout_segment *lseg,
> > enum pnfs_update_layout_reason reason
> > ),
> > - TP_ARGS(inode, pos, count, iomode, lo, reason),
> > + TP_ARGS(inode, pos, count, iomode, lo, lseg, reason),
> > TP_STRUCT__entry(
> > __field(dev_t, dev)
> > __field(u64, fileid)
> > @@ -1540,6 +1543,7 @@ TRACE_EVENT(pnfs_update_layout,
> > __field(enum pnfs_iomode, iomode)
> > __field(int, layoutstateid_seq)
> > __field(u32, layoutstateid_hash)
> > + __field(long, lseg)
> > __field(enum pnfs_update_layout_reason, reason)
> > ),
> > TP_fast_assign(
> > @@ -1559,11 +1563,12 @@ TRACE_EVENT(pnfs_update_layout,
> > __entry->layoutstateid_seq = 0;
> > __entry->layoutstateid_hash = 0;
> > }
> > + __entry->lseg = (long)lseg;
> > ),
> > TP_printk(
> > "fileid=%02x:%02x:%llu fhandle=0x%08x "
> > "iomode=%s pos=%llu count=%llu "
> > - "layoutstateid=%d:0x%08x (%s)",
> > + "layoutstateid=%d:0x%08x lseg=0x%lx (%s)",
> > MAJOR(__entry->dev), MINOR(__entry->dev),
> > (unsigned long long)__entry->fileid,
> > __entry->fhandle,
> > @@ -1571,6 +1576,7 @@ TRACE_EVENT(pnfs_update_layout,
> > (unsigned long long)__entry->pos,
> > (unsigned long long)__entry->count,
> > __entry->layoutstateid_seq, __entry->layoutstateid_hash,
> > + __entry->lseg,
> > show_pnfs_update_layout_reason(__entry->reason)
> > )
> > );
> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > index bc2c3ec98d32..e97895b427c0 100644
> > --- a/fs/nfs/pnfs.c
> > +++ b/fs/nfs/pnfs.c
> > @@ -796,45 +796,18 @@ pnfs_layoutgets_blocked(const struct pnfs_layout_hdr *lo)
> > test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags);
> > }
> >
> > -int
> > -pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,
> > - const struct pnfs_layout_range *range,
> > - struct nfs4_state *open_state)
> > -{
> > - int status = 0;
> > -
> > - dprintk("--> %s\n", __func__);
> > - spin_lock(&lo->plh_inode->i_lock);
> > - if (pnfs_layoutgets_blocked(lo)) {
> > - status = -EAGAIN;
> > - } else if (!nfs4_valid_open_stateid(open_state)) {
> > - status = -EBADF;
> > - } else if (list_empty(&lo->plh_segs) ||
> > - test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
> > - int seq;
> > -
> > - do {
> > - seq = read_seqbegin(&open_state->seqlock);
> > - nfs4_stateid_copy(dst, &open_state->stateid);
> > - } while (read_seqretry(&open_state->seqlock, seq));
> > - } else
> > - nfs4_stateid_copy(dst, &lo->plh_stateid);
> > - spin_unlock(&lo->plh_inode->i_lock);
> > - dprintk("<-- %s\n", __func__);
> > - return status;
> > -}
> > -
> > /*
> > -* Get layout from server.
> > -* for now, assume that whole file layouts are requested.
> > -* arg->offset: 0
> > -* arg->length: all ones
> > -*/
> > + * Get layout from server.
> > + * for now, assume that whole file layouts are requested.
> > + * arg->offset: 0
> > + * arg->length: all ones
> > + */
> > static struct pnfs_layout_segment *
> > send_layoutget(struct pnfs_layout_hdr *lo,
> > struct nfs_open_context *ctx,
> > + nfs4_stateid *stateid,
> > const struct pnfs_layout_range *range,
> > - gfp_t gfp_flags)
> > + long *timeout, gfp_t gfp_flags)
> > {
> > struct inode *ino = lo->plh_inode;
> > struct nfs_server *server = NFS_SERVER(ino);
> > @@ -868,10 +841,11 @@ send_layoutget(struct pnfs_layout_hdr *lo,
> > lgp->args.type = server->pnfs_curr_ld->id;
> > lgp->args.inode = ino;
> > lgp->args.ctx = get_nfs_open_context(ctx);
> > + nfs4_stateid_copy(&lgp->args.stateid, stateid);
> > lgp->gfp_flags = gfp_flags;
> > lgp->cred = lo->plh_lc_cred;
> >
> > - return nfs4_proc_layoutget(lgp, gfp_flags);
> > + return nfs4_proc_layoutget(lgp, timeout, gfp_flags);
> > }
> >
> > static void pnfs_clear_layoutcommit(struct inode *inode,
> > @@ -1511,27 +1485,30 @@ pnfs_update_layout(struct inode *ino,
> > .offset = pos,
> > .length = count,
> > };
> > - unsigned pg_offset;
> > + unsigned pg_offset, seq;
> > struct nfs_server *server = NFS_SERVER(ino);
> > struct nfs_client *clp = server->nfs_client;
> > - struct pnfs_layout_hdr *lo;
> > + struct pnfs_layout_hdr *lo = NULL;
> > struct pnfs_layout_segment *lseg = NULL;
> > + nfs4_stateid stateid;
> > + long timeout = 0;
> > + unsigned long giveup = jiffies + rpc_get_timeout(server->client);
> > bool first;
> >
> > if (!pnfs_enabled_sb(NFS_SERVER(ino))) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_NO_PNFS);
> > goto out;
> > }
> >
> > if (iomode == IOMODE_READ && i_size_read(ino) == 0) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_RD_ZEROLEN);
> > goto out;
> > }
> >
> > if (pnfs_within_mdsthreshold(ctx, ino, iomode)) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_MDSTHRESH);
> > goto out;
> > }
> > @@ -1542,14 +1519,14 @@ lookup_again:
> > lo = pnfs_find_alloc_layout(ino, ctx, gfp_flags);
> > if (lo == NULL) {
> > spin_unlock(&ino->i_lock);
> > - trace_pnfs_update_layout(ino, pos, count, iomode, NULL,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_NOMEM);
> > goto out;
> > }
> >
> > /* Do we even need to bother with this? */
> > if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_BULK_RECALL);
> > dprintk("%s matches recall, use MDS\n", __func__);
> > goto out_unlock;
> > @@ -1557,14 +1534,34 @@ lookup_again:
> >
> > /* if LAYOUTGET already failed once we don't try again */
> > if (pnfs_layout_io_test_failed(lo, iomode)) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_IO_TEST_FAIL);
> > goto out_unlock;
> > }
> >
> > - first = list_empty(&lo->plh_segs);
> > - if (first) {
> > - /* The first layoutget for the file. Need to serialize per
> > + lseg = pnfs_find_lseg(lo, &arg);
> > + if (lseg) {
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > + PNFS_UPDATE_LAYOUT_FOUND_CACHED);
> > + goto out_unlock;
> > + }
> > +
> > + if (!nfs4_valid_open_stateid(ctx->state)) {
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > + PNFS_UPDATE_LAYOUT_INVALID_OPEN);
> > + goto out_unlock;
> > + }
> > +
> > + /*
> > + * Choose a stateid for the LAYOUTGET. If we don't have a layout
> > + * stateid, or it has been invalidated, then we must use the open
> > + * stateid.
> > + */
> > + if (lo->plh_stateid.seqid == 0 ||
> > + test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) {
> > +
> > + /*
> > + * The first layoutget for the file. Need to serialize per
> > * RFC 5661 Errata 3208.
> > */
> > if (test_and_set_bit(NFS_LAYOUT_FIRST_LAYOUTGET,
> > @@ -1573,18 +1570,17 @@ lookup_again:
> > wait_on_bit(&lo->plh_flags, NFS_LAYOUT_FIRST_LAYOUTGET,
> > TASK_UNINTERRUPTIBLE);
> > pnfs_put_layout_hdr(lo);
> > + dprintk("%s retrying\n", __func__);
> > goto lookup_again;
> > }
> > +
> > + first = true;
> > + do {
> > + seq = read_seqbegin(&ctx->state->seqlock);
> > + nfs4_stateid_copy(&stateid, &ctx->state->stateid);
> > + } while (read_seqretry(&ctx->state->seqlock, seq));
> > } else {
> > - /* Check to see if the layout for the given range
> > - * already exists
> > - */
> > - lseg = pnfs_find_lseg(lo, &arg);
> > - if (lseg) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > - PNFS_UPDATE_LAYOUT_FOUND_CACHED);
> > - goto out_unlock;
> > - }
> > + nfs4_stateid_copy(&stateid, &lo->plh_stateid);
> > }
> >
> > /*
> > @@ -1599,15 +1595,17 @@ lookup_again:
> > pnfs_clear_first_layoutget(lo);
> > pnfs_put_layout_hdr(lo);
> > dprintk("%s retrying\n", __func__);
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > + lseg, PNFS_UPDATE_LAYOUT_RETRY);
> > goto lookup_again;
> > }
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_RETURN);
> > goto out_put_layout_hdr;
> > }
> >
> > if (pnfs_layoutgets_blocked(lo)) {
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > PNFS_UPDATE_LAYOUT_BLOCKED);
> > goto out_unlock;
> > }
> > @@ -1632,26 +1630,36 @@ lookup_again:
> > if (arg.length != NFS4_MAX_UINT64)
> > arg.length = PAGE_ALIGN(arg.length);
> >
> > - lseg = send_layoutget(lo, ctx, &arg, gfp_flags);
> > + lseg = send_layoutget(lo, ctx, &stateid, &arg, &timeout, gfp_flags);
> > + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg,
> > + PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
> > if (IS_ERR(lseg)) {
> > - if (lseg == ERR_PTR(-EAGAIN)) {
> > + switch(PTR_ERR(lseg)) {
> > + case -ERECALLCONFLICT:
> > + if (time_after(jiffies, giveup))
> > + lseg = NULL;
> > + /* Fallthrough */
> > + case -EAGAIN:
> > + pnfs_put_layout_hdr(lo);
> > if (first)
> > pnfs_clear_first_layoutget(lo);
> > - pnfs_put_layout_hdr(lo);
> > - goto lookup_again;
> > - }
> > -
> > - if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> > - pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> > - lseg = NULL;
> > + if (lseg) {
> > + trace_pnfs_update_layout(ino, pos, count,
> > + iomode, lo, lseg, PNFS_UPDATE_LAYOUT_RETRY);
> > + goto lookup_again;
> > + }
> > + /* Fallthrough */
> > + default:
> > + if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> > + pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> > + lseg = NULL;
> > + }
> Seems like in the past, a non-fatal-error used to trigger the opposite
> behavior, where this would set the fail_bit? Shouldn't that be the
> behavior for -NFS4ERR_LAYOUTUNAVAILABLE (which is mapped to -ENODATA)
> etc...?
>
Yes, and I think that was a bug. See commit
d03ab29dbbe905c6c7c5abd78e02547fa954ec07 for where that actually
changed.
You do have a good point about LAYOUTUNAVAILABLE though. That probably
should stop further attempts to get a layout. That said, the error
mapping here is fiendishly complex. I always have to wonder whether
there other errors that get turned into ENODATA? It may be safest to
change the error mapping there to something more specific.
> >
> > }
> > } else {
> > pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> > }
> >
> > atomic_dec(&lo->plh_outstanding);
> > - trace_pnfs_update_layout(ino, pos, count, iomode, lo,
> > - PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET);
> > out_put_layout_hdr:
> > if (first)
> > pnfs_clear_first_layoutget(lo);
> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > index 971068b58647..f9f3331bef49 100644
> > --- a/fs/nfs/pnfs.h
> > +++ b/fs/nfs/pnfs.h
> > @@ -228,7 +228,7 @@ extern void pnfs_unregister_layoutdriver(struct pnfs_layoutdriver_type *);
> > extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
> > struct pnfs_device *dev,
> > struct rpc_cred *cred);
> > -extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, gfp_t gfp_flags);
> > +extern struct pnfs_layout_segment* nfs4_proc_layoutget(struct nfs4_layoutget *lgp, long *timeout, gfp_t gfp_flags);
> > extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp, bool sync);
> >
> > /* pnfs.c */
> > @@ -260,10 +260,6 @@ void pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo);
> > void pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo,
> > const nfs4_stateid *new,
> > bool update_barrier);
> > -int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
> > - struct pnfs_layout_hdr *lo,
> > - const struct pnfs_layout_range *range,
> > - struct nfs4_state *open_state);
> > int pnfs_mark_matching_lsegs_invalid(struct pnfs_layout_hdr *lo,
> > struct list_head *tmp_list,
> > const struct pnfs_layout_range *recall_range,
> > diff --git a/include/linux/errno.h b/include/linux/errno.h
> > index 89627b9187f9..7ce9fb1b7d28 100644
> > --- a/include/linux/errno.h
> > +++ b/include/linux/errno.h
> > @@ -28,5 +28,6 @@
> > #define EBADTYPE 527 /* Type not supported by server */
> > #define EJUKEBOX 528 /* Request initiated, but will not complete before timeout */
> > #define EIOCBQUEUED 529 /* iocb queued, will get completion event */
> > +#define ERECALLCONFLICT 530 /* conflict with recalled state */
> >
> > #endif
> > diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> > index 011433478a14..f4870a330290 100644
> > --- a/include/linux/nfs4.h
> > +++ b/include/linux/nfs4.h
> > @@ -621,7 +621,9 @@ enum pnfs_update_layout_reason {
> > PNFS_UPDATE_LAYOUT_IO_TEST_FAIL,
> > PNFS_UPDATE_LAYOUT_FOUND_CACHED,
> > PNFS_UPDATE_LAYOUT_RETURN,
> > + PNFS_UPDATE_LAYOUT_RETRY,
> > PNFS_UPDATE_LAYOUT_BLOCKED,
> > + PNFS_UPDATE_LAYOUT_INVALID_OPEN,
> > PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET,
> > };
> >
> > diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> > index cb9982d8f38f..a4cb8a33ae2c 100644
> > --- a/include/linux/nfs_xdr.h
> > +++ b/include/linux/nfs_xdr.h
> > @@ -233,7 +233,6 @@ struct nfs4_layoutget_args {
> > struct inode *inode;
> > struct nfs_open_context *ctx;
> > nfs4_stateid stateid;
> > - unsigned long timestamp;
> > struct nfs4_layoutdriver_data layout;
> > };
> >
> > @@ -251,7 +250,6 @@ struct nfs4_layoutget {
> > struct nfs4_layoutget_res res;
> > struct rpc_cred *cred;
> > gfp_t gfp_flags;
> > - long timeout;
> > };
> >
> > struct nfs4_getdeviceinfo_args {
--
Jeff Layton <[email protected]>
>> > + default:
>> > + if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>> > + pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > + lseg = NULL;
>> > + }
>> Seems like in the past, a non-fatal-error used to trigger the opposite
>> behavior, where this would set the fail_bit? Shouldn't that be the
>> behavior for -NFS4ERR_LAYOUTUNAVAILABLE (which is mapped to -ENODATA)
>> etc...?
>>
>
> Yes, and I think that was a bug. See commit
> d03ab29dbbe905c6c7c5abd78e02547fa954ec07 for where that actually
> changed.
I think you mean:
0bcbf039f6b2bcefe4f5dada76079080edf9ecd0
?
...and I was looking at this one:
d600ad1f2bdbf97c4818dcc85b174f72c90c21bd
> You do have a good point about LAYOUTUNAVAILABLE though. That probably
> should stop further attempts to get a layout. That said, the error
> mapping here is fiendishly complex. I always have to wonder whether
> there other errors that get turned into ENODATA? It may be safest to
> change the error mapping there to something more specific.
If setting the fail_bit is the right way to go, that could be done
with less confusion in nfs4_layoutget_handle_exception()...?
--
Andrew W. Elble
[email protected]
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912
> On Jun 28, 2016, at 08:53, Andrew W Elble <[email protected]> wrote:
>
>>>> + default:
>>>> + if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>>>> + pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>>>> + lseg = NULL;
>>>> + }
>>> Seems like in the past, a non-fatal-error used to trigger the opposite
>>> behavior, where this would set the fail_bit? Shouldn't that be the
>>> behavior for -NFS4ERR_LAYOUTUNAVAILABLE (which is mapped to -ENODATA)
>>> etc...?
>>>
>>
>> Yes, and I think that was a bug. See commit
>> d03ab29dbbe905c6c7c5abd78e02547fa954ec07 for where that actually
>> changed.
>
> I think you mean:
> 0bcbf039f6b2bcefe4f5dada76079080edf9ecd0
>
> ?
>
> ...and I was looking at this one:
> d600ad1f2bdbf97c4818dcc85b174f72c90c21bd
>
>> You do have a good point about LAYOUTUNAVAILABLE though. That probably
>> should stop further attempts to get a layout. That said, the error
>> mapping here is fiendishly complex. I always have to wonder whether
>> there other errors that get turned into ENODATA? It may be safest to
>> change the error mapping there to something more specific.
>
> If setting the fail_bit is the right way to go, that could be done
> with less confusion in nfs4_layoutget_handle_exception()??
It?s not.
Cheers
Trond
On Tue, 2016-06-28 at 08:53 -0400, Andrew W Elble wrote:
> >
> > >
> > > >
> > > > + default:
> > > > + if
> > > > (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> > > > + pnfs_layout_clear_fail_bit(lo,
> > > > pnfs_iomode_to_fail_bit(iomode));
> > > > + lseg = NULL;
> > > > + }
> > > Seems like in the past, a non-fatal-error used to trigger the
> > > opposite
> > > behavior, where this would set the fail_bit? Shouldn't that be
> > > the
> > > behavior for -NFS4ERR_LAYOUTUNAVAILABLE (which is mapped to
> > > -ENODATA)
> > > etc...?
> > >
> > Yes, and I think that was a bug. See commit
> > d03ab29dbbe905c6c7c5abd78e02547fa954ec07 for where that actually
> > changed.
> I think you mean:
> 0bcbf039f6b2bcefe4f5dada76079080edf9ecd0
>
> ?
>
> ...and I was looking at this one:
> d600ad1f2bdbf97c4818dcc85b174f72c90c21bd
>
No, d03ab29dbbe905c6c7c5abd78e02547fa954ec07 is where that bug was
fixed. Basically, we were clearing the fail bit on fatal errors, when
what we really wanted to do was clear it on non-fatal errors.
> >
> > You do have a good point about LAYOUTUNAVAILABLE though. That
> > probably
> > should stop further attempts to get a layout. That said, the error
> > mapping here is fiendishly complex. I always have to wonder whether
> > there other errors that get turned into ENODATA? It may be safest
> > to
> > change the error mapping there to something more specific.
> If setting the fail_bit is the right way to go, that could be done
> with less confusion in nfs4_layoutget_handle_exception()...?
>
Hmm...Trond's response just says "it's not", which I'm not sure how to
interpret here. Trond, do you mean that we should be retrying on
LAYOUTUNAVAILABLE, or that we should be indicating that using some
means other than the fail bit?
--
Jeff Layton <[email protected]>