2010-12-15 18:29:15

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 0/9] pnfs post wave2 changes

Here are the list of patches I'd like to reintroduce into the post-submit pnfs
branch, after Fred's wave2 patches to bring layoutreturn and layoutcommit
and "forgetless" cb_layoutrecall functionality back into the pnfs branch:

[PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"
[PATCH 2/9] Revert "pnfs-submit: Turn off layoutcommits"
[PATCH 3/9] Revert "pnfs-submit: wave2: remove all LAYOUTRETURN code"
[PATCH 4/9] Revert "pnfs-submit: wave2: Remove LAYOUTRETURN from return on close"
[PATCH 5/9] FIXME: roc should return layout on last close
[PATCH 6/9] Revert "pnfs-submit: wave2: remove cl_layoutrecalls list"
[PATCH 7/9] Revert "pnfs-submit: wave2: Pull out all recall initiated LAYOUTRETURNS"
[PATCH 8/9] Revert "pnfs-submit: wave2: Don't wait in layoutget"
[PATCH 9/9] Revert "pnfs-submit: wave2: check that partial LAYOUTGET return is ignored"

Benny


2010-12-15 18:31:12

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 2/9] Revert "pnfs-submit: Turn off layoutcommits"

This reverts commit 1120c1ee8209679aa187248917875665dd8521a0.
---
fs/nfs/nfs4proc.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 68111ef..176449f 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -3098,6 +3098,7 @@ static void pnfs4_update_write_done(struct nfs_inode *nfsi, struct nfs_write_dat
{
#ifdef CONFIG_NFS_V4_1
pnfs_update_last_write(nfsi, data->args.offset, data->res.count);
+ pnfs_need_layoutcommit(nfsi, data->args.context);
#endif /* CONFIG_NFS_V4_1 */
}

--
1.7.2.3


2010-12-16 07:26:58

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-15 22:24, Trond Myklebust wrote:
> On Wed, 2010-12-15 at 14:31 -0500, Trond Myklebust wrote:
>> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>
>>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>>> resulting stateid after NOMATCHING_LAYOUT].
>>> Is this the serialization "crap" you're talking about?
>>> What makes checking the conditions for returning NFS4ERR_DELAY to
>>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>>> returns asynchronously with the CB_LAYOUTRECALL?
>>
>> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
>> at all times." (section 12.5.3).
>>
>> In other words, you cannot just 'do the returns asynchronously': the
>> CB_LAYOUTRECALL requests are required by the protocol to be processed in
>> order, which means that you must serialise those LAYOUTRETURN calls to
>> ensure that they all happen in the order the wretched server expects.
>
> BTW: one consequence of the way the protocol was written is that you
> can't just throw out a LAYOUTRETURN for the entire file if the server
> just recalls a segment. Instead, you have to first return the segment,
> then send the LAYOUTRETURN for the entire file.
>

It is true that the protocol requires the return of the exact recalled range
but why can't the client do return the whole file before returning the recalled
range?

> That part of the protocol is just one insane idea after another...
>

This was done to ensure that the server and client are in-sync after a
CB_LAYOUTRECALL. I agree that returning the whole layout thus resetting
the layout state achieves the same goal and we should consider allowing it
in the next version.

Benny

2010-12-15 18:31:49

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 5/9] FIXME: roc should return layout on last close

rather than on any close.
---
fs/nfs/nfs4state.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 6a1eb41..dc62928 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -622,6 +622,7 @@ static void __nfs4_close(struct path *path, struct nfs4_state *state,
u32 roc_iomode;
struct nfs_inode *nfsi = NFS_I(state->inode);

+ /* FIXME: should return the layout only on last close */
if (has_layout(nfsi) &&
(roc_iomode = pnfs_layout_roc_iomode(nfsi)) != 0) {
struct pnfs_layout_range range = {
--
1.7.2.3


2010-12-15 20:24:21

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Wed, 2010-12-15 at 14:31 -0500, Trond Myklebust wrote:
> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:

> > Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
> > or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
> > for pnfs-obj) should be equivalent [note: need errata to clarify the
> > resulting stateid after NOMATCHING_LAYOUT].
> > Is this the serialization "crap" you're talking about?
> > What makes checking the conditions for returning NFS4ERR_DELAY to
> > CB_LAYOUTRECALL so different from implementing a barrier and doing the
> > returns asynchronously with the CB_LAYOUTRECALL?
>
> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
> at all times." (section 12.5.3).
>
> In other words, you cannot just 'do the returns asynchronously': the
> CB_LAYOUTRECALL requests are required by the protocol to be processed in
> order, which means that you must serialise those LAYOUTRETURN calls to
> ensure that they all happen in the order the wretched server expects.

BTW: one consequence of the way the protocol was written is that you
can't just throw out a LAYOUTRETURN for the entire file if the server
just recalls a segment. Instead, you have to first return the segment,
then send the LAYOUTRETURN for the entire file.

That part of the protocol is just one insane idea after another...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-15 18:32:55

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
> ---
> include/linux/nfs4.h | 1 +
> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
> 2 files changed, 24 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 8ca7700..55511e8 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -557,6 +557,7 @@ enum {
> NFSPROC4_CLNT_RECLAIM_COMPLETE,
> NFSPROC4_CLNT_LAYOUTGET,
> NFSPROC4_CLNT_LAYOUTCOMMIT,
> + NFSPROC4_CLNT_LAYOUTRETURN,
> NFSPROC4_CLNT_GETDEVICEINFO,
> NFSPROC4_CLNT_PNFS_WRITE,
> NFSPROC4_CLNT_PNFS_COMMIT,
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index 9d847ac..a651574 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
> int status;
> };
>
> +struct nfs4_layoutreturn_args {
> + __u32 reclaim;
> + __u32 layout_type;
> + __u32 return_type;
> + struct pnfs_layout_range range;
> + struct inode *inode;
> + struct nfs4_sequence_args seq_args;
> +};
> +
> +struct nfs4_layoutreturn_res {
> + struct nfs4_sequence_res seq_res;
> + u32 lrs_present;
> + nfs4_stateid stateid;
> +};
> +
> +struct nfs4_layoutreturn {
> + struct nfs4_layoutreturn_args args;
> + struct nfs4_layoutreturn_res res;
> + struct rpc_cred *cred;
> + struct nfs_client *clp;
> + int rpc_status;
> +};
> +
> struct nfs4_getdeviceinfo_args {
> struct pnfs_device *pdev;
> struct nfs4_sequence_args seq_args;

Why? We don't need or even want layoutreturn. It adds too much
serialisation crap.


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-16 07:15:34

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-15 21:31, Trond Myklebust wrote:
> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>> On 2010-12-15 20:32, Trond Myklebust wrote:
>>> On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
>>>> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
>>>> ---
>>>> include/linux/nfs4.h | 1 +
>>>> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
>>>> 2 files changed, 24 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
>>>> index 8ca7700..55511e8 100644
>>>> --- a/include/linux/nfs4.h
>>>> +++ b/include/linux/nfs4.h
>>>> @@ -557,6 +557,7 @@ enum {
>>>> NFSPROC4_CLNT_RECLAIM_COMPLETE,
>>>> NFSPROC4_CLNT_LAYOUTGET,
>>>> NFSPROC4_CLNT_LAYOUTCOMMIT,
>>>> + NFSPROC4_CLNT_LAYOUTRETURN,
>>>> NFSPROC4_CLNT_GETDEVICEINFO,
>>>> NFSPROC4_CLNT_PNFS_WRITE,
>>>> NFSPROC4_CLNT_PNFS_COMMIT,
>>>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>>>> index 9d847ac..a651574 100644
>>>> --- a/include/linux/nfs_xdr.h
>>>> +++ b/include/linux/nfs_xdr.h
>>>> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
>>>> int status;
>>>> };
>>>>
>>>> +struct nfs4_layoutreturn_args {
>>>> + __u32 reclaim;
>>>> + __u32 layout_type;
>>>> + __u32 return_type;
>>>> + struct pnfs_layout_range range;
>>>> + struct inode *inode;
>>>> + struct nfs4_sequence_args seq_args;
>>>> +};
>>>> +
>>>> +struct nfs4_layoutreturn_res {
>>>> + struct nfs4_sequence_res seq_res;
>>>> + u32 lrs_present;
>>>> + nfs4_stateid stateid;
>>>> +};
>>>> +
>>>> +struct nfs4_layoutreturn {
>>>> + struct nfs4_layoutreturn_args args;
>>>> + struct nfs4_layoutreturn_res res;
>>>> + struct rpc_cred *cred;
>>>> + struct nfs_client *clp;
>>>> + int rpc_status;
>>>> +};
>>>> +
>>>> struct nfs4_getdeviceinfo_args {
>>>> struct pnfs_device *pdev;
>>>> struct nfs4_sequence_args seq_args;
>>>
>>> Why? We don't need or even want layoutreturn. It adds too much
>>> serialisation crap.
>>
>> Define "we" :)
>
> Definition: "We who will be forced to maintain whatever is merged
> upstream."
>
>> First, the object layout driver relies on layout return for returning I/O error
>> information. On the common, successful path, with return_on_close (that Panasas
>> uses but others may not) I agree that CLOSE with the implicit layoutreturn
>> semantics we discussed should do a good job too (accompanied with a respective
>> LAYOUTCOMMIT if needed).
>>
>> Then, when there's a large number of outstanding layout segments (again
>> prob. non-files layout presuming server implementations are going to utilize
>> whole-file layouts) proactive layoutreturn comes handy in capping the
>> state both the client and server keep - reducing time wasted on walking long
>> lists of items.
>
> That assumes that you have a good policy for implementing a 'proactive
> layoutreturn'. What knowledge does either the client or the server have
> w.r.t. whether or not part of a layout is likely to be used in the near
> future other than 'file is open' or 'file is closed'?
>

The client can cache layout segments using a least recently used policy.

> What is the advantage to the client w.r.t. sending LAYOUTRETURN rather
> than just forgetting the layout or layout segment? If the server needs
> it returned, it can send a recall. If not, we are wasting processing
> time by sending an unnecessary RPC call.
>

The client can know better than the server which layout segments is is more
likely to reuse since the MDS does not see the layout usage activity
(as it goes to the DS's).

Similarly, for CB_RECALL_ANY, the client chooses what layouts to return.
Rather than dropping all the layouts it should return only the least likely
to be reused.

>> For CB_LAYOUTRECALL response the heart of the debate is around synchronizing
>> with layouts in-use and in-flight layoutgets. There, having the server poll
>> the client, who's retuning NFS4ERR_DELAY should essentially work but may be
>> inefficient and unreliable in use cases where contention is likely enough.
>
> Define these use cases. Otherwise we're just talking generalities and
> presenting circular arguments again.
>

The most common for Panasas is write sharing where multiple clients
collaboratively write into a file in parallel. Although different clients
write into disjoint byte ranges in the file they cross RAID stripe boundaries
(which they're not aware of) and since only one client is allowed to write
into a RAID stripe at a time, the layout is being recalled whenever the
server detects a conflict.

>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>> resulting stateid after NOMATCHING_LAYOUT].
>> Is this the serialization "crap" you're talking about?
>> What makes checking the conditions for returning NFS4ERR_DELAY to
>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>> returns asynchronously with the CB_LAYOUTRECALL?
>
> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
> at all times." (section 12.5.3).
>
> In other words, you cannot just 'do the returns asynchronously': the
> CB_LAYOUTRECALL requests are required by the protocol to be processed in
> order, which means that you must serialise those LAYOUTRETURN calls to
> ensure that they all happen in the order the wretched server expects.
>
>

To simplify this (presumably rare) case what I had in mind is returning
NFS4ERR_DELAY if there's a conflicting layout recall in progress.

Benny

2010-12-15 18:31:37

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 4/9] Revert "pnfs-submit: wave2: Remove LAYOUTRETURN from return on close"

This reverts commit 5611feb54b4c0225f0924c6b059f281d45915b24.
---
fs/nfs/nfs4_fs.h | 2 +-
fs/nfs/nfs4proc.c | 16 ++-------
fs/nfs/nfs4state.c | 16 +++++++-
fs/nfs/pnfs.c | 103 +++++-----------------------------------------------
fs/nfs/pnfs.h | 38 ++++++-------------
5 files changed, 39 insertions(+), 136 deletions(-)

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index d58a130..a917872 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -242,7 +242,7 @@ extern int nfs4_proc_async_renew(struct nfs_client *, struct rpc_cred *);
extern int nfs4_proc_renew(struct nfs_client *, struct rpc_cred *);
extern int nfs4_init_clientid(struct nfs_client *, struct rpc_cred *);
extern int nfs41_init_clientid(struct nfs_client *, struct rpc_cred *);
-extern int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, int wait, bool roc);
+extern int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, int wait);
extern int nfs4_server_capabilities(struct nfs_server *server, struct nfs_fh *fhandle);
extern int nfs4_proc_fs_locations(struct inode *dir, const struct qstr *name,
struct nfs4_fs_locations *fs_locations, struct page *page);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 35af296..5331f28 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1841,8 +1841,6 @@ struct nfs4_closedata {
struct nfs_closeres res;
struct nfs_fattr fattr;
unsigned long timestamp;
- bool roc;
- u32 roc_barrier;
};

static void nfs4_free_closedata(void *data)
@@ -1850,7 +1848,6 @@ static void nfs4_free_closedata(void *data)
struct nfs4_closedata *calldata = data;
struct nfs4_state_owner *sp = calldata->state->owner;

- pnfs_roc_release(calldata->roc, calldata->state->inode);
nfs4_put_open_state(calldata->state);
nfs_free_seqid(calldata->arg.seqid);
nfs4_put_state_owner(sp);
@@ -1883,8 +1880,6 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
*/
switch (task->tk_status) {
case 0:
- pnfs_roc_set_barrier(calldata->roc, state->inode,
- calldata->roc_barrier);
nfs_set_open_stateid(state, &calldata->res.stateid, 0);
renew_lease(server, calldata->timestamp);
nfs4_close_clear_stateid_flags(state,
@@ -1937,11 +1932,8 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
return;
}

- if (calldata->arg.fmode == 0) {
+ if (calldata->arg.fmode == 0)
task->tk_msg.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_CLOSE];
- pnfs_roc_drain(calldata->roc, state->inode,
- &calldata->roc_barrier, task);
- }

nfs_fattr_init(calldata->res.fattr);
calldata->timestamp = jiffies;
@@ -1969,7 +1961,7 @@ static const struct rpc_call_ops nfs4_close_ops = {
*
* NOTE: Caller must be holding the sp->so_owner semaphore!
*/
-int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, int wait, bool roc)
+int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, int wait)
{
struct nfs_server *server = NFS_SERVER(state->inode);
struct nfs4_closedata *calldata;
@@ -2004,7 +1996,6 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
calldata->res.fattr = &calldata->fattr;
calldata->res.seqid = calldata->arg.seqid;
calldata->res.server = server;
- calldata->roc = roc;
path_get(path);
calldata->path = *path;

@@ -2022,7 +2013,6 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
out_free_calldata:
kfree(calldata);
out:
- pnfs_roc_release(roc, state->inode);
nfs4_put_open_state(state);
nfs4_put_state_owner(sp);
return status;
@@ -5619,7 +5609,7 @@ static void nfs4_layoutreturn_release(void *calldata)
lo->plh_block_lgets--;
atomic_dec(&lo->plh_outstanding);
spin_unlock(&ino->i_lock);
- put_layout_hdr(ino);
+ put_layout_hdr(lo);
}
kfree(calldata);
dprintk("<-- %s\n", __func__);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index bca8386..6a1eb41 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -619,9 +619,21 @@ static void __nfs4_close(struct path *path, struct nfs4_state *state,
nfs4_put_open_state(state);
nfs4_put_state_owner(owner);
} else {
- bool roc = pnfs_roc(state->inode);
+ u32 roc_iomode;
+ struct nfs_inode *nfsi = NFS_I(state->inode);
+
+ if (has_layout(nfsi) &&
+ (roc_iomode = pnfs_layout_roc_iomode(nfsi)) != 0) {
+ struct pnfs_layout_range range = {
+ .iomode = roc_iomode,
+ .offset = 0,
+ .length = NFS4_MAX_UINT64,
+ };
+
+ pnfs_return_layout(state->inode, &range, wait);
+ }

- nfs4_do_close(path, state, gfp_mask, wait, roc);
+ nfs4_do_close(path, state, gfp_mask, wait);
}
}

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 904c110..1899dc6 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -465,12 +465,9 @@ pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo, const nfs4_stateid *new,
newseq = be32_to_cpu(new->stateid.seqid);
if ((int)(newseq - oldseq) > 0) {
memcpy(&lo->stateid, &new->stateid, sizeof(new->stateid));
- if (update_barrier) {
- u32 new_barrier = be32_to_cpu(new->stateid.seqid);
-
- if ((int)(new_barrier - lo->plh_barrier))
- lo->plh_barrier = new_barrier;
- } else {
+ if (update_barrier)
+ lo->plh_barrier = be32_to_cpu(new->stateid.seqid);
+ else {
/* Because of wraparound, we want to keep the barrier
* "close" to the current seqids. It needs to be
* within 2**31 to count as "behind", so if it
@@ -618,7 +615,7 @@ return_layout(struct inode *ino, struct pnfs_layout_range *range, bool wait)

lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
if (lrp == NULL) {
- put_layout_hdr(ino);
+ put_layout_hdr(NFS_I(ino)->layout);
goto out;
}
lrp->args.reclaim = 0;
@@ -679,91 +676,6 @@ out:
return status;
}

-bool pnfs_roc(struct inode *ino)
-{
- struct pnfs_layout_hdr *lo;
- struct pnfs_layout_segment *lseg, *tmp;
- LIST_HEAD(tmp_list);
- bool found = false;
-
- spin_lock(&ino->i_lock);
- lo = NFS_I(ino)->layout;
- if (!lo || !test_and_clear_bit(NFS_LAYOUT_ROC, &lo->plh_flags) ||
- test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags))
- goto out_nolayout;
- list_for_each_entry_safe(lseg, tmp, &lo->segs, fi_list)
- if (test_bit(NFS_LSEG_ROC, &lseg->pls_flags)) {
- mark_lseg_invalid(lseg, &tmp_list);
- found = true;
- }
- if (!found)
- goto out_nolayout;
- lo->plh_block_lgets++;
- get_layout_hdr(lo); /* matched in pnfs_roc_release */
- spin_unlock(&ino->i_lock);
- pnfs_free_lseg_list(&tmp_list);
- return true;
-
-out_nolayout:
- spin_unlock(&ino->i_lock);
- return false;
-}
-
-void pnfs_roc_release(bool needed, struct inode *ino)
-{
- if (needed) {
- struct pnfs_layout_hdr *lo;
-
- spin_lock(&ino->i_lock);
- lo = NFS_I(ino)->layout;
- lo->plh_block_lgets--;
- put_layout_hdr_locked(lo);
- spin_unlock(&ino->i_lock);
- }
-}
-
-void pnfs_roc_set_barrier(bool needed, struct inode *ino, u32 barrier)
-{
- if (needed) {
- struct pnfs_layout_hdr *lo;
-
- spin_lock(&ino->i_lock);
- lo = NFS_I(ino)->layout;
- if ((int)(barrier - lo->plh_barrier) > 0)
- lo->plh_barrier = barrier;
- spin_unlock(&ino->i_lock);
- }
-}
-
-void pnfs_roc_drain(bool needed, struct inode *ino, u32 *barrier,
- struct rpc_task *task)
-{
- struct nfs_inode *nfsi = NFS_I(ino);
- struct pnfs_layout_segment *lseg;
- bool found = false;
-
- if (!needed)
- return;
- spin_lock(&ino->i_lock);
- list_for_each_entry(lseg, &nfsi->layout->segs, fi_list)
- if (test_bit(NFS_LSEG_ROC, &lseg->pls_flags)) {
- rpc_sleep_on(&NFS_I(ino)->lo_rpcwaitq, task, NULL);
- found = true;
- break;
- }
- if (!found) {
- struct pnfs_layout_hdr *lo = nfsi->layout;
- u32 current_seqid = be32_to_cpu(lo->stateid.stateid.seqid);
-
- /* Since close does not return a layout stateid for use as
- * a barrier, we choose the worst-case barrier.
- */
- *barrier = current_seqid + atomic_read(&lo->plh_outstanding);
- }
- spin_unlock(&ino->i_lock);
- return;
-}
-
/*
* Compare two layout segments for sorting into layout cache.
* We want to preferentially return RW over RO layouts, so ensure those
@@ -1032,8 +944,11 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
pnfs_insert_layout(lo, lseg);

if (res->return_on_close) {
- set_bit(NFS_LSEG_ROC, &lseg->pls_flags);
- set_bit(NFS_LAYOUT_ROC, &lo->plh_flags);
+ /* FI: This needs to be re-examined. At lo level,
+ * all it needs is a bit indicating whether any of
+ * the lsegs in the list have the flags set.
+ */
+ lo->roc_iomode |= res->range.iomode;
}

/* Done processing layoutget. Set the layout stateid */
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index bbafee0..59eb0e8 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -34,7 +34,6 @@

enum {
NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
- NFS_LSEG_ROC, /* roc bit received from server */
};

struct pnfs_layout_segment {
@@ -60,7 +59,6 @@ enum {
NFS_LAYOUT_RW_FAILED, /* get rw layout failed stop trying */
NFS_LAYOUT_BULK_RECALL, /* bulk recall affecting layout */
NFS_LAYOUT_NEED_LCOMMIT, /* LAYOUTCOMMIT needed */
- NFS_LAYOUT_ROC, /* some lseg had roc bit set */
};

/* Per-layout driver specific registration structure */
@@ -103,6 +101,7 @@ struct pnfs_layout_hdr {
struct list_head layouts; /* other client layouts */
struct list_head plh_bulk_recall; /* clnt list of bulk recalls */
struct list_head segs; /* layout segments list */
+ int roc_iomode;/* return on close iomode, 0=none */
nfs4_stateid stateid;
atomic_t plh_outstanding; /* number of RPCs out */
unsigned long plh_block_lgets; /* block LAYOUTGET if >0 */
@@ -223,11 +222,6 @@ int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
void nfs4_asynch_forget_layouts(struct pnfs_layout_hdr *lo,
struct pnfs_layout_range *range,
struct list_head *tmp_list);
-bool pnfs_roc(struct inode *ino);
-void pnfs_roc_release(bool needed, struct inode *ino);
-void pnfs_roc_set_barrier(bool needed, struct inode *ino, u32 barrier);
-void pnfs_roc_drain(bool needed, struct inode *ino, u32 *barrier,
- struct rpc_task *task);

static inline bool
has_layout(struct nfs_inode *nfsi)
@@ -253,6 +247,14 @@ static inline int pnfs_enabled_sb(struct nfs_server *nfss)
return nfss->pnfs_curr_ld != NULL;
}

+/* Should the pNFS client commit and return the layout on close
+ */
+static inline int
+pnfs_layout_roc_iomode(struct nfs_inode *nfsi)
+{
+ return nfsi->layout->roc_iomode;
+}
+
static inline int pnfs_return_layout(struct inode *ino,
struct pnfs_layout_range *range,
bool wait)
@@ -336,26 +338,10 @@ static inline int pnfs_layoutcommit_inode(struct inode *inode, int sync)
return 0;
}

-static inline bool
-pnfs_roc(struct inode *ino)
-{
- return false;
-}
-
-static inline void
-pnfs_roc_release(bool needed, struct inode *ino)
-{
-}
-
-static inline void
-pnfs_roc_set_barrier(bool needed, struct inode *ino, u32 barrier)
-{
-}
-
-static inline void
-pnfs_roc_drain(bool needed, struct inode *ino, u32 *barrier,
- struct rpc_task *task)
+static inline int
+pnfs_layout_roc_iomode(struct nfs_inode *nfsi)
{
+ return 0;
}

static inline int pnfs_return_layout(struct inode *ino,
--
1.7.2.3


2010-12-18 03:45:16

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-16 20:14, Trond Myklebust wrote:
> On Thu, 2010-12-16 at 19:42 +0200, Benny Halevy wrote:
>> On 2010-12-16 19:35, Trond Myklebust wrote:
>>> On Thu, 2010-12-16 at 18:24 +0200, Benny Halevy wrote:
>>>> On 2010-12-16 17:55, Trond Myklebust wrote:
>>>>> OK, so why not just go the whole hog and do that for all rare cases,
>>>>> including the one where the server recalls a layout segment that we
>>>>> happen to be doing I/O to?
>>>>>
>>>>> The case we should be optimising for is the one where the layout is
>>>>> recalled, and no I/O to that segment is in progress. For that case,
>>>>> returning OK, then doing the LAYOUTRETURN instead of just returning
>>>>> NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
>>>>> round trip to the server. Agreed?
>>>>
>>>> I agree that if the client can free the recalled layout synchronously
>>>> and if it need not send a LAYOUTCOMMIT or LAYOUTRETURN (e.g. in the objects case)
>>>> it can simply return NFS4ERR_NOMATCHING_LAYOUT.
>>>
>>> Objects and blocks != wave 2. We can cross that bridge when we get to
>>> it.
>>>
>>
>> Right. This patchset is destined as post wave2.
>
> In that case it has a very confusing title (which certainly caught me by
> surprise).
>
>>
>>>>>
>>>>> As for the much rarer case of a recall of a layout that is in use, how
>>>>> does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
>>>>> going to return NFS4ERR_DELAY to the client that requested the
>>>>> conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
>>>>> request, at a time when the first client may or may not have returned
>>>>> its layout segment. So how is LAYOUTRETURN going to make all this a fast
>>>>> and scalable process?
>>>>>
>>>>
>>>> First, the server does not have to poll the client and waste cpu and network
>>>> resources on that.
>>>
>>> ...but this is a ____rare____ case. If you are seeing noticeable effects
>>> on the network from this, then something is wrong. If that is ever the
>>> case, then you should be writing through the MDS anyway.
>>>
>>> Furthermore, the MDS does need to be able to cope with NFS4ERR_DELAY
>>> anyway, so why add the extra complexity to the client?
>>>
>>>> Second, for the competing client, with notifications, it too does not have
>>>> to poll the server and can wait on getting the notification when the
>>>> layout becomes available.
>>>
>>> There is no notification of layout availability in RFC5661. Lock
>>> notification is for byte range locks, and device id notification is for
>>> device ids. The rest is for directory notifications.
>>>
>>
>> Hmm, CB_RECALLABLE_OBJ_AVAIL in response to loga_signal_layout_avail...
>
> Hmm indeed. Section 12.3 states:
>
> "CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a
> recallable object that it was denied (in case of pNFS, a layout denied
> by LAYOUTGET) due to resource exhaustion is now available."
>
> and 18.43.3 states:
>
> "If client sets loga_signal_layout_avail to TRUE, then it is registering
> with the client a "want" for a layout in the event the layout cannot be
> obtained due to resource exhaustion."
>
> I can't see how that is relevant to the case where a specific LAYOUTGET
> requires a layout recall from another client. That's not resource
> exhaustion.
>
>
>

Yeah, the phrasing is miserable.
It should be useful for any reason making the layout temporarily
unavailable. Yet another errata entry...

Benny

2010-12-17 05:19:59

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Fri, Dec 17, 2010 at 1:37 AM, Benny Halevy <[email protected]> wrote:
> On 2010-12-16 19:21, Peng Tao wrote:
>> Hi, Benny,
>>
>> On Thu, Dec 16, 2010 at 3:26 PM, Benny Halevy <[email protected]> wrote:
>>> On 2010-12-15 22:24, Trond Myklebust wrote:
>>>> On Wed, 2010-12-15 at 14:31 -0500, Trond Myklebust wrote:
>>>>> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>>>>
>>>>>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>>>>>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>>>>>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>>>>>> resulting stateid after NOMATCHING_LAYOUT].
>>>>>> Is this the serialization "crap" you're talking about?
>>>>>> What makes checking the conditions for returning NFS4ERR_DELAY to
>>>>>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>>>>>> returns asynchronously with the CB_LAYOUTRECALL?
>>>>>
>>>>> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
>>>>> at all times." (section 12.5.3).
>>>>>
>>>>> In other words, you cannot just 'do the returns asynchronously': the
>>>>> CB_LAYOUTRECALL requests are required by the protocol to be processed in
>>>>> order, which means that you must serialise those LAYOUTRETURN calls to
>>>>> ensure that they all happen in the order the wretched server expects.
>>>>
>>>> BTW: one consequence of the way the protocol was written is that you
>>>> can't just throw out a LAYOUTRETURN for the entire file if the server
>>>> just recalls a segment. Instead, you have to first return the segment,
>>>> then send the LAYOUTRETURN for the entire file.
>>>>
>>>
>>> It is true that the protocol requires the return of the exact recalled range
>>> but why can't the client do return the whole file before returning the recalled
>>> range?
>> Just for clarification, do you mean that after client returns more
>> than server recalls, clients still has to do an echoing LAYOUTRETURN?
>> It is barely overhead...
>> Why would server require some behavior like that?
>>
>
> The reason for that in the protocol was to provide a simple way to
> complete a CB_LAYOUTRECALL "meta operation" when the client returns
> the layout in smaller ranges than recalled. We overlooked this case
> of the client returning a range containing the returned range.
> which is easier to deal with, with further stateid related functionality
> we introduced after specifying how layout recall initially worked...
I see it. Thanks. The behavior is clearly defined in section 18.44.3.
And only in the subset returning case, client has to do a matching
layoutreturn to complete CB_LAYOUTRECALL.

>
> Benny
>
>>>
>>>> That part of the protocol is just one insane idea after another...
>>>>
>>>
>>> This was done to ensure that the server and client are in-sync after a
>>> CB_LAYOUTRECALL.  I agree that returning the whole layout thus resetting
>>> the layout state achieves the same goal and we should consider allowing it
>>> in the next version.
>>>
>>> Benny
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>> the body of a message to [email protected]
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>>
>



--
Thanks,
-Bergwolf

2010-12-15 18:32:19

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 7/9] Revert "pnfs-submit: wave2: Pull out all recall initiated LAYOUTRETURNS"

This reverts commit fea1ffe711454410ba5277115c9aeaf814681f4a.

Conflicts:

fs/nfs/callback_proc.c
---
fs/nfs/callback.h | 5 ++
fs/nfs/callback_proc.c | 103 ++++++++++++++++++++++++++++++++++-------------
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4state.c | 4 ++
4 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index 7f55c7e..616c5c1 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -167,6 +167,7 @@ extern unsigned nfs4_callback_layoutrecall(
extern bool matches_outstanding_recall(struct inode *ino,
struct pnfs_layout_range *range);
extern void notify_drained(struct nfs_client *clp, u64 mask);
+extern void nfs_client_return_layouts(struct nfs_client *clp);

static inline void put_session_client(struct nfs4_session *session)
{
@@ -182,6 +183,10 @@ find_client_from_cps(struct cb_process_state *cps, struct sockaddr *addr)

#else /* CONFIG_NFS_V4_1 */

+static inline void nfs_client_return_layouts(struct nfs_client *clp)
+{
+}
+
static inline struct nfs_client *
find_client_from_cps(struct cb_process_state *cps, struct sockaddr *addr)
{
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index a9d162f..b6a2903 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -160,30 +160,88 @@ matches_outstanding_recall(struct inode *ino, struct pnfs_layout_range *range)
return rv;
}

+/* Send a synchronous LAYOUTRETURN. By the time this is called, we know
+ * all IO has been drained, any matching lsegs deleted, and that no
+ * overlapping LAYOUTGETs will be sent or processed for the duration
+ * of this call.
+ * Note that it is possible that when this is called, the stateid has
+ * been invalidated. But will not be cleared, so can still use.
+ */
+static int
+pnfs_send_layoutreturn(struct nfs_client *clp,
+ struct pnfs_cb_lrecall_info *cb_info)
+{
+ struct cb_layoutrecallargs *args = &cb_info->pcl_args;
+ struct nfs4_layoutreturn *lrp;
+
+ lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
+ if (!lrp)
+ return -ENOMEM;
+ lrp->args.reclaim = 0;
+ lrp->args.layout_type = args->cbl_layout_type;
+ lrp->args.return_type = args->cbl_recall_type;
+ lrp->clp = clp;
+ if (args->cbl_recall_type == RETURN_FILE) {
+ lrp->args.range = args->cbl_range;
+ lrp->args.inode = cb_info->pcl_ino;
+ } else {
+ lrp->args.range.iomode = IOMODE_ANY;
+ lrp->args.inode = NULL;
+ }
+ return nfs4_proc_layoutreturn(lrp, true);
+}
+
+/* Called by state manager to finish CB_LAYOUTRECALLS initiated by
+ * nfs4_callback_layoutrecall().
+ */
+void nfs_client_return_layouts(struct nfs_client *clp)
+{
+ struct pnfs_cb_lrecall_info *cb_info;
+
+ spin_lock(&clp->cl_lock);
+ while (true) {
+ if (list_empty(&clp->cl_layoutrecalls)) {
+ spin_unlock(&clp->cl_lock);
+ break;
+ }
+ cb_info = list_first_entry(&clp->cl_layoutrecalls,
+ struct pnfs_cb_lrecall_info,
+ pcl_list);
+ spin_unlock(&clp->cl_lock);
+ if (atomic_read(&cb_info->pcl_count) != 0)
+ break;
+ /* What do on error return? These layoutreturns are
+ * required by the protocol. So if do not get
+ * successful reply, probably have to do something
+ * more drastic.
+ */
+ pnfs_send_layoutreturn(clp, cb_info);
+ spin_lock(&clp->cl_lock);
+ /* Removing from the list unblocks LAYOUTGETs */
+ list_del(&cb_info->pcl_list);
+ clp->cl_cb_lrecall_count--;
+ clp->cl_drain_notification[1 << cb_info->pcl_notify_bit] = NULL;
+ kfree(cb_info);
+ }
+}
+
void notify_drained(struct nfs_client *clp, u64 mask)
{
atomic_t **ptr = clp->cl_drain_notification;
+ bool done = false;

/* clp lock not needed except to remove used up entries */
/* Should probably use functions defined in bitmap.h */
while (mask) {
- if ((mask & 1) && (atomic_dec_and_test(*ptr))) {
- struct pnfs_cb_lrecall_info *cb_info;
-
- cb_info = container_of(*ptr,
- struct pnfs_cb_lrecall_info,
- pcl_count);
- spin_lock(&clp->cl_lock);
- /* Removing from the list unblocks LAYOUTGETs */
- list_del(&cb_info->pcl_list);
- clp->cl_cb_lrecall_count--;
- clp->cl_drain_notification[1 << cb_info->pcl_notify_bit] = NULL;
- spin_unlock(&clp->cl_lock);
- kfree(cb_info);
- }
+ if ((mask & 1) && (atomic_dec_and_test(*ptr)))
+ done = true;
mask >>= 1;
ptr++;
}
+ if (done) {
+ set_bit(NFS4CLNT_LAYOUT_RECALL, &clp->cl_state);
+ nfs4_schedule_state_manager(clp);
+ }
}

static int initiate_layout_draining(struct pnfs_cb_lrecall_info *cb_info)
@@ -208,9 +266,8 @@ static int initiate_layout_draining(struct pnfs_cb_lrecall_info *cb_info)
* does having a layout ref keep ino around?
* It should.
*/
- /* Without this, layout can be freed as soon
- * as we release cl_lock. Matched in
- * do_callback_layoutrecall.
+ /* We need to hold the reference until any
+ * potential LAYOUTRETURN is finished.
*/
get_layout_hdr(lo);
cb_info->pcl_ino = lo->inode;
@@ -329,18 +386,6 @@ static u32 do_callback_layoutrecall(struct nfs_client *clp,
res = NFS4ERR_NOMATCHING_LAYOUT;
}
kfree(new);
- } else {
- /* We are currently using a referenced layout */
- if (args->cbl_recall_type == RETURN_FILE) {
- struct pnfs_layout_hdr *lo;
-
- lo = NFS_I(new->pcl_ino)->layout;
- spin_lock(&lo->inode->i_lock);
- lo->plh_block_lgets--;
- spin_unlock(&lo->inode->i_lock);
- put_layout_hdr(lo);
- }
- res = NFS4ERR_DELAY;
}
out:
dprintk("%s returning %i\n", __func__, res);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 15fea61..fe5c07d 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -46,6 +46,7 @@ enum nfs4_client_state {
NFS4CLNT_DELEGRETURN,
NFS4CLNT_SESSION_RESET,
NFS4CLNT_RECALL_SLOT,
+ NFS4CLNT_LAYOUT_RECALL,
};

enum nfs4_session_state {
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index dc62928..acf3e3e 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1577,6 +1577,10 @@ static void nfs4_state_manager(struct nfs_client *clp)
nfs_client_return_marked_delegations(clp);
continue;
}
+ if (test_and_clear_bit(NFS4CLNT_LAYOUT_RECALL, &clp->cl_state)) {
+ nfs_client_return_layouts(clp);
+ continue;
+ }
/* Recall session slots */
if (test_and_clear_bit(NFS4CLNT_RECALL_SLOT, &clp->cl_state)
&& nfs4_has_session(clp)) {
--
1.7.2.3


2010-12-16 17:35:37

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Thu, 2010-12-16 at 18:24 +0200, Benny Halevy wrote:
> On 2010-12-16 17:55, Trond Myklebust wrote:
> > OK, so why not just go the whole hog and do that for all rare cases,
> > including the one where the server recalls a layout segment that we
> > happen to be doing I/O to?
> >
> > The case we should be optimising for is the one where the layout is
> > recalled, and no I/O to that segment is in progress. For that case,
> > returning OK, then doing the LAYOUTRETURN instead of just returning
> > NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
> > round trip to the server. Agreed?
>
> I agree that if the client can free the recalled layout synchronously
> and if it need not send a LAYOUTCOMMIT or LAYOUTRETURN (e.g. in the objects case)
> it can simply return NFS4ERR_NOMATCHING_LAYOUT.

Objects and blocks != wave 2. We can cross that bridge when we get to
it.

> >
> > As for the much rarer case of a recall of a layout that is in use, how
> > does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
> > going to return NFS4ERR_DELAY to the client that requested the
> > conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
> > request, at a time when the first client may or may not have returned
> > its layout segment. So how is LAYOUTRETURN going to make all this a fast
> > and scalable process?
> >
>
> First, the server does not have to poll the client and waste cpu and network
> resources on that.

...but this is a ____rare____ case. If you are seeing noticeable effects
on the network from this, then something is wrong. If that is ever the
case, then you should be writing through the MDS anyway.

Furthermore, the MDS does need to be able to cope with NFS4ERR_DELAY
anyway, so why add the extra complexity to the client?

> Second, for the competing client, with notifications, it too does not have
> to poll the server and can wait on getting the notification when the
> layout becomes available.

There is no notification of layout availability in RFC5661. Lock
notification is for byte range locks, and device id notification is for
device ids. The rest is for directory notifications.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-15 19:31:26

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
> On 2010-12-15 20:32, Trond Myklebust wrote:
> > On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
> >> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
> >> ---
> >> include/linux/nfs4.h | 1 +
> >> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
> >> 2 files changed, 24 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> >> index 8ca7700..55511e8 100644
> >> --- a/include/linux/nfs4.h
> >> +++ b/include/linux/nfs4.h
> >> @@ -557,6 +557,7 @@ enum {
> >> NFSPROC4_CLNT_RECLAIM_COMPLETE,
> >> NFSPROC4_CLNT_LAYOUTGET,
> >> NFSPROC4_CLNT_LAYOUTCOMMIT,
> >> + NFSPROC4_CLNT_LAYOUTRETURN,
> >> NFSPROC4_CLNT_GETDEVICEINFO,
> >> NFSPROC4_CLNT_PNFS_WRITE,
> >> NFSPROC4_CLNT_PNFS_COMMIT,
> >> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> >> index 9d847ac..a651574 100644
> >> --- a/include/linux/nfs_xdr.h
> >> +++ b/include/linux/nfs_xdr.h
> >> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
> >> int status;
> >> };
> >>
> >> +struct nfs4_layoutreturn_args {
> >> + __u32 reclaim;
> >> + __u32 layout_type;
> >> + __u32 return_type;
> >> + struct pnfs_layout_range range;
> >> + struct inode *inode;
> >> + struct nfs4_sequence_args seq_args;
> >> +};
> >> +
> >> +struct nfs4_layoutreturn_res {
> >> + struct nfs4_sequence_res seq_res;
> >> + u32 lrs_present;
> >> + nfs4_stateid stateid;
> >> +};
> >> +
> >> +struct nfs4_layoutreturn {
> >> + struct nfs4_layoutreturn_args args;
> >> + struct nfs4_layoutreturn_res res;
> >> + struct rpc_cred *cred;
> >> + struct nfs_client *clp;
> >> + int rpc_status;
> >> +};
> >> +
> >> struct nfs4_getdeviceinfo_args {
> >> struct pnfs_device *pdev;
> >> struct nfs4_sequence_args seq_args;
> >
> > Why? We don't need or even want layoutreturn. It adds too much
> > serialisation crap.
>
> Define "we" :)

Definition: "We who will be forced to maintain whatever is merged
upstream."

> First, the object layout driver relies on layout return for returning I/O error
> information. On the common, successful path, with return_on_close (that Panasas
> uses but others may not) I agree that CLOSE with the implicit layoutreturn
> semantics we discussed should do a good job too (accompanied with a respective
> LAYOUTCOMMIT if needed).
>
> Then, when there's a large number of outstanding layout segments (again
> prob. non-files layout presuming server implementations are going to utilize
> whole-file layouts) proactive layoutreturn comes handy in capping the
> state both the client and server keep - reducing time wasted on walking long
> lists of items.

That assumes that you have a good policy for implementing a 'proactive
layoutreturn'. What knowledge does either the client or the server have
w.r.t. whether or not part of a layout is likely to be used in the near
future other than 'file is open' or 'file is closed'?

What is the advantage to the client w.r.t. sending LAYOUTRETURN rather
than just forgetting the layout or layout segment? If the server needs
it returned, it can send a recall. If not, we are wasting processing
time by sending an unnecessary RPC call.

> For CB_LAYOUTRECALL response the heart of the debate is around synchronizing
> with layouts in-use and in-flight layoutgets. There, having the server poll
> the client, who's retuning NFS4ERR_DELAY should essentially work but may be
> inefficient and unreliable in use cases where contention is likely enough.

Define these use cases. Otherwise we're just talking generalities and
presenting circular arguments again.

> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
> for pnfs-obj) should be equivalent [note: need errata to clarify the
> resulting stateid after NOMATCHING_LAYOUT].
> Is this the serialization "crap" you're talking about?
> What makes checking the conditions for returning NFS4ERR_DELAY to
> CB_LAYOUTRECALL so different from implementing a barrier and doing the
> returns asynchronously with the CB_LAYOUTRECALL?

"CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
at all times." (section 12.5.3).

In other words, you cannot just 'do the returns asynchronously': the
CB_LAYOUTRECALL requests are required by the protocol to be processed in
order, which means that you must serialise those LAYOUTRETURN calls to
ensure that they all happen in the order the wretched server expects.


--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-16 16:24:39

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-16 17:55, Trond Myklebust wrote:
> On Thu, 2010-12-16 at 09:15 +0200, Benny Halevy wrote:
>> On 2010-12-15 21:31, Trond Myklebust wrote:
>>> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>>>> On 2010-12-15 20:32, Trond Myklebust wrote:
>>>>> On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
>>>>>> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
>>>>>> ---
>>>>>> include/linux/nfs4.h | 1 +
>>>>>> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
>>>>>> 2 files changed, 24 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
>>>>>> index 8ca7700..55511e8 100644
>>>>>> --- a/include/linux/nfs4.h
>>>>>> +++ b/include/linux/nfs4.h
>>>>>> @@ -557,6 +557,7 @@ enum {
>>>>>> NFSPROC4_CLNT_RECLAIM_COMPLETE,
>>>>>> NFSPROC4_CLNT_LAYOUTGET,
>>>>>> NFSPROC4_CLNT_LAYOUTCOMMIT,
>>>>>> + NFSPROC4_CLNT_LAYOUTRETURN,
>>>>>> NFSPROC4_CLNT_GETDEVICEINFO,
>>>>>> NFSPROC4_CLNT_PNFS_WRITE,
>>>>>> NFSPROC4_CLNT_PNFS_COMMIT,
>>>>>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>>>>>> index 9d847ac..a651574 100644
>>>>>> --- a/include/linux/nfs_xdr.h
>>>>>> +++ b/include/linux/nfs_xdr.h
>>>>>> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
>>>>>> int status;
>>>>>> };
>>>>>>
>>>>>> +struct nfs4_layoutreturn_args {
>>>>>> + __u32 reclaim;
>>>>>> + __u32 layout_type;
>>>>>> + __u32 return_type;
>>>>>> + struct pnfs_layout_range range;
>>>>>> + struct inode *inode;
>>>>>> + struct nfs4_sequence_args seq_args;
>>>>>> +};
>>>>>> +
>>>>>> +struct nfs4_layoutreturn_res {
>>>>>> + struct nfs4_sequence_res seq_res;
>>>>>> + u32 lrs_present;
>>>>>> + nfs4_stateid stateid;
>>>>>> +};
>>>>>> +
>>>>>> +struct nfs4_layoutreturn {
>>>>>> + struct nfs4_layoutreturn_args args;
>>>>>> + struct nfs4_layoutreturn_res res;
>>>>>> + struct rpc_cred *cred;
>>>>>> + struct nfs_client *clp;
>>>>>> + int rpc_status;
>>>>>> +};
>>>>>> +
>>>>>> struct nfs4_getdeviceinfo_args {
>>>>>> struct pnfs_device *pdev;
>>>>>> struct nfs4_sequence_args seq_args;
>>>>>
>>>>> Why? We don't need or even want layoutreturn. It adds too much
>>>>> serialisation crap.
>>>>
>>>> Define "we" :)
>>>
>>> Definition: "We who will be forced to maintain whatever is merged
>>> upstream."
>>>
>>>> First, the object layout driver relies on layout return for returning I/O error
>>>> information. On the common, successful path, with return_on_close (that Panasas
>>>> uses but others may not) I agree that CLOSE with the implicit layoutreturn
>>>> semantics we discussed should do a good job too (accompanied with a respective
>>>> LAYOUTCOMMIT if needed).
>>>>
>>>> Then, when there's a large number of outstanding layout segments (again
>>>> prob. non-files layout presuming server implementations are going to utilize
>>>> whole-file layouts) proactive layoutreturn comes handy in capping the
>>>> state both the client and server keep - reducing time wasted on walking long
>>>> lists of items.
>>>
>>> That assumes that you have a good policy for implementing a 'proactive
>>> layoutreturn'. What knowledge does either the client or the server have
>>> w.r.t. whether or not part of a layout is likely to be used in the near
>>> future other than 'file is open' or 'file is closed'?
>>>
>>
>> The client can cache layout segments using a least recently used policy.
>>
>>> What is the advantage to the client w.r.t. sending LAYOUTRETURN rather
>>> than just forgetting the layout or layout segment? If the server needs
>>> it returned, it can send a recall. If not, we are wasting processing
>>> time by sending an unnecessary RPC call.
>>>
>>
>> The client can know better than the server which layout segments is is more
>> likely to reuse since the MDS does not see the layout usage activity
>> (as it goes to the DS's).
>
> How does it do that? The client isn't in control here; the application
> is.

Of course no one can predict reuse patterns better than the application,

>
> Sure you can track sequential writes and figure out which segment is
> going to be needed next, but that usually doesn't help you figure out
> the segment _reuse_ case.

Given a finite cache on the client side it can evict old layout segments
to make room for new ones, just dropping them means that the server
is required to keep track of more state for the clients.

> The layout segment reuse case actually corresponds to data access
> patterns where it would usually make more sense for the client to cache
> instead of doing I/O (unless we're talking random I/O, but then the
> client won't know much more about layout access patterns either).
>

The random I/O case is a good example where the client can return layout
segments early on.

>> Similarly, for CB_RECALL_ANY, the client chooses what layouts to return.
>> Rather than dropping all the layouts it should return only the least likely
>> to be reused.
>
> That is more easily done, since both the client and server do know which
> files are open and which aren't. Use layout return on close to deal with
> this situation.
>

That doesn't work for large files, with layout segments (less likely in the
files-layout model), that are opened for long periods of time.

>>>> For CB_LAYOUTRECALL response the heart of the debate is around synchronizing
>>>> with layouts in-use and in-flight layoutgets. There, having the server poll
>>>> the client, who's retuning NFS4ERR_DELAY should essentially work but may be
>>>> inefficient and unreliable in use cases where contention is likely enough.
>>>
>>> Define these use cases. Otherwise we're just talking generalities and
>>> presenting circular arguments again.
>>>
>>
>> The most common for Panasas is write sharing where multiple clients
>> collaboratively write into a file in parallel. Although different clients
>> write into disjoint byte ranges in the file they cross RAID stripe boundaries
>> (which they're not aware of) and since only one client is allowed to write
>> into a RAID stripe at a time, the layout is being recalled whenever the
>> server detects a conflict.
>
> So basically, we're talking about the case of a shared database doing
> O_DIRECT without taking striping into account? Something like Oracle
> certainly allows you to tune its database block size to match the
> striping. If you are looking for performance, why wouldn't you do that?
>

Not really, for use its number-crunching clustered applications. Their
block sizes depend more on the number of clients and on the data set.

>>>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>>>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>>>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>>>> resulting stateid after NOMATCHING_LAYOUT].
>>>> Is this the serialization "crap" you're talking about?
>>>> What makes checking the conditions for returning NFS4ERR_DELAY to
>>>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>>>> returns asynchronously with the CB_LAYOUTRECALL?
>>>
>>> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
>>> at all times." (section 12.5.3).
>>>
>>> In other words, you cannot just 'do the returns asynchronously': the
>>> CB_LAYOUTRECALL requests are required by the protocol to be processed in
>>> order, which means that you must serialise those LAYOUTRETURN calls to
>>> ensure that they all happen in the order the wretched server expects.
>>>
>>>
>>
>> To simplify this (presumably rare) case what I had in mind is returning
>> NFS4ERR_DELAY if there's a conflicting layout recall in progress.
>
> OK, so why not just go the whole hog and do that for all rare cases,
> including the one where the server recalls a layout segment that we
> happen to be doing I/O to?
>
> The case we should be optimising for is the one where the layout is
> recalled, and no I/O to that segment is in progress. For that case,
> returning OK, then doing the LAYOUTRETURN instead of just returning
> NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
> round trip to the server. Agreed?

I agree that if the client can free the recalled layout synchronously
and if it need not send a LAYOUTCOMMIT or LAYOUTRETURN (e.g. in the objects case)
it can simply return NFS4ERR_NOMATCHING_LAYOUT.

>
> As for the much rarer case of a recall of a layout that is in use, how
> does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
> going to return NFS4ERR_DELAY to the client that requested the
> conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
> request, at a time when the first client may or may not have returned
> its layout segment. So how is LAYOUTRETURN going to make all this a fast
> and scalable process?
>

First, the server does not have to poll the client and waste cpu and network
resources on that.

Second, for the competing client, with notifications, it too does not have
to poll the server and can wait on getting the notification when the
layout becomes available.

Benny

2010-12-15 18:32:34

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 8/9] Revert "pnfs-submit: wave2: Don't wait in layoutget"

This reverts commit 9628f174680a6ec76aa6e1afa8678631a5efd39b.

Conflicts:

fs/nfs/nfs4proc.c
fs/nfs/pnfs.c
---
fs/nfs/callback_proc.c | 4 +++
fs/nfs/client.c | 2 +
fs/nfs/inode.c | 1 +
fs/nfs/nfs4proc.c | 47 ++++++++++++++++++++++++++++++++++-
fs/nfs/pnfs.c | 58 +++++++++++++++++++++------------------------
fs/nfs/pnfs.h | 1 +
include/linux/nfs_fs.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
8 files changed, 82 insertions(+), 33 deletions(-)

diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index b6a2903..6d48236 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -221,6 +221,7 @@ void nfs_client_return_layouts(struct nfs_client *clp)
list_del(&cb_info->pcl_list);
clp->cl_cb_lrecall_count--;
clp->cl_drain_notification[1 << cb_info->pcl_notify_bit] = NULL;
+ rpc_wake_up(&clp->cl_rpcwaitq_recall);
kfree(cb_info);
}
}
@@ -372,6 +373,7 @@ static u32 do_callback_layoutrecall(struct nfs_client *clp,
list_del(&new->pcl_list);
clp->cl_cb_lrecall_count--;
clp->cl_drain_notification[1 << bit_num] = NULL;
+ rpc_wake_up(&clp->cl_rpcwaitq_recall);
spin_unlock(&clp->cl_lock);
if (res == NFS4_OK) {
if (args->cbl_recall_type == RETURN_FILE) {
@@ -380,6 +382,8 @@ static u32 do_callback_layoutrecall(struct nfs_client *clp,
lo = NFS_I(new->pcl_ino)->layout;
spin_lock(&lo->inode->i_lock);
lo->plh_block_lgets--;
+ if (!pnfs_layoutgets_blocked(lo, NULL))
+ rpc_wake_up(&NFS_I(lo->inode)->lo_rpcwaitq_stateid);
spin_unlock(&lo->inode->i_lock);
put_layout_hdr(lo);
}
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f8e712f..172175f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -159,6 +159,8 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
#if defined(CONFIG_NFS_V4_1)
INIT_LIST_HEAD(&clp->cl_layouts);
INIT_LIST_HEAD(&clp->cl_layoutrecalls);
+ rpc_init_wait_queue(&clp->cl_rpcwaitq_recall,
+ "NFS client CB_LAYOUTRECALLS");
#endif
nfs_fscache_get_client_cookie(clp);

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e557d96..1e19d5d 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1461,6 +1461,7 @@ static inline void nfs4_init_once(struct nfs_inode *nfsi)
nfsi->delegation_state = 0;
init_rwsem(&nfsi->rwsem);
rpc_init_wait_queue(&nfsi->lo_rpcwaitq, "pNFS Layoutreturn");
+ rpc_init_wait_queue(&nfsi->lo_rpcwaitq_stateid, "pNFS Layoutstateid");
nfsi->layout = NULL;
#endif
}
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 1c79c09..b0a48d8 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5357,17 +5357,43 @@ static void
nfs4_layoutget_prepare(struct rpc_task *task, void *calldata)
{
struct nfs4_layoutget *lgp = calldata;
- struct nfs_server *server = NFS_SERVER(lgp->args.inode);
+ struct inode *ino = lgp->args.inode;
+ struct nfs_inode *nfsi = NFS_I(ino);
+ struct nfs_server *server = NFS_SERVER(ino);
+ struct nfs_client *clp = NFS_SERVER(ino)->nfs_client;

dprintk("--> %s\n", __func__);
+ spin_lock(&clp->cl_lock);
+ if (matches_outstanding_recall(ino, &lgp->args.range)) {
+ rpc_sleep_on(&clp->cl_rpcwaitq_recall, task, NULL);
+ spin_unlock(&clp->cl_lock);
+ return;
+ }
+ spin_unlock(&clp->cl_lock);
/* Note the is a race here, where a CB_LAYOUTRECALL can come in
* right now covering the LAYOUTGET we are about to send.
* However, that is not so catastrophic, and there seems
* to be no way to prevent it completely.
*/
+ spin_lock(&ino->i_lock);
+ if (pnfs_layoutgets_blocked(nfsi->layout, NULL)) {
+ rpc_sleep_on(&nfsi->lo_rpcwaitq_stateid, task, NULL);
+ spin_unlock(&ino->i_lock);
+ return;
+ }
+ /* This needs after above check but atomic with it in order to properly
+ * serialize openstateid LAYOUTGETs.
+ */
+ atomic_inc(&nfsi->layout->plh_outstanding);
+ spin_unlock(&ino->i_lock);
+
if (nfs4_setup_sequence(server, NULL, &lgp->args.seq_args,
- &lgp->res.seq_res, 0, task))
+ &lgp->res.seq_res, 0, task)) {
+ spin_lock(&ino->i_lock);
+ atomic_dec(&nfsi->layout->plh_outstanding);
+ spin_unlock(&ino->i_lock);
return;
+ }
rpc_call_start(task);
}

@@ -5395,6 +5421,11 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)
/* Fall through */
default:
if (nfs4_async_handle_error(task, server, NULL, NULL) == -EAGAIN) {
+ struct inode *ino = lgp->args.inode;
+
+ spin_lock(&ino->i_lock);
+ atomic_dec(&NFS_I(ino)->layout->plh_outstanding);
+ spin_unlock(&ino->i_lock);
rpc_restart_call_prepare(task);
return;
}
@@ -5456,6 +5487,16 @@ int nfs4_proc_layoutget(struct nfs4_layoutget *lgp)
status = task->tk_status;
if (status == 0)
status = pnfs_layout_process(lgp);
+ else {
+ struct inode *ino = lgp->args.inode;
+ struct pnfs_layout_hdr *lo = NFS_I(ino)->layout;
+
+ spin_lock(&ino->i_lock);
+ atomic_dec(&lo->plh_outstanding);
+ if (!pnfs_layoutgets_blocked(lo, NULL))
+ rpc_wake_up(&NFS_I(ino)->lo_rpcwaitq_stateid);
+ spin_unlock(&ino->i_lock);
+ }
rpc_put_task(task);
dprintk("<-- %s status=%d\n", __func__, status);
return status;
@@ -5614,6 +5655,8 @@ static void nfs4_layoutreturn_release(void *calldata)
spin_lock(&ino->i_lock);
lo->plh_block_lgets--;
atomic_dec(&lo->plh_outstanding);
+ if (!pnfs_layoutgets_blocked(lo, NULL))
+ rpc_wake_up(&NFS_I(ino)->lo_rpcwaitq_stateid);
spin_unlock(&ino->i_lock);
put_layout_hdr(lo);
}
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 8b44c41..b778032 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -308,6 +308,8 @@ _put_lseg_common(struct pnfs_layout_segment *lseg)
list_del_init(&lseg->layout->layouts);
spin_unlock(&clp->cl_lock);
clear_bit(NFS_LAYOUT_BULK_RECALL, &lseg->layout->plh_flags);
+ if (!pnfs_layoutgets_blocked(lseg->layout, NULL))
+ rpc_wake_up(&NFS_I(ino)->lo_rpcwaitq_stateid);
}
rpc_wake_up(&NFS_I(ino)->lo_rpcwaitq);
}
@@ -481,21 +483,6 @@ pnfs_set_layout_stateid(struct pnfs_layout_hdr *lo, const nfs4_stateid *new,
}
}

-/* lget is set to 1 if called from inside send_layoutget call chain */
-static bool
-pnfs_layoutgets_blocked(struct pnfs_layout_hdr *lo, nfs4_stateid *stateid,
- int lget)
-{
- assert_spin_locked(&lo->inode->i_lock);
- if ((stateid) &&
- (int)(lo->plh_barrier - be32_to_cpu(stateid->stateid.seqid)) >= 0)
- return true;
- return lo->plh_block_lgets ||
- test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
- (list_empty(&lo->segs) &&
- (atomic_read(&lo->plh_outstanding) > lget));
-}
-
int
pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,
struct nfs4_state *open_state)
@@ -504,7 +491,8 @@ pnfs_choose_layoutget_stateid(nfs4_stateid *dst, struct pnfs_layout_hdr *lo,

dprintk("--> %s\n", __func__);
spin_lock(&lo->inode->i_lock);
- if (pnfs_layoutgets_blocked(lo, NULL, 1)) {
+ if (lo->plh_block_lgets ||
+ test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
/* We avoid -EAGAIN, as that has special meaning to
* some callers.
*/
@@ -719,6 +707,9 @@ pnfs_insert_layout(struct pnfs_layout_hdr *lo,
}
if (!found) {
list_add_tail(&lseg->fi_list, &lo->segs);
+ if (list_is_singular(&lo->segs) &&
+ !pnfs_layoutgets_blocked(lo, NULL))
+ rpc_wake_up(&NFS_I(lo->inode)->lo_rpcwaitq_stateid);
dprintk("%s: inserted lseg %p "
"iomode %d offset %llu length %llu at tail\n",
__func__, lseg, lseg->range.iomode,
@@ -836,13 +827,6 @@ pnfs_update_layout(struct inode *ino,

if (!pnfs_enabled_sb(NFS_SERVER(ino)))
return NULL;
- spin_lock(&clp->cl_lock);
- if (matches_outstanding_recall(ino, &arg)) {
- dprintk("%s matches recall, use MDS\n", __func__);
- spin_unlock(&clp->cl_lock);
- return NULL;
- }
- spin_unlock(&clp->cl_lock);
spin_lock(&ino->i_lock);
lo = pnfs_find_alloc_layout(ino);
if (lo == NULL) {
@@ -859,10 +843,6 @@ pnfs_update_layout(struct inode *ino,
if (test_bit(lo_fail_bit(iomode), &nfsi->layout->plh_flags))
goto out_unlock;

- if (pnfs_layoutgets_blocked(lo, NULL, 0))
- goto out_unlock;
- atomic_inc(&lo->plh_outstanding);
-
get_layout_hdr(lo); /* Matched in pnfs_layoutget_release */
if (list_empty(&lo->segs)) {
/* The lo must be on the clp list if there is any
@@ -886,8 +866,6 @@ pnfs_update_layout(struct inode *ino,
}
spin_unlock(&ino->i_lock);
}
- atomic_dec(&lo->plh_outstanding);
- spin_unlock(&ino->i_lock);
out:
dprintk("%s end, state 0x%lx lseg %p\n", __func__,
nfsi->layout->plh_flags, lseg);
@@ -897,6 +875,19 @@ out_unlock:
goto out;
}

+bool
+pnfs_layoutgets_blocked(struct pnfs_layout_hdr *lo, nfs4_stateid *stateid)
+{
+ assert_spin_locked(&lo->inode->i_lock);
+ if ((stateid) &&
+ (int)(lo->plh_barrier - be32_to_cpu(stateid->stateid.seqid)) >= 0)
+ return true;
+ return lo->plh_block_lgets ||
+ test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags) ||
+ (list_empty(&lo->segs) &&
+ (atomic_read(&lo->plh_outstanding) != 0));
+}
+
int
pnfs_layout_process(struct nfs4_layoutget *lgp)
{
@@ -927,11 +918,13 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
status = PTR_ERR(lseg);
dprintk("%s: Could not allocate layout: error %d\n",
__func__, status);
+ spin_lock(&ino->i_lock);
goto out;
}

spin_lock(&ino->i_lock);
/* decrement needs to be done before call to pnfs_layoutget_blocked */
+ atomic_dec(&lo->plh_outstanding);
spin_lock(&clp->cl_lock);
if (matches_outstanding_recall(ino, &res->range)) {
spin_unlock(&clp->cl_lock);
@@ -940,7 +933,7 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
}
spin_unlock(&clp->cl_lock);

- if (pnfs_layoutgets_blocked(lo, &res->stateid, 1)) {
+ if (pnfs_layoutgets_blocked(lo, &res->stateid)) {
dprintk("%s forget reply due to state\n", __func__);
goto out_forget_reply;
}
@@ -960,14 +953,17 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)

/* Done processing layoutget. Set the layout stateid */
pnfs_set_layout_stateid(lo, &res->stateid, false);
- spin_unlock(&ino->i_lock);
out:
+ if (!pnfs_layoutgets_blocked(lo, NULL))
+ rpc_wake_up(&NFS_I(ino)->lo_rpcwaitq_stateid);
+ spin_unlock(&ino->i_lock);
return status;

out_forget_reply:
spin_unlock(&ino->i_lock);
lseg->layout = lo;
NFS_SERVER(ino)->pnfs_curr_ld->free_lseg(lseg);
+ spin_lock(&ino->i_lock);
goto out;
}

diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index b011b3c..3585bd2 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -218,6 +218,7 @@ enum pnfs_try_status pnfs_try_to_commit(struct nfs_write_data *,
void pnfs_pageio_init_read(struct nfs_pageio_descriptor *, struct inode *,
struct nfs_open_context *, struct list_head *);
void pnfs_pageio_init_write(struct nfs_pageio_descriptor *, struct inode *);
+bool pnfs_layoutgets_blocked(struct pnfs_layout_hdr *lo, nfs4_stateid *stateid);
int pnfs_layout_process(struct nfs4_layoutget *lgp);
void pnfs_free_lseg_list(struct list_head *tmp_list);
void pnfs_destroy_layout(struct nfs_inode *);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index caed83e..b4bb8d6 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -191,6 +191,7 @@ struct nfs_inode {

/* pNFS layout information */
struct rpc_wait_queue lo_rpcwaitq;
+ struct rpc_wait_queue lo_rpcwaitq_stateid;
struct pnfs_layout_hdr *layout;
#endif /* CONFIG_NFS_V4*/
#ifdef CONFIG_NFS_FSCACHE
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index b02f486..96cb62f 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -88,6 +88,7 @@ struct nfs_client {
unsigned long cl_cb_lrecall_count;
#define PNFS_MAX_CB_LRECALLS (64)
atomic_t *cl_drain_notification[PNFS_MAX_CB_LRECALLS];
+ struct rpc_wait_queue cl_rpcwaitq_recall;
struct pnfs_deviceid_cache *cl_devid_cache; /* pNFS deviceid cache */
#endif /* CONFIG_NFS_V4_1 */

--
1.7.2.3


2010-12-16 18:14:43

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Thu, 2010-12-16 at 19:42 +0200, Benny Halevy wrote:
> On 2010-12-16 19:35, Trond Myklebust wrote:
> > On Thu, 2010-12-16 at 18:24 +0200, Benny Halevy wrote:
> >> On 2010-12-16 17:55, Trond Myklebust wrote:
> >>> OK, so why not just go the whole hog and do that for all rare cases,
> >>> including the one where the server recalls a layout segment that we
> >>> happen to be doing I/O to?
> >>>
> >>> The case we should be optimising for is the one where the layout is
> >>> recalled, and no I/O to that segment is in progress. For that case,
> >>> returning OK, then doing the LAYOUTRETURN instead of just returning
> >>> NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
> >>> round trip to the server. Agreed?
> >>
> >> I agree that if the client can free the recalled layout synchronously
> >> and if it need not send a LAYOUTCOMMIT or LAYOUTRETURN (e.g. in the objects case)
> >> it can simply return NFS4ERR_NOMATCHING_LAYOUT.
> >
> > Objects and blocks != wave 2. We can cross that bridge when we get to
> > it.
> >
>
> Right. This patchset is destined as post wave2.

In that case it has a very confusing title (which certainly caught me by
surprise).

>
> >>>
> >>> As for the much rarer case of a recall of a layout that is in use, how
> >>> does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
> >>> going to return NFS4ERR_DELAY to the client that requested the
> >>> conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
> >>> request, at a time when the first client may or may not have returned
> >>> its layout segment. So how is LAYOUTRETURN going to make all this a fast
> >>> and scalable process?
> >>>
> >>
> >> First, the server does not have to poll the client and waste cpu and network
> >> resources on that.
> >
> > ...but this is a ____rare____ case. If you are seeing noticeable effects
> > on the network from this, then something is wrong. If that is ever the
> > case, then you should be writing through the MDS anyway.
> >
> > Furthermore, the MDS does need to be able to cope with NFS4ERR_DELAY
> > anyway, so why add the extra complexity to the client?
> >
> >> Second, for the competing client, with notifications, it too does not have
> >> to poll the server and can wait on getting the notification when the
> >> layout becomes available.
> >
> > There is no notification of layout availability in RFC5661. Lock
> > notification is for byte range locks, and device id notification is for
> > device ids. The rest is for directory notifications.
> >
>
> Hmm, CB_RECALLABLE_OBJ_AVAIL in response to loga_signal_layout_avail...

Hmm indeed. Section 12.3 states:

"CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a
recallable object that it was denied (in case of pNFS, a layout denied
by LAYOUTGET) due to resource exhaustion is now available."

and 18.43.3 states:

"If client sets loga_signal_layout_avail to TRUE, then it is registering
with the client a "want" for a layout in the event the layout cannot be
obtained due to resource exhaustion."

I can't see how that is relevant to the case where a specific LAYOUTGET
requires a layout recall from another client. That's not resource
exhaustion.



--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-16 17:21:38

by Peng Tao

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

Hi, Benny,

On Thu, Dec 16, 2010 at 3:26 PM, Benny Halevy <[email protected]> wrote:
> On 2010-12-15 22:24, Trond Myklebust wrote:
>> On Wed, 2010-12-15 at 14:31 -0500, Trond Myklebust wrote:
>>> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>>
>>>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>>>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>>>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>>>> resulting stateid after NOMATCHING_LAYOUT].
>>>> Is this the serialization "crap" you're talking about?
>>>> What makes checking the conditions for returning NFS4ERR_DELAY to
>>>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>>>> returns asynchronously with the CB_LAYOUTRECALL?
>>>
>>> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
>>> at all times." (section 12.5.3).
>>>
>>> In other words, you cannot just 'do the returns asynchronously': the
>>> CB_LAYOUTRECALL requests are required by the protocol to be processed in
>>> order, which means that you must serialise those LAYOUTRETURN calls to
>>> ensure that they all happen in the order the wretched server expects.
>>
>> BTW: one consequence of the way the protocol was written is that you
>> can't just throw out a LAYOUTRETURN for the entire file if the server
>> just recalls a segment. Instead, you have to first return the segment,
>> then send the LAYOUTRETURN for the entire file.
>>
>
> It is true that the protocol requires the return of the exact recalled range
> but why can't the client do return the whole file before returning the recalled
> range?
Just for clarification, do you mean that after client returns more
than server recalls, clients still has to do an echoing LAYOUTRETURN?
It is barely overhead...
Why would server require some behavior like that?

>
>> That part of the protocol is just one insane idea after another...
>>
>
> This was done to ensure that the server and client are in-sync after a
> CB_LAYOUTRECALL.  I agree that returning the whole layout thus resetting
> the layout state achieves the same goal and we should consider allowing it
> in the next version.
>
> Benny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



--
Thanks,
-Bergwolf

2010-12-16 17:37:07

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-16 19:21, Peng Tao wrote:
> Hi, Benny,
>
> On Thu, Dec 16, 2010 at 3:26 PM, Benny Halevy <[email protected]> wrote:
>> On 2010-12-15 22:24, Trond Myklebust wrote:
>>> On Wed, 2010-12-15 at 14:31 -0500, Trond Myklebust wrote:
>>>> On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
>>>
>>>>> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
>>>>> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
>>>>> for pnfs-obj) should be equivalent [note: need errata to clarify the
>>>>> resulting stateid after NOMATCHING_LAYOUT].
>>>>> Is this the serialization "crap" you're talking about?
>>>>> What makes checking the conditions for returning NFS4ERR_DELAY to
>>>>> CB_LAYOUTRECALL so different from implementing a barrier and doing the
>>>>> returns asynchronously with the CB_LAYOUTRECALL?
>>>>
>>>> "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
>>>> at all times." (section 12.5.3).
>>>>
>>>> In other words, you cannot just 'do the returns asynchronously': the
>>>> CB_LAYOUTRECALL requests are required by the protocol to be processed in
>>>> order, which means that you must serialise those LAYOUTRETURN calls to
>>>> ensure that they all happen in the order the wretched server expects.
>>>
>>> BTW: one consequence of the way the protocol was written is that you
>>> can't just throw out a LAYOUTRETURN for the entire file if the server
>>> just recalls a segment. Instead, you have to first return the segment,
>>> then send the LAYOUTRETURN for the entire file.
>>>
>>
>> It is true that the protocol requires the return of the exact recalled range
>> but why can't the client do return the whole file before returning the recalled
>> range?
> Just for clarification, do you mean that after client returns more
> than server recalls, clients still has to do an echoing LAYOUTRETURN?
> It is barely overhead...
> Why would server require some behavior like that?
>

The reason for that in the protocol was to provide a simple way to
complete a CB_LAYOUTRECALL "meta operation" when the client returns
the layout in smaller ranges than recalled. We overlooked this case
of the client returning a range containing the returned range.
which is easier to deal with, with further stateid related functionality
we introduced after specifying how layout recall initially worked...

Benny

>>
>>> That part of the protocol is just one insane idea after another...
>>>
>>
>> This was done to ensure that the server and client are in-sync after a
>> CB_LAYOUTRECALL. I agree that returning the whole layout thus resetting
>> the layout state achieves the same goal and we should consider allowing it
>> in the next version.
>>
>> Benny
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>
>

2010-12-16 17:42:57

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-16 19:35, Trond Myklebust wrote:
> On Thu, 2010-12-16 at 18:24 +0200, Benny Halevy wrote:
>> On 2010-12-16 17:55, Trond Myklebust wrote:
>>> OK, so why not just go the whole hog and do that for all rare cases,
>>> including the one where the server recalls a layout segment that we
>>> happen to be doing I/O to?
>>>
>>> The case we should be optimising for is the one where the layout is
>>> recalled, and no I/O to that segment is in progress. For that case,
>>> returning OK, then doing the LAYOUTRETURN instead of just returning
>>> NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
>>> round trip to the server. Agreed?
>>
>> I agree that if the client can free the recalled layout synchronously
>> and if it need not send a LAYOUTCOMMIT or LAYOUTRETURN (e.g. in the objects case)
>> it can simply return NFS4ERR_NOMATCHING_LAYOUT.
>
> Objects and blocks != wave 2. We can cross that bridge when we get to
> it.
>

Right. This patchset is destined as post wave2.

>>>
>>> As for the much rarer case of a recall of a layout that is in use, how
>>> does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
>>> going to return NFS4ERR_DELAY to the client that requested the
>>> conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
>>> request, at a time when the first client may or may not have returned
>>> its layout segment. So how is LAYOUTRETURN going to make all this a fast
>>> and scalable process?
>>>
>>
>> First, the server does not have to poll the client and waste cpu and network
>> resources on that.
>
> ...but this is a ____rare____ case. If you are seeing noticeable effects
> on the network from this, then something is wrong. If that is ever the
> case, then you should be writing through the MDS anyway.
>
> Furthermore, the MDS does need to be able to cope with NFS4ERR_DELAY
> anyway, so why add the extra complexity to the client?
>
>> Second, for the competing client, with notifications, it too does not have
>> to poll the server and can wait on getting the notification when the
>> layout becomes available.
>
> There is no notification of layout availability in RFC5661. Lock
> notification is for byte range locks, and device id notification is for
> device ids. The rest is for directory notifications.
>

Hmm, CB_RECALLABLE_OBJ_AVAIL in response to loga_signal_layout_avail...

Benny

2010-12-15 18:31:00

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
---
include/linux/nfs4.h | 1 +
include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 8ca7700..55511e8 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -557,6 +557,7 @@ enum {
NFSPROC4_CLNT_RECLAIM_COMPLETE,
NFSPROC4_CLNT_LAYOUTGET,
NFSPROC4_CLNT_LAYOUTCOMMIT,
+ NFSPROC4_CLNT_LAYOUTRETURN,
NFSPROC4_CLNT_GETDEVICEINFO,
NFSPROC4_CLNT_PNFS_WRITE,
NFSPROC4_CLNT_PNFS_COMMIT,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 9d847ac..a651574 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
int status;
};

+struct nfs4_layoutreturn_args {
+ __u32 reclaim;
+ __u32 layout_type;
+ __u32 return_type;
+ struct pnfs_layout_range range;
+ struct inode *inode;
+ struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_layoutreturn_res {
+ struct nfs4_sequence_res seq_res;
+ u32 lrs_present;
+ nfs4_stateid stateid;
+};
+
+struct nfs4_layoutreturn {
+ struct nfs4_layoutreturn_args args;
+ struct nfs4_layoutreturn_res res;
+ struct rpc_cred *cred;
+ struct nfs_client *clp;
+ int rpc_status;
+};
+
struct nfs4_getdeviceinfo_args {
struct pnfs_device *pdev;
struct nfs4_sequence_args seq_args;
--
1.7.2.3


2010-12-16 15:55:42

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On Thu, 2010-12-16 at 09:15 +0200, Benny Halevy wrote:
> On 2010-12-15 21:31, Trond Myklebust wrote:
> > On Wed, 2010-12-15 at 20:51 +0200, Benny Halevy wrote:
> >> On 2010-12-15 20:32, Trond Myklebust wrote:
> >>> On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
> >>>> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
> >>>> ---
> >>>> include/linux/nfs4.h | 1 +
> >>>> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
> >>>> 2 files changed, 24 insertions(+), 0 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> >>>> index 8ca7700..55511e8 100644
> >>>> --- a/include/linux/nfs4.h
> >>>> +++ b/include/linux/nfs4.h
> >>>> @@ -557,6 +557,7 @@ enum {
> >>>> NFSPROC4_CLNT_RECLAIM_COMPLETE,
> >>>> NFSPROC4_CLNT_LAYOUTGET,
> >>>> NFSPROC4_CLNT_LAYOUTCOMMIT,
> >>>> + NFSPROC4_CLNT_LAYOUTRETURN,
> >>>> NFSPROC4_CLNT_GETDEVICEINFO,
> >>>> NFSPROC4_CLNT_PNFS_WRITE,
> >>>> NFSPROC4_CLNT_PNFS_COMMIT,
> >>>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> >>>> index 9d847ac..a651574 100644
> >>>> --- a/include/linux/nfs_xdr.h
> >>>> +++ b/include/linux/nfs_xdr.h
> >>>> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
> >>>> int status;
> >>>> };
> >>>>
> >>>> +struct nfs4_layoutreturn_args {
> >>>> + __u32 reclaim;
> >>>> + __u32 layout_type;
> >>>> + __u32 return_type;
> >>>> + struct pnfs_layout_range range;
> >>>> + struct inode *inode;
> >>>> + struct nfs4_sequence_args seq_args;
> >>>> +};
> >>>> +
> >>>> +struct nfs4_layoutreturn_res {
> >>>> + struct nfs4_sequence_res seq_res;
> >>>> + u32 lrs_present;
> >>>> + nfs4_stateid stateid;
> >>>> +};
> >>>> +
> >>>> +struct nfs4_layoutreturn {
> >>>> + struct nfs4_layoutreturn_args args;
> >>>> + struct nfs4_layoutreturn_res res;
> >>>> + struct rpc_cred *cred;
> >>>> + struct nfs_client *clp;
> >>>> + int rpc_status;
> >>>> +};
> >>>> +
> >>>> struct nfs4_getdeviceinfo_args {
> >>>> struct pnfs_device *pdev;
> >>>> struct nfs4_sequence_args seq_args;
> >>>
> >>> Why? We don't need or even want layoutreturn. It adds too much
> >>> serialisation crap.
> >>
> >> Define "we" :)
> >
> > Definition: "We who will be forced to maintain whatever is merged
> > upstream."
> >
> >> First, the object layout driver relies on layout return for returning I/O error
> >> information. On the common, successful path, with return_on_close (that Panasas
> >> uses but others may not) I agree that CLOSE with the implicit layoutreturn
> >> semantics we discussed should do a good job too (accompanied with a respective
> >> LAYOUTCOMMIT if needed).
> >>
> >> Then, when there's a large number of outstanding layout segments (again
> >> prob. non-files layout presuming server implementations are going to utilize
> >> whole-file layouts) proactive layoutreturn comes handy in capping the
> >> state both the client and server keep - reducing time wasted on walking long
> >> lists of items.
> >
> > That assumes that you have a good policy for implementing a 'proactive
> > layoutreturn'. What knowledge does either the client or the server have
> > w.r.t. whether or not part of a layout is likely to be used in the near
> > future other than 'file is open' or 'file is closed'?
> >
>
> The client can cache layout segments using a least recently used policy.
>
> > What is the advantage to the client w.r.t. sending LAYOUTRETURN rather
> > than just forgetting the layout or layout segment? If the server needs
> > it returned, it can send a recall. If not, we are wasting processing
> > time by sending an unnecessary RPC call.
> >
>
> The client can know better than the server which layout segments is is more
> likely to reuse since the MDS does not see the layout usage activity
> (as it goes to the DS's).

How does it do that? The client isn't in control here; the application
is.

Sure you can track sequential writes and figure out which segment is
going to be needed next, but that usually doesn't help you figure out
the segment _reuse_ case.
The layout segment reuse case actually corresponds to data access
patterns where it would usually make more sense for the client to cache
instead of doing I/O (unless we're talking random I/O, but then the
client won't know much more about layout access patterns either).

> Similarly, for CB_RECALL_ANY, the client chooses what layouts to return.
> Rather than dropping all the layouts it should return only the least likely
> to be reused.

That is more easily done, since both the client and server do know which
files are open and which aren't. Use layout return on close to deal with
this situation.

> >> For CB_LAYOUTRECALL response the heart of the debate is around synchronizing
> >> with layouts in-use and in-flight layoutgets. There, having the server poll
> >> the client, who's retuning NFS4ERR_DELAY should essentially work but may be
> >> inefficient and unreliable in use cases where contention is likely enough.
> >
> > Define these use cases. Otherwise we're just talking generalities and
> > presenting circular arguments again.
> >
>
> The most common for Panasas is write sharing where multiple clients
> collaboratively write into a file in parallel. Although different clients
> write into disjoint byte ranges in the file they cross RAID stripe boundaries
> (which they're not aware of) and since only one client is allowed to write
> into a RAID stripe at a time, the layout is being recalled whenever the
> server detects a conflict.

So basically, we're talking about the case of a shared database doing
O_DIRECT without taking striping into account? Something like Oracle
certainly allows you to tune its database block size to match the
striping. If you are looking for performance, why wouldn't you do that?

> >> Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
> >> or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
> >> for pnfs-obj) should be equivalent [note: need errata to clarify the
> >> resulting stateid after NOMATCHING_LAYOUT].
> >> Is this the serialization "crap" you're talking about?
> >> What makes checking the conditions for returning NFS4ERR_DELAY to
> >> CB_LAYOUTRECALL so different from implementing a barrier and doing the
> >> returns asynchronously with the CB_LAYOUTRECALL?
> >
> > "CB_LAYOUTRECALL request processing MUST be processed in "seqid" order
> > at all times." (section 12.5.3).
> >
> > In other words, you cannot just 'do the returns asynchronously': the
> > CB_LAYOUTRECALL requests are required by the protocol to be processed in
> > order, which means that you must serialise those LAYOUTRETURN calls to
> > ensure that they all happen in the order the wretched server expects.
> >
> >
>
> To simplify this (presumably rare) case what I had in mind is returning
> NFS4ERR_DELAY if there's a conflicting layout recall in progress.

OK, so why not just go the whole hog and do that for all rare cases,
including the one where the server recalls a layout segment that we
happen to be doing I/O to?

The case we should be optimising for is the one where the layout is
recalled, and no I/O to that segment is in progress. For that case,
returning OK, then doing the LAYOUTRETURN instead of just returning
NOMATCHING_LAYOUT is clearly wrong: it adds a completely unnecessary
round trip to the server. Agreed?

As for the much rarer case of a recall of a layout that is in use, how
does LAYOUTRETURN speed things up? As far as I can see, the MDS is still
going to return NFS4ERR_DELAY to the client that requested the
conflicting LAYOUTGET. That client then has to resend this LAYOUTGET
request, at a time when the first client may or may not have returned
its layout segment. So how is LAYOUTRETURN going to make all this a fast
and scalable process?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2010-12-15 18:32:46

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 9/9] Revert "pnfs-submit: wave2: check that partial LAYOUTGET return is ignored"

This reverts commit 0980d08ae8e980c001a31f0f29f9f1b8b5da6f94.
---
fs/nfs/pnfs.c | 11 -----------
1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index b778032..653be24 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -898,17 +898,6 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
struct nfs_client *clp = NFS_SERVER(ino)->nfs_client;
int status = 0;

- /* Verify we got what we asked for.
- * Note that because the xdr parsing only accepts a single
- * element array, this can fail even if the server is behaving
- * correctly.
- */
- if (lgp->args.range.iomode > res->range.iomode ||
- res->range.offset != 0 ||
- res->range.length != NFS4_MAX_UINT64) {
- status = -EINVAL;
- goto out;
- }
/* Inject layout blob into I/O device driver */
lseg = NFS_SERVER(ino)->pnfs_curr_ld->alloc_lseg(lo, res);
if (!lseg || IS_ERR(lseg)) {
--
1.7.2.3


2010-12-15 18:32:07

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 6/9] Revert "pnfs-submit: wave2: remove cl_layoutrecalls list"

This reverts commit a7324f1761109d4563dc71b9e03e71440f9f1c4a.

Conflicts:

fs/nfs/callback_proc.c
fs/nfs/pnfs.c
---
fs/nfs/callback.h | 4 +-
fs/nfs/callback_proc.c | 165 +++++++++++++++++++++++++++++++++++++++------
fs/nfs/client.c | 1 +
fs/nfs/nfs4_fs.h | 1 -
fs/nfs/nfs4proc.c | 10 ++-
fs/nfs/pnfs.c | 37 ++++++----
fs/nfs/pnfs.h | 13 +++-
include/linux/nfs_fs_sb.h | 4 +
8 files changed, 193 insertions(+), 42 deletions(-)

diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index b8cafb5..7f55c7e 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -154,7 +154,6 @@ struct cb_layoutrecallargs {
union {
struct {
struct nfs_fh cbl_fh;
- struct inode *cbl_inode;
struct pnfs_layout_range cbl_range;
nfs4_stateid cbl_stateid;
};
@@ -165,6 +164,9 @@ struct cb_layoutrecallargs {
extern unsigned nfs4_callback_layoutrecall(
struct cb_layoutrecallargs *args,
void *dummy, struct cb_process_state *cps);
+extern bool matches_outstanding_recall(struct inode *ino,
+ struct pnfs_layout_range *range);
+extern void notify_drained(struct nfs_client *clp, u64 mask);

static inline void put_session_client(struct nfs4_session *session)
{
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index bc532b0..a9d162f 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -120,16 +120,82 @@ int nfs4_validate_delegation_stateid(struct nfs_delegation *delegation, const nf

#if defined(CONFIG_NFS_V4_1)

-static int initiate_layout_draining(struct nfs_client *clp,
- struct cb_layoutrecallargs *args)
+static bool
+_recall_matches_lget(struct pnfs_cb_lrecall_info *cb_info,
+ struct inode *ino, struct pnfs_layout_range *range)
+{
+ struct cb_layoutrecallargs *cb_args = &cb_info->pcl_args;
+
+ switch (cb_args->cbl_recall_type) {
+ case RETURN_ALL:
+ return true;
+ case RETURN_FSID:
+ return !memcmp(&NFS_SERVER(ino)->fsid, &cb_args->cbl_fsid,
+ sizeof(struct nfs_fsid));
+ case RETURN_FILE:
+ return (ino == cb_info->pcl_ino) &&
+ should_free_lseg(range, &cb_args->cbl_range);
+ default:
+ /* Should never hit here, as decode_layoutrecall_args()
+ * will verify cb_info from server.
+ */
+ BUG();
+ }
+}
+
+bool
+matches_outstanding_recall(struct inode *ino, struct pnfs_layout_range *range)
{
+ struct nfs_client *clp = NFS_SERVER(ino)->nfs_client;
+ struct pnfs_cb_lrecall_info *cb_info;
+ bool rv = false;
+
+ assert_spin_locked(&clp->cl_lock);
+ list_for_each_entry(cb_info, &clp->cl_layoutrecalls, pcl_list) {
+ if (_recall_matches_lget(cb_info, ino, range)) {
+ rv = true;
+ break;
+ }
+ }
+ return rv;
+}
+
+void notify_drained(struct nfs_client *clp, u64 mask)
+{
+ atomic_t **ptr = clp->cl_drain_notification;
+
+ /* clp lock not needed except to remove used up entries */
+ /* Should probably use functions defined in bitmap.h */
+ while (mask) {
+ if ((mask & 1) && (atomic_dec_and_test(*ptr))) {
+ struct pnfs_cb_lrecall_info *cb_info;
+
+ cb_info = container_of(*ptr,
+ struct pnfs_cb_lrecall_info,
+ pcl_count);
+ spin_lock(&clp->cl_lock);
+ /* Removing from the list unblocks LAYOUTGETs */
+ list_del(&cb_info->pcl_list);
+ clp->cl_cb_lrecall_count--;
+ clp->cl_drain_notification[1 << cb_info->pcl_notify_bit] = NULL;
+ spin_unlock(&clp->cl_lock);
+ kfree(cb_info);
+ }
+ mask >>= 1;
+ ptr++;
+ }
+}
+
+static int initiate_layout_draining(struct pnfs_cb_lrecall_info *cb_info)
+{
+ struct nfs_client *clp = cb_info->pcl_clp;
struct pnfs_layout_hdr *lo;
int rv = NFS4ERR_NOMATCHING_LAYOUT;
+ struct cb_layoutrecallargs *args = &cb_info->pcl_args;

if (args->cbl_recall_type == RETURN_FILE) {
LIST_HEAD(free_me_list);

- args->cbl_inode = NULL;
spin_lock(&clp->cl_lock);
list_for_each_entry(lo, &clp->cl_layouts, layouts) {
if (nfs_compare_fh(&args->cbl_fh,
@@ -138,12 +204,16 @@ static int initiate_layout_draining(struct nfs_client *clp,
if (test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags))
rv = NFS4ERR_DELAY;
else {
+ /* FIXME I need to better understand igrab and
+ * does having a layout ref keep ino around?
+ * It should.
+ */
/* Without this, layout can be freed as soon
* as we release cl_lock. Matched in
* do_callback_layoutrecall.
*/
get_layout_hdr(lo);
- args->cbl_inode = lo->inode;
+ cb_info->pcl_ino = lo->inode;
rv = NFS4_OK;
}
break;
@@ -154,6 +224,8 @@ static int initiate_layout_draining(struct nfs_client *clp,
if (rv == NFS4_OK) {
lo->plh_block_lgets++;
nfs4_asynch_forget_layouts(lo, &args->cbl_range,
+ cb_info->pcl_notify_bit,
+ &cb_info->pcl_count,
&free_me_list);
}
pnfs_set_layout_stateid(lo, &args->cbl_stateid, true);
@@ -170,12 +242,18 @@ static int initiate_layout_draining(struct nfs_client *clp,
};

spin_lock(&clp->cl_lock);
+ /* Per RFC 5661, 12.5.5.2.1.5, bulk recall must be serialized */
+ if (!list_is_singular(&clp->cl_layoutrecalls)) {
+ spin_unlock(&clp->cl_lock);
+ return NFS4ERR_DELAY;
+ }
list_for_each_entry(lo, &clp->cl_layouts, layouts) {
if ((args->cbl_recall_type == RETURN_FSID) &&
memcmp(&NFS_SERVER(lo->inode)->fsid,
&args->cbl_fsid, sizeof(struct nfs_fsid)))
continue;
get_layout_hdr(lo);
+ /* We could list_del(&lo->layouts) here */
BUG_ON(!list_empty(&lo->plh_bulk_recall));
list_add(&lo->plh_bulk_recall, &recall_list);
}
@@ -184,7 +262,10 @@ static int initiate_layout_draining(struct nfs_client *clp,
&recall_list, plh_bulk_recall) {
spin_lock(&lo->inode->i_lock);
set_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags);
- nfs4_asynch_forget_layouts(lo, &range, &free_me_list);
+ nfs4_asynch_forget_layouts(lo, &range,
+ cb_info->pcl_notify_bit,
+ &cb_info->pcl_count,
+ &free_me_list);
list_del_init(&lo->plh_bulk_recall);
spin_unlock(&lo->inode->i_lock);
put_layout_hdr(lo);
@@ -198,29 +279,69 @@ static int initiate_layout_draining(struct nfs_client *clp,
static u32 do_callback_layoutrecall(struct nfs_client *clp,
struct cb_layoutrecallargs *args)
{
- u32 status, res = NFS4ERR_DELAY;
+ struct pnfs_cb_lrecall_info *new;
+ atomic_t **ptr;
+ int bit_num;
+ u32 res;

dprintk("%s enter, type=%i\n", __func__, args->cbl_recall_type);
- if (test_and_set_bit(NFS4CLNT_LAYOUTRECALL, &clp->cl_state))
+ new = kmalloc(sizeof(*new), GFP_KERNEL);
+ if (!new) {
+ res = NFS4ERR_DELAY;
goto out;
- atomic_inc(&clp->cl_recall_count);
- status = initiate_layout_draining(clp, args);
- if (atomic_dec_and_test(&clp->cl_recall_count))
- res = NFS4ERR_NOMATCHING_LAYOUT;
- else
+ }
+ memcpy(&new->pcl_args, args, sizeof(*args));
+ atomic_set(&new->pcl_count, 1);
+ new->pcl_clp = clp;
+ new->pcl_ino = NULL;
+ spin_lock(&clp->cl_lock);
+ if (clp->cl_cb_lrecall_count >= PNFS_MAX_CB_LRECALLS) {
+ kfree(new);
res = NFS4ERR_DELAY;
- if (status)
- res = status;
- else if (args->cbl_recall_type == RETURN_FILE) {
- struct pnfs_layout_hdr *lo;
+ spin_unlock(&clp->cl_lock);
+ goto out;
+ }
+ clp->cl_cb_lrecall_count++;
+ /* Adding to the list will block conflicting LGET activity */
+ list_add_tail(&new->pcl_list, &clp->cl_layoutrecalls);
+ for (bit_num = 0, ptr = clp->cl_drain_notification; *ptr; ptr++)
+ bit_num++;
+ *ptr = &new->pcl_count;
+ new->pcl_notify_bit = bit_num;
+ spin_unlock(&clp->cl_lock);
+ res = initiate_layout_draining(new);
+ if (res || atomic_dec_and_test(&new->pcl_count)) {
+ spin_lock(&clp->cl_lock);
+ list_del(&new->pcl_list);
+ clp->cl_cb_lrecall_count--;
+ clp->cl_drain_notification[1 << bit_num] = NULL;
+ spin_unlock(&clp->cl_lock);
+ if (res == NFS4_OK) {
+ if (args->cbl_recall_type == RETURN_FILE) {
+ struct pnfs_layout_hdr *lo;
+
+ lo = NFS_I(new->pcl_ino)->layout;
+ spin_lock(&lo->inode->i_lock);
+ lo->plh_block_lgets--;
+ spin_unlock(&lo->inode->i_lock);
+ put_layout_hdr(lo);
+ }
+ res = NFS4ERR_NOMATCHING_LAYOUT;
+ }
+ kfree(new);
+ } else {
+ /* We are currently using a referenced layout */
+ if (args->cbl_recall_type == RETURN_FILE) {
+ struct pnfs_layout_hdr *lo;

- lo = NFS_I(args->cbl_inode)->layout;
- spin_lock(&lo->inode->i_lock);
- lo->plh_block_lgets--;
- spin_unlock(&lo->inode->i_lock);
- put_layout_hdr(lo);
+ lo = NFS_I(new->pcl_ino)->layout;
+ spin_lock(&lo->inode->i_lock);
+ lo->plh_block_lgets--;
+ spin_unlock(&lo->inode->i_lock);
+ put_layout_hdr(lo);
+ }
+ res = NFS4ERR_DELAY;
}
- clear_bit(NFS4CLNT_LAYOUTRECALL, &clp->cl_state);
out:
dprintk("%s returning %i\n", __func__, res);
return res;
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 9042a7a..f8e712f 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -158,6 +158,7 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
clp->cl_machine_cred = cred;
#if defined(CONFIG_NFS_V4_1)
INIT_LIST_HEAD(&clp->cl_layouts);
+ INIT_LIST_HEAD(&clp->cl_layoutrecalls);
#endif
nfs_fscache_get_client_cookie(clp);

diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index a917872..15fea61 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -44,7 +44,6 @@ enum nfs4_client_state {
NFS4CLNT_RECLAIM_REBOOT,
NFS4CLNT_RECLAIM_NOGRACE,
NFS4CLNT_DELEGRETURN,
- NFS4CLNT_LAYOUTRECALL,
NFS4CLNT_SESSION_RESET,
NFS4CLNT_RECALL_SLOT,
};
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5331f28..1c79c09 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5378,9 +5378,14 @@ static void nfs4_layoutget_done(struct rpc_task *task, void *calldata)

dprintk("--> %s\n", __func__);

- if (!nfs4_sequence_done(task, &lgp->res.seq_res))
+ if (!nfs4_sequence_done(task, &lgp->res.seq_res)) {
+ /* layout code relies on fact that in this case
+ * code falls back to tk_action=call_start, but not
+ * back to rpc_prepare_task, to keep plh_outstanding
+ * correct.
+ */
return;
-
+ }
switch (task->tk_status) {
case 0:
break;
@@ -5402,6 +5407,7 @@ static void nfs4_layoutget_release(void *calldata)
struct nfs4_layoutget *lgp = calldata;

dprintk("--> %s\n", __func__);
+ put_layout_hdr(NFS_I(lgp->args.inode)->layout);
if (lgp->res.layout.buf != NULL)
free_page((unsigned long) lgp->res.layout.buf);
put_nfs_open_context(lgp->args.ctx);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 1899dc6..8b44c41 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -277,17 +277,17 @@ init_lseg(struct pnfs_layout_hdr *lo, struct pnfs_layout_segment *lseg)
smp_mb();
set_bit(NFS_LSEG_VALID, &lseg->pls_flags);
lseg->layout = lo;
- lseg->pls_recall_count = 0;
+ lseg->pls_notify_mask = 0;
}

static void free_lseg(struct pnfs_layout_segment *lseg)
{
struct inode *ino = lseg->layout->inode;
- int count = lseg->pls_recall_count;
+ u64 mask = lseg->pls_notify_mask;

BUG_ON(atomic_read(&lseg->pls_refcount) != 0);
NFS_SERVER(ino)->pnfs_curr_ld->free_lseg(lseg);
- atomic_sub(count, &NFS_SERVER(ino)->nfs_client->cl_recall_count);
+ notify_drained(NFS_SERVER(ino)->nfs_client, mask);
/* Matched by get_layout_hdr_locked in pnfs_insert_layout */
put_layout_hdr(NFS_I(ino)->layout);
}
@@ -544,8 +544,10 @@ send_layoutget(struct pnfs_layout_hdr *lo,

BUG_ON(ctx == NULL);
lgp = kzalloc(sizeof(*lgp), GFP_KERNEL);
- if (lgp == NULL)
+ if (lgp == NULL) {
+ put_layout_hdr(lo);
return NULL;
+ }
lgp->args.minlength = NFS4_MAX_UINT64;
lgp->args.maxcount = PNFS_LAYOUT_MAXSIZE;
lgp->args.range.iomode = range->iomode;
@@ -569,6 +571,7 @@ send_layoutget(struct pnfs_layout_hdr *lo,

void nfs4_asynch_forget_layouts(struct pnfs_layout_hdr *lo,
struct pnfs_layout_range *range,
+ int notify_bit, atomic_t *notify_count,
struct list_head *tmp_list)
{
struct pnfs_layout_segment *lseg, *tmp;
@@ -576,8 +579,8 @@ void nfs4_asynch_forget_layouts(struct pnfs_layout_hdr *lo,
assert_spin_locked(&lo->inode->i_lock);
list_for_each_entry_safe(lseg, tmp, &lo->segs, fi_list)
if (should_free_lseg(&lseg->range, range)) {
- lseg->pls_recall_count++;
- atomic_inc(&NFS_SERVER(lo->inode)->nfs_client->cl_recall_count);
+ lseg->pls_notify_mask |= (1 << notify_bit);
+ atomic_inc(notify_count);
mark_lseg_invalid(lseg, tmp_list);
}
}
@@ -833,6 +836,13 @@ pnfs_update_layout(struct inode *ino,

if (!pnfs_enabled_sb(NFS_SERVER(ino)))
return NULL;
+ spin_lock(&clp->cl_lock);
+ if (matches_outstanding_recall(ino, &arg)) {
+ dprintk("%s matches recall, use MDS\n", __func__);
+ spin_unlock(&clp->cl_lock);
+ return NULL;
+ }
+ spin_unlock(&clp->cl_lock);
spin_lock(&ino->i_lock);
lo = pnfs_find_alloc_layout(ino);
if (lo == NULL) {
@@ -840,12 +850,6 @@ pnfs_update_layout(struct inode *ino,
goto out_unlock;
}

- /* Do we even need to bother with this? */
- if (test_bit(NFS4CLNT_LAYOUTRECALL, &clp->cl_state) ||
- test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
- dprintk("%s matches recall, use MDS\n", __func__);
- goto out_unlock;
- }
/* Check to see if the layout for the given range already exists */
lseg = pnfs_find_lseg(lo, &arg);
if (lseg)
@@ -883,7 +887,7 @@ pnfs_update_layout(struct inode *ino,
spin_unlock(&ino->i_lock);
}
atomic_dec(&lo->plh_outstanding);
- put_layout_hdr(lo);
+ spin_unlock(&ino->i_lock);
out:
dprintk("%s end, state 0x%lx lseg %p\n", __func__,
nfsi->layout->plh_flags, lseg);
@@ -927,11 +931,14 @@ pnfs_layout_process(struct nfs4_layoutget *lgp)
}

spin_lock(&ino->i_lock);
- if (test_bit(NFS4CLNT_LAYOUTRECALL, &clp->cl_state) ||
- test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags)) {
+ /* decrement needs to be done before call to pnfs_layoutget_blocked */
+ spin_lock(&clp->cl_lock);
+ if (matches_outstanding_recall(ino, &res->range)) {
+ spin_unlock(&clp->cl_lock);
dprintk("%s forget reply due to recall\n", __func__);
goto out_forget_reply;
}
+ spin_unlock(&clp->cl_lock);

if (pnfs_layoutgets_blocked(lo, &res->stateid, 1)) {
dprintk("%s forget reply due to state\n", __func__);
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 59eb0e8..b011b3c 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -31,6 +31,7 @@
#define FS_NFS_PNFS_H

#include <linux/nfs_page.h>
+#include "callback.h"

enum {
NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
@@ -42,7 +43,7 @@ struct pnfs_layout_segment {
atomic_t pls_refcount;
unsigned long pls_flags;
struct pnfs_layout_hdr *layout;
- int pls_recall_count;
+ u64 pls_notify_mask;
};

enum pnfs_try_status {
@@ -126,6 +127,15 @@ struct pnfs_device {
unsigned int pglen;
};

+struct pnfs_cb_lrecall_info {
+ struct list_head pcl_list; /* hook into cl_layoutrecalls list */
+ atomic_t pcl_count;
+ int pcl_notify_bit;
+ struct nfs_client *pcl_clp;
+ struct inode *pcl_ino;
+ struct cb_layoutrecallargs pcl_args;
+};
+
/*
* Device ID RCU cache. A device ID is unique per client ID and layout type.
*/
@@ -221,6 +231,7 @@ int pnfs_choose_layoutget_stateid(nfs4_stateid *dst,
struct nfs4_state *open_state);
void nfs4_asynch_forget_layouts(struct pnfs_layout_hdr *lo,
struct pnfs_layout_range *range,
+ int notify_bit, atomic_t *notify_count,
struct list_head *tmp_list);

static inline bool
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 5e97d69..b02f486 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -84,6 +84,10 @@ struct nfs_client {
struct nfs4_session *cl_session; /* sharred session */
struct list_head cl_layouts;
atomic_t cl_recall_count; /* no. of lsegs in recall */
+ struct list_head cl_layoutrecalls;
+ unsigned long cl_cb_lrecall_count;
+#define PNFS_MAX_CB_LRECALLS (64)
+ atomic_t *cl_drain_notification[PNFS_MAX_CB_LRECALLS];
struct pnfs_deviceid_cache *cl_devid_cache; /* pNFS deviceid cache */
#endif /* CONFIG_NFS_V4_1 */

--
1.7.2.3


2010-12-15 18:31:25

by Benny Halevy

[permalink] [raw]
Subject: [PATCH 3/9] Revert "pnfs-submit: wave2: remove all LAYOUTRETURN code"

This reverts commit 7fc5d08da0e6bcff7363199c3238f8f807663ab2.
---
fs/nfs/nfs4proc.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/nfs4xdr.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/pnfs.c | 64 +++++++++++++++++++++++++++-
fs/nfs/pnfs.h | 2 +
4 files changed, 292 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 176449f..35af296 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5554,6 +5554,124 @@ out:
return 0;
}

+static void
+nfs4_layoutreturn_prepare(struct rpc_task *task, void *calldata)
+{
+ struct nfs4_layoutreturn *lrp = calldata;
+
+ dprintk("--> %s\n", __func__);
+ if (lrp->args.return_type == RETURN_FILE) {
+ struct nfs_inode *nfsi = NFS_I(lrp->args.inode);
+
+ if (pnfs_return_layout_barrier(nfsi, &lrp->args.range)) {
+ dprintk("%s: waiting on barrier\n", __func__);
+ rpc_sleep_on(&nfsi->lo_rpcwaitq, task, NULL);
+ return;
+ }
+ }
+ if (nfs41_setup_sequence(lrp->clp->cl_session, &lrp->args.seq_args,
+ &lrp->res.seq_res, 0, task))
+ return;
+ rpc_call_start(task);
+}
+
+static void nfs4_layoutreturn_done(struct rpc_task *task, void *calldata)
+{
+ struct nfs4_layoutreturn *lrp = calldata;
+ struct nfs_server *server;
+
+ dprintk("--> %s\n", __func__);
+
+ if (!nfs4_sequence_done(task, &lrp->res.seq_res))
+ return;
+
+ if (lrp->args.return_type == RETURN_FILE)
+ server = NFS_SERVER(lrp->args.inode);
+ else
+ server = NULL;
+ if (nfs4_async_handle_error(task, server, NULL, lrp->clp) == -EAGAIN) {
+ nfs_restart_rpc(task, lrp->clp);
+ return;
+ }
+ if ((task->tk_status == 0) && (lrp->args.return_type == RETURN_FILE)) {
+ struct pnfs_layout_hdr *lo = NFS_I(lrp->args.inode)->layout;
+
+ spin_lock(&lo->inode->i_lock);
+ if (lrp->res.lrs_present)
+ pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
+ else
+ BUG_ON(!list_empty(&lo->segs));
+ spin_unlock(&lo->inode->i_lock);
+ }
+ dprintk("<-- %s\n", __func__);
+}
+
+static void nfs4_layoutreturn_release(void *calldata)
+{
+ struct nfs4_layoutreturn *lrp = calldata;
+
+ dprintk("--> %s return_type %d\n", __func__, lrp->args.return_type);
+ if (lrp->args.return_type == RETURN_FILE) {
+ struct inode *ino = lrp->args.inode;
+ struct pnfs_layout_hdr *lo = NFS_I(ino)->layout;
+
+ spin_lock(&ino->i_lock);
+ lo->plh_block_lgets--;
+ atomic_dec(&lo->plh_outstanding);
+ spin_unlock(&ino->i_lock);
+ put_layout_hdr(ino);
+ }
+ kfree(calldata);
+ dprintk("<-- %s\n", __func__);
+}
+
+static const struct rpc_call_ops nfs4_layoutreturn_call_ops = {
+ .rpc_call_prepare = nfs4_layoutreturn_prepare,
+ .rpc_call_done = nfs4_layoutreturn_done,
+ .rpc_release = nfs4_layoutreturn_release,
+};
+
+int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp, bool issync)
+{
+ struct rpc_task *task;
+ struct rpc_message msg = {
+ .rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTRETURN],
+ .rpc_argp = &lrp->args,
+ .rpc_resp = &lrp->res,
+ };
+ struct rpc_task_setup task_setup_data = {
+ .rpc_client = lrp->clp->cl_rpcclient,
+ .rpc_message = &msg,
+ .callback_ops = &nfs4_layoutreturn_call_ops,
+ .callback_data = lrp,
+ .flags = RPC_TASK_ASYNC,
+ };
+ int status = 0;
+
+ dprintk("--> %s\n", __func__);
+ if (lrp->args.return_type == RETURN_FILE) {
+ struct pnfs_layout_hdr *lo = NFS_I(lrp->args.inode)->layout;
+ /* FIXME we should test for BULK here */
+ spin_lock(&lo->inode->i_lock);
+ BUG_ON(lo->plh_block_lgets == 0);
+ atomic_inc(&lo->plh_outstanding);
+ spin_unlock(&lo->inode->i_lock);
+ }
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+ if (!issync)
+ goto out;
+ status = nfs4_wait_for_completion_rpc_task(task);
+ if (status != 0)
+ goto out;
+ status = task->tk_status;
+out:
+ dprintk("<-- %s\n", __func__);
+ rpc_put_task(task);
+ return status;
+}
+
static int
_nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
{
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 4564021..7f92bfa 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -329,6 +329,12 @@ static int nfs4_stat_to_errno(int);
op_encode_hdr_maxsz + \
encode_stateid_maxsz)
#define decode_layoutcommit_maxsz (3 + op_decode_hdr_maxsz)
+#define encode_layoutreturn_maxsz (8 + op_encode_hdr_maxsz + \
+ encode_stateid_maxsz + \
+ 1 /* FIXME: opaque lrf_body always empty at
+ *the moment */)
+#define decode_layoutreturn_maxsz (op_decode_hdr_maxsz + \
+ 1 + decode_stateid_maxsz)
#else /* CONFIG_NFS_V4_1 */
#define encode_sequence_maxsz 0
#define decode_sequence_maxsz 0
@@ -742,6 +748,14 @@ static int nfs4_stat_to_errno(int);
decode_putfh_maxsz + \
decode_layoutcommit_maxsz + \
decode_getattr_maxsz)
+#define NFS4_enc_layoutreturn_sz (compound_encode_hdr_maxsz + \
+ encode_sequence_maxsz + \
+ encode_putfh_maxsz + \
+ encode_layoutreturn_maxsz)
+#define NFS4_dec_layoutreturn_sz (compound_decode_hdr_maxsz + \
+ decode_sequence_maxsz + \
+ decode_putfh_maxsz + \
+ decode_layoutreturn_maxsz)
#define NFS4_enc_dswrite_sz (compound_encode_hdr_maxsz + \
encode_sequence_maxsz +\
encode_putfh_maxsz + \
@@ -1888,6 +1902,36 @@ encode_layoutcommit(struct xdr_stream *xdr,
return 0;
}

+static void
+encode_layoutreturn(struct xdr_stream *xdr,
+ const struct nfs4_layoutreturn_args *args,
+ struct compound_hdr *hdr)
+{
+ nfs4_stateid stateid;
+ __be32 *p;
+
+ p = reserve_space(xdr, 20);
+ *p++ = cpu_to_be32(OP_LAYOUTRETURN);
+ *p++ = cpu_to_be32(args->reclaim);
+ *p++ = cpu_to_be32(args->layout_type);
+ *p++ = cpu_to_be32(args->range.iomode);
+ *p = cpu_to_be32(args->return_type);
+ if (args->return_type == RETURN_FILE) {
+ p = reserve_space(xdr, 16 + NFS4_STATEID_SIZE);
+ p = xdr_encode_hyper(p, args->range.offset);
+ p = xdr_encode_hyper(p, args->range.length);
+ spin_lock(&args->inode->i_lock);
+ memcpy(stateid.data, NFS_I(args->inode)->layout->stateid.data,
+ NFS4_STATEID_SIZE);
+ spin_unlock(&args->inode->i_lock);
+ p = xdr_encode_opaque_fixed(p, &stateid.data,
+ NFS4_STATEID_SIZE);
+ p = reserve_space(xdr, 4);
+ *p = cpu_to_be32(0);
+ }
+ hdr->nops++;
+ hdr->replen += decode_layoutreturn_maxsz;
+}
#endif /* CONFIG_NFS_V4_1 */

/*
@@ -2778,6 +2822,26 @@ static int nfs4_xdr_enc_layoutcommit(struct rpc_rqst *req, uint32_t *p,
}

/*
+ * Encode LAYOUTRETURN request
+ */
+static int nfs4_xdr_enc_layoutreturn(struct rpc_rqst *req, uint32_t *p,
+ struct nfs4_layoutreturn_args *args)
+{
+ struct xdr_stream xdr;
+ struct compound_hdr hdr = {
+ .minorversion = nfs4_xdr_minorversion(&args->seq_args),
+ };
+
+ xdr_init_encode(&xdr, &req->rq_snd_buf, p);
+ encode_compound_hdr(&xdr, req, &hdr);
+ encode_sequence(&xdr, &args->seq_args, &hdr);
+ encode_putfh(&xdr, NFS_FH(args->inode), &hdr);
+ encode_layoutreturn(&xdr, args, &hdr);
+ encode_nops(&hdr);
+ return 0;
+}
+
+/*
* Encode a pNFS File Layout Data Server WRITE request
*/
static int nfs4_xdr_enc_dswrite(struct rpc_rqst *req, uint32_t *p,
@@ -5219,6 +5283,27 @@ out_overflow:
return -EIO;
}

+static int decode_layoutreturn(struct xdr_stream *xdr,
+ struct nfs4_layoutreturn_res *res)
+{
+ __be32 *p;
+ int status;
+
+ status = decode_op_hdr(xdr, OP_LAYOUTRETURN);
+ if (status)
+ return status;
+ p = xdr_inline_decode(xdr, 4);
+ if (unlikely(!p))
+ goto out_overflow;
+ res->lrs_present = be32_to_cpup(p);
+ if (res->lrs_present)
+ status = decode_stateid(xdr, &res->stateid);
+ return status;
+out_overflow:
+ print_overflow_msg(__func__, xdr);
+ return -EIO;
+}
+
static int decode_layoutcommit(struct xdr_stream *xdr,
struct rpc_rqst *req,
struct nfs4_layoutcommit_res *res)
@@ -6324,6 +6409,31 @@ out:
}

/*
+ * Decode LAYOUTRETURN response
+ */
+static int nfs4_xdr_dec_layoutreturn(struct rpc_rqst *rqstp, uint32_t *p,
+ struct nfs4_layoutreturn_res *res)
+{
+ struct xdr_stream xdr;
+ struct compound_hdr hdr;
+ int status;
+
+ xdr_init_decode(&xdr, &rqstp->rq_rcv_buf, p);
+ status = decode_compound_hdr(&xdr, &hdr);
+ if (status)
+ goto out;
+ status = decode_sequence(&xdr, &res->seq_res, rqstp);
+ if (status)
+ goto out;
+ status = decode_putfh(&xdr);
+ if (status)
+ goto out;
+ status = decode_layoutreturn(&xdr, res);
+out:
+ return status;
+}
+
+/*
* Decode LAYOUTCOMMIT response
*/
static int nfs4_xdr_dec_layoutcommit(struct rpc_rqst *rqstp, uint32_t *p,
@@ -6601,6 +6711,7 @@ struct rpc_procinfo nfs4_procedures[] = {
PROC(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo),
PROC(LAYOUTGET, enc_layoutget, dec_layoutget),
PROC(LAYOUTCOMMIT, enc_layoutcommit, dec_layoutcommit),
+ PROC(LAYOUTRETURN, enc_layoutreturn, dec_layoutreturn),
PROC(PNFS_WRITE, enc_dswrite, dec_dswrite),
PROC(PNFS_COMMIT, enc_dscommit, dec_dscommit),
#endif /* CONFIG_NFS_V4_1 */
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 93f50f4..904c110 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -585,9 +585,56 @@ void nfs4_asynch_forget_layouts(struct pnfs_layout_hdr *lo,
}
}

-/* Since we are using the forgetful model, nothing is sent over the wire.
- * However, we still must stop using any matching layouts.
+/* Return true if there is layout based io in progress in the given range.
+ * Assumes range has already been marked invalid, and layout marked to
+ * prevent any new lseg from being inserted.
*/
+bool
+pnfs_return_layout_barrier(struct nfs_inode *nfsi,
+ struct pnfs_layout_range *range)
+{
+ struct pnfs_layout_segment *lseg;
+ bool ret = false;
+
+ spin_lock(&nfsi->vfs_inode.i_lock);
+ list_for_each_entry(lseg, &nfsi->layout->segs, fi_list)
+ if (should_free_lseg(&lseg->range, range)) {
+ ret = true;
+ break;
+ }
+ spin_unlock(&nfsi->vfs_inode.i_lock);
+ dprintk("%s:Return %d\n", __func__, ret);
+ return ret;
+}
+
+static int
+return_layout(struct inode *ino, struct pnfs_layout_range *range, bool wait)
+{
+ struct nfs4_layoutreturn *lrp;
+ struct nfs_server *server = NFS_SERVER(ino);
+ int status = -ENOMEM;
+
+ dprintk("--> %s\n", __func__);
+
+ lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
+ if (lrp == NULL) {
+ put_layout_hdr(ino);
+ goto out;
+ }
+ lrp->args.reclaim = 0;
+ lrp->args.layout_type = server->pnfs_curr_ld->id;
+ lrp->args.return_type = RETURN_FILE;
+ lrp->args.range = *range;
+ lrp->args.inode = ino;
+ lrp->clp = server->nfs_client;
+
+ status = nfs4_proc_layoutreturn(lrp, wait);
+out:
+ dprintk("<-- %s status: %d\n", __func__, status);
+ return status;
+}
+
+/* Initiates a LAYOUTRETURN(FILE) */
int
_pnfs_return_layout(struct inode *ino, struct pnfs_layout_range *range,
bool wait)
@@ -612,10 +659,21 @@ _pnfs_return_layout(struct inode *ino, struct pnfs_layout_range *range,
goto out;
}
lo->plh_block_lgets++;
+ /* Reference matched in nfs4_layoutreturn_release */
+ get_layout_hdr(lo);
spin_unlock(&ino->i_lock);
pnfs_free_lseg_list(&tmp_list);

- /* Don't need to wait since this is followed by call to end_writeback */
+ if (layoutcommit_needed(nfsi)) {
+ status = pnfs_layoutcommit_inode(ino, wait);
+ if (status) {
+ /* Return layout even if layoutcommit fails */
+ dprintk("%s: layoutcommit failed, status=%d. "
+ "Returning layout anyway\n",
+ __func__, status);
+ }
+ }
+ status = return_layout(ino, &arg, wait);
out:
dprintk("<-- %s status: %d\n", __func__, status);
return status;
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 7a4559e..bbafee0 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -182,6 +182,7 @@ extern int nfs4_proc_getdeviceinfo(struct nfs_server *server,
extern int nfs4_proc_layoutget(struct nfs4_layoutget *lgp);
extern int nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data,
int issync);
+extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp, bool wait);

/* pnfs.c */
void get_layout_hdr(struct pnfs_layout_hdr *lo);
@@ -191,6 +192,7 @@ bool should_free_lseg(struct pnfs_layout_range *lseg_range,
struct pnfs_layout_segment *
pnfs_update_layout(struct inode *ino, struct nfs_open_context *ctx,
enum pnfs_iomode access_type);
+bool pnfs_return_layout_barrier(struct nfs_inode *, struct pnfs_layout_range *);
int _pnfs_return_layout(struct inode *, struct pnfs_layout_range *, bool wait);
void set_pnfs_layoutdriver(struct nfs_server *, u32 id);
void unset_pnfs_layoutdriver(struct nfs_server *);
--
1.7.2.3


2010-12-15 18:51:07

by Benny Halevy

[permalink] [raw]
Subject: Re: [PATCH 1/9] Revert "pnfs-submit: wave2: remove forgotten layoutreturn struct definitions"

On 2010-12-15 20:32, Trond Myklebust wrote:
> On Wed, 2010-12-15 at 20:30 +0200, Benny Halevy wrote:
>> This reverts commit 19e1e5ae1ec0a3f5d997a1a5d924d482e147bea2.
>> ---
>> include/linux/nfs4.h | 1 +
>> include/linux/nfs_xdr.h | 23 +++++++++++++++++++++++
>> 2 files changed, 24 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
>> index 8ca7700..55511e8 100644
>> --- a/include/linux/nfs4.h
>> +++ b/include/linux/nfs4.h
>> @@ -557,6 +557,7 @@ enum {
>> NFSPROC4_CLNT_RECLAIM_COMPLETE,
>> NFSPROC4_CLNT_LAYOUTGET,
>> NFSPROC4_CLNT_LAYOUTCOMMIT,
>> + NFSPROC4_CLNT_LAYOUTRETURN,
>> NFSPROC4_CLNT_GETDEVICEINFO,
>> NFSPROC4_CLNT_PNFS_WRITE,
>> NFSPROC4_CLNT_PNFS_COMMIT,
>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>> index 9d847ac..a651574 100644
>> --- a/include/linux/nfs_xdr.h
>> +++ b/include/linux/nfs_xdr.h
>> @@ -258,6 +258,29 @@ struct nfs4_layoutcommit_data {
>> int status;
>> };
>>
>> +struct nfs4_layoutreturn_args {
>> + __u32 reclaim;
>> + __u32 layout_type;
>> + __u32 return_type;
>> + struct pnfs_layout_range range;
>> + struct inode *inode;
>> + struct nfs4_sequence_args seq_args;
>> +};
>> +
>> +struct nfs4_layoutreturn_res {
>> + struct nfs4_sequence_res seq_res;
>> + u32 lrs_present;
>> + nfs4_stateid stateid;
>> +};
>> +
>> +struct nfs4_layoutreturn {
>> + struct nfs4_layoutreturn_args args;
>> + struct nfs4_layoutreturn_res res;
>> + struct rpc_cred *cred;
>> + struct nfs_client *clp;
>> + int rpc_status;
>> +};
>> +
>> struct nfs4_getdeviceinfo_args {
>> struct pnfs_device *pdev;
>> struct nfs4_sequence_args seq_args;
>
> Why? We don't need or even want layoutreturn. It adds too much
> serialisation crap.

Define "we" :)

First, the object layout driver relies on layout return for returning I/O error
information. On the common, successful path, with return_on_close (that Panasas
uses but others may not) I agree that CLOSE with the implicit layoutreturn
semantics we discussed should do a good job too (accompanied with a respective
LAYOUTCOMMIT if needed).

Then, when there's a large number of outstanding layout segments (again
prob. non-files layout presuming server implementations are going to utilize
whole-file layouts) proactive layoutreturn comes handy in capping the
state both the client and server keep - reducing time wasted on walking long
lists of items.

For CB_LAYOUTRECALL response the heart of the debate is around synchronizing
with layouts in-use and in-flight layoutgets. There, having the server poll
the client, who's retuning NFS4ERR_DELAY should essentially work but may be
inefficient and unreliable in use cases where contention is likely enough.
Eventually, when CB_LAYOUTRECALL is clear to go sending the LAYOUTRETURN
or replying with CB_NOMATCHING_LAYOUT (assuming no I/O error to report
for pnfs-obj) should be equivalent [note: need errata to clarify the
resulting stateid after NOMATCHING_LAYOUT].
Is this the serialization "crap" you're talking about?
What makes checking the conditions for returning NFS4ERR_DELAY to
CB_LAYOUTRECALL so different from implementing a barrier and doing the
returns asynchronously with the CB_LAYOUTRECALL?

Benny

>
>