On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
> From: Olga Kornievskaia <[email protected]>
>
> In recent pNFS testing we've encountered an IO thread starvation
> problem during the time when the server returns the LAYOUTUNAVAILABLE
> error to the client. When that happens, each IO request tries to get
> a new layout, and the pnfs_update_layout() code ensures that only one
> LAYOUTGET RPC is outstanding; the rest wait. When the thread that got
> the layout wakes up the waiters, only one of them gets to run, and it
> tends to be the one most recently added to the waiting queue. After
> receiving the LAYOUTUNAVAILABLE error the client falls back to MDS
> writes, and as those writes complete and new writes are issued, the
> new requests are added as waiters and get to run ahead of the earlier
> waiters, so the waiter that was put on the queue first never gets to
> run until the LAYOUTUNAVAILABLE condition resolves itself on the
> server.
>
> With the current code, if N IOs arrive asking for a layout, then
> there will be N serial LAYOUTGETs that follow, each getting its own
> LAYOUTUNAVAILABLE error. Instead, this patch proposes to apply the
> error condition to ALL the waiters for the outstanding LAYOUTGET.
> Once the error is received, the code allows all existing N IOs to
> fall back to the MDS, while any newly arriving IOs are queued up and
> one of them triggers a new LAYOUTGET.
>
> Signed-off-by: Olga Kornievskaia <[email protected]>
> ---
> fs/nfs/pnfs.c | 14 +++++++++++++-
> fs/nfs/pnfs.h | 2 ++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 68a87be3e6f9..5b7a679e32c8 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
>         if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
>             atomic_read(&lo->plh_outstanding) != 0) {
>                 spin_unlock(&ino->i_lock);
> +               atomic_inc(&lo->plh_waiting);
>                 lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
>                                         !atomic_read(&lo->plh_outstanding)));
> -               if (IS_ERR(lseg))
> +               if (IS_ERR(lseg)) {
> +                       atomic_dec(&lo->plh_waiting);
>                         goto out_put_layout_hdr;
> +               }
> +               if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
> +                       pnfs_layout_clear_fail_bit(lo,
> +                                       pnfs_iomode_to_fail_bit(iomode));
> +                       lseg = NULL;
> +                       if (atomic_dec_and_test(&lo->plh_waiting))
> +                               clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> +                       goto out_put_layout_hdr;
> +               }
>                 pnfs_put_layout_hdr(lo);
>                 goto lookup_again;
>         }
> @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
>         case -ERECALLCONFLICT:
>         case -EAGAIN:
>                 break;
> +       case -ENODATA:
> +               set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>         default:
>                 if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>                         pnfs_layout_clear_fail_bit(lo,
>                                         pnfs_iomode_to_fail_bit(iomode));
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 07f11489e4e9..5c07da32320b 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -105,6 +105,7 @@ enum {
>         NFS_LAYOUT_FIRST_LAYOUTGET,     /* Serialize first layoutget */
>         NFS_LAYOUT_INODE_FREEING,       /* The inode is being freed */
>         NFS_LAYOUT_HASHED,              /* The layout visible */
> +       NFS_LAYOUT_DRAIN,
>  };
>
>  enum layoutdriver_policy_flags {
> @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
>  struct pnfs_layout_hdr {
>         refcount_t              plh_refcount;
>         atomic_t                plh_outstanding; /* number of RPCs out */
> +       atomic_t                plh_waiting;
>         struct list_head        plh_layouts;     /* other client layouts */
>         struct list_head        plh_bulk_destroy;
>         struct list_head        plh_segs;        /* layout segments list */
According to the spec, the correct behaviour for handling
NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
and to fall back to doing I/O through the MDS. The error describes a
more or less permanent state of the server being unable to hand out a
layout for this file.
If the server wanted the clients to retry after a delay, it should be
returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.

Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
just plain wrong: we return no layout, and then let the next caller of
pnfs_update_layout() immediately try again.

My problem with this patch is that it just falls back to doing I/O
through the MDS for the writes that are already queued in
pnfs_update_layout(). It perpetuates the current bad behaviour of
unnecessarily pounding the server with LAYOUTGET requests that are
going to fail with the exact same error.

I'd therefore prefer to see us fix the real bug (i.e. the handling of
NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating issues
with the queuing. I already have 2 patches to deal with this.
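
For illustration only, here is a sketch (not those pending patches) of
what "stop trying to do pNFS to the inode" could look like in
pnfs_update_layout()'s error handling, where NFS4ERR_LAYOUTUNAVAILABLE
surfaces as -ENODATA; pnfs_layout_set_fail_bit() is assumed here to be
the setter counterpart of the pnfs_layout_clear_fail_bit() and
pnfs_iomode_to_fail_bit() helpers visible in the patch above:

	/* Illustrative sketch; pnfs_layout_set_fail_bit() assumed. */
	switch (PTR_ERR(lseg)) {
	case -ENODATA:	/* NFS4ERR_LAYOUTUNAVAILABLE */
		/*
		 * Record the failure on the layout header so that
		 * subsequent pnfs_update_layout() callers go straight
		 * to the MDS instead of sending another LAYOUTGET that
		 * will fail with the same error.
		 */
		pnfs_layout_set_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
		lseg = NULL;
		break;
	default:
		break;
	}
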
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
<[email protected]> wrote:
>
> According to the spec, the correct behaviour for handling
> NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
> and to fall back to doing I/O through the MDS. The error describes a
> more or less permanent state of the server being unable to hand out a
> layout for this file.
> If the server wanted the clients to retry after a delay, it should be
> returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.
To clarify, can you confirm whether LAYOUTUNAVAILABLE is supposed to
turn off pNFS to the inode permanently, or only for a period of time?
It looks to me that with LAYOUTTRYLATER the client would face the same
starvation problem in the same situation. I don't see anything marking
the segment failed for that error. I believe the client returns no
layout for that error and falls back to the MDS, but allows asking for
the layout again after a period of time, so we again end up with a
queue of waiters that is manipulated in a way that favors the last one
added.
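
For reference, here is the pre-patch wait/retry pattern, reconstructed
from the context lines of the hunk quoted at the top of the thread,
with explanatory comments added; it shows why neither error gives the
waiters any ordering guarantee:

	/*
	 * Every caller that finds a LAYOUTGET already in flight sleeps
	 * here.  On wake-up, all of the sleepers race back to
	 * lookup_again together with any brand-new callers that never
	 * slept, so an early waiter can keep losing the race for as
	 * long as the error condition persists on the server.
	 */
	if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
	    atomic_read(&lo->plh_outstanding) != 0) {
		spin_unlock(&ino->i_lock);
		lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
				!atomic_read(&lo->plh_outstanding)));
		if (IS_ERR(lseg))
			goto out_put_layout_hdr;
		pnfs_put_layout_hdr(lo);
		goto lookup_again;
	}
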
> Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
> just plain wrong: we return no layout, and then let the next caller of
> pnfs_update_layout() immediately try again.
>
> My problem with this patch is that it just falls back to doing I/O
> through the MDS for the writes that are already queued in
> pnfs_update_layout(). It perpetuates the current bad behaviour of
> unnecessarily pounding the server with LAYOUTGET requests that are
> going to fail with the exact same error.
>
> I'd therefore prefer to see us fix the real bug (i.e. the handling of
> NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating issues
> with the queuing. I already have 2 patches to deal with this.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>