2022-06-01 14:09:38

by Mkrtchyan, Tigran

Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

Hi Olga,

----- Original Message -----
> From: "Olga Kornievskaia" <[email protected]>
> To: "trondmy" <[email protected]>
> Cc: "Anna Schumaker" <[email protected]>, "linux-nfs" <[email protected]>
> Sent: Tuesday, 31 May, 2022 18:03:34
> Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

> On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
> <[email protected]> wrote:
>>
>> On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
>> > From: Olga Kornievskaia <[email protected]>
>> >
>> > In recent pNFS testing we've encountered an IO thread starvation
>> > problem during the time when the server returns the LAYOUTUNAVAILABLE
>> > error to the client. When that happens, each IO request tries to get
>> > a new layout, and the pnfs_update_layout() code ensures that only one
>> > LAYOUTGET RPC is outstanding; the rest wait. When the thread that
>> > holds the layout wakes up the waiters, only one gets to run, and it
>> > tends to be the one most recently added to the waiting queue. After
>> > receiving the LAYOUTUNAVAILABLE error, the client falls back to MDS
>> > writes, and as those writes complete and new writes are issued, those
>> > requests are added as waiters and get to run ahead of the earlier
>> > waiters, so the waiter that was put on the queue first never gets to
>> > run until the LAYOUTUNAVAILABLE condition resolves itself on the
>> > server.
>> >
>> > With the current code, if N IOs arrive asking for a layout, then N
>> > serial LAYOUTGETs will follow, each getting its own LAYOUTUNAVAILABLE
>> > error. Instead, this patch proposes to apply the error condition to
>> > ALL the waiters for the outstanding LAYOUTGET. Once the error is
>> > received, the code allows all N existing IOs to fall back to the MDS,
>> > while any newly arriving IOs are queued up, and one of them triggers
>> > a new LAYOUTGET.
>> >
>> > Signed-off-by: Olga Kornievskaia <[email protected]>
>> > ---
>> > fs/nfs/pnfs.c | 14 +++++++++++++-
>> > fs/nfs/pnfs.h | 2 ++
>> > 2 files changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> > index 68a87be3e6f9..5b7a679e32c8 100644
>> > --- a/fs/nfs/pnfs.c
>> > +++ b/fs/nfs/pnfs.c
>> > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
>> >  	if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
>> >  	    atomic_read(&lo->plh_outstanding) != 0) {
>> >  		spin_unlock(&ino->i_lock);
>> > +		atomic_inc(&lo->plh_waiting);
>> >  		lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
>> >  					!atomic_read(&lo->plh_outstanding)));
>> > -		if (IS_ERR(lseg))
>> > +		if (IS_ERR(lseg)) {
>> > +			atomic_dec(&lo->plh_waiting);
>> >  			goto out_put_layout_hdr;
>> > +		}
>> > +		if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
>> > +			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > +			lseg = NULL;
>> > +			if (atomic_dec_and_test(&lo->plh_waiting))
>> > +				clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> > +			goto out_put_layout_hdr;
>> > +		}
>> >  		pnfs_put_layout_hdr(lo);
>> >  		goto lookup_again;
>> >  	}
>> > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
>> >  		case -ERECALLCONFLICT:
>> >  		case -EAGAIN:
>> >  			break;
>> > +		case -ENODATA:
>> > +			set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
>> >  		default:
>> >  			if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
>> >  				pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
>> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
>> > index 07f11489e4e9..5c07da32320b 100644
>> > --- a/fs/nfs/pnfs.h
>> > +++ b/fs/nfs/pnfs.h
>> > @@ -105,6 +105,7 @@ enum {
>> >  	NFS_LAYOUT_FIRST_LAYOUTGET,	/* Serialize first layoutget */
>> >  	NFS_LAYOUT_INODE_FREEING,	/* The inode is being freed */
>> >  	NFS_LAYOUT_HASHED,		/* The layout visible */
>> > +	NFS_LAYOUT_DRAIN,
>> >  };
>> >
>> >  enum layoutdriver_policy_flags {
>> > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
>> >  struct pnfs_layout_hdr {
>> >  	refcount_t		plh_refcount;
>> >  	atomic_t		plh_outstanding; /* number of RPCs out */
>> > +	atomic_t		plh_waiting;
>> >  	struct list_head	plh_layouts;	/* other client layouts */
>> >  	struct list_head	plh_bulk_destroy;
>> >  	struct list_head	plh_segs;	/* layout segments list */
>>
>> According to the spec, the correct behaviour for handling
>> NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
>> and to fall back to doing I/O through the MDS. The error describes a
>> more or less permanent state of the server being unable to hand out a
>> layout for this file.
>> If the server wanted the clients to retry after a delay, it should be
>> returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.
>
> To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off pNFS
> for the inode not permanently but only for a period of time?
>
> It looks to me that for LAYOUTTRYLATER the client would face the same
> starvation problem in the same situation. I don't see anything marking
> the segment failed for that error. I believe the client returns no
> layout for that error and falls back to the MDS, but keeps asking for
> the layout for a period of time, again building up the queue of waiters

Your assumption doesn't match our observation. For files that are off-line
(DS down or file on tape) we return LAYOUTTRYLATER. Usually, the client keeps
re-trying LAYOUTGET until the file is available again. We use the flexfile
layout, as nfs4_file has less predictable behavior. For files that should be
served by the MDS we return LAYOUTUNAVAILABLE. Typically, those files are
quite small and served with a single READ request, so I haven't observed
repetitive LAYOUTGET calls.
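
Roughly, our layout policy is the following (an illustrative C-style
sketch only; the helpers file_is_offline() and file_served_by_mds() are
hypothetical, and the real dCache logic is Java and more involved):

	/* Illustrative sketch of the layout policy described above. */
	static int layoutget_policy(const struct file_state *fs)
	{
		if (file_is_offline(fs))	/* DS down or file on tape */
			return NFS4ERR_LAYOUTTRYLATER;	/* client retries */
		if (file_served_by_mds(fs))	/* small file, MDS I/O only */
			return NFS4ERR_LAYOUTUNAVAILABLE; /* no pNFS here */
		return NFS4_OK;			/* hand out a layout */
	}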

Best regards,
Tigran.

> that gets manipulated in a way that favors the last waiter added.
>
>
>> Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
>> just plain wrong: we just return no layout, and then let the next
>> caller to pnfs_update_layout() immediately try again.
>>
>> My problem with this patch is that it just falls back to doing I/O
>> through the MDS for the writes that are already queued in
>> pnfs_update_layout(). It perpetuates the current bad behaviour of
>> unnecessary pounding of the server with LAYOUTGET requests that are
>> going to fail with the exact same error.
>>
>> I'd therefore prefer to see us fix the real bug (i.e. the handling of
>> NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating issues
>> with the queuing. I already have 2 patches to deal with this.
>>
>> --
>> Trond Myklebust
>> Linux NFS client maintainer, Hammerspace
>> [email protected]
>>



2022-06-01 20:16:05

by Trond Myklebust

Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

On Wed, 2022-06-01 at 09:27 +0200, Mkrtchyan, Tigran wrote:
> Hi Olga,
>
> ----- Original Message -----
> > From: "Olga Kornievskaia" <[email protected]>
> > To: "trondmy" <[email protected]>
> > Cc: "Anna Schumaker" <[email protected]>, "linux-nfs"
> > <[email protected]>
> > Sent: Tuesday, 31 May, 2022 18:03:34
> > Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during
> > LAYOUTUNAVAILABLE error
>
>

<snip>

> > To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off
> > pNFS for the inode not permanently but only for a period of time?
> >
> > It looks to me that for LAYOUTTRYLATER the client would face the
> > same starvation problem in the same situation. I don't see anything
> > marking the segment failed for that error. I believe the client
> > returns no layout for that error and falls back to the MDS, but
> > keeps asking for the layout for a period of time, again building up
> > the queue of waiters
>
> Your assumption doesn't match our observation. For files that are
> off-line (DS down or file on tape) we return LAYOUTTRYLATER. Usually,
> the client keeps re-trying LAYOUTGET until the file is available
> again. We use the flexfile layout, as nfs4_file has less predictable
> behavior. For files that should be served by the MDS we return
> LAYOUTUNAVAILABLE. Typically, those files are quite small and served
> with a single READ request, so I haven't observed repetitive
> LAYOUTGET calls.

Right. If you're only doing READ, and this is a small file, then you
are unlikely to see any repetition of the LAYOUTGET calls like Olga
describes, because the client page cache will take care of serialising
the initial I/O (by means of the folio lock/page lock) and will cache
the results so that no further I/O is needed.
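
Very roughly, the buffered read path behaves like this (a simplified
sketch of the generic pattern, not the exact NFS read code; the
fill_folio_from_server() helper is a hypothetical stand-in for the real
read_folio path):

	/*
	 * Sketch: why a small cached READ triggers at most one LAYOUTGET.
	 * Only the thread that wins the folio lock performs the actual
	 * read (and hence issues the single LAYOUTGET); everyone else
	 * blocks on the lock and then finds the data already up to date.
	 */
	static int read_one_folio(struct file *file, struct folio *folio)
	{
		int err = 0;

		folio_lock(folio);
		if (!folio_test_uptodate(folio))
			err = fill_folio_from_server(file, folio);
		folio_unlock(folio);
		return err;
	}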

The main problems with NFS4ERR_LAYOUTUNAVAILABLE in the current pNFS
implementation will occur when reading large files and/or when writing
to the file. In those cases, we will send a LAYOUTGET each time we need
to schedule more I/O, because we don't cache the negative result of the
previous attempt.
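
The obvious fix is to cache that negative result, for instance by
marking the iomode as failed when the server returns
NFS4ERR_LAYOUTUNAVAILABLE. A rough, untested sketch against the existing
fail-bit machinery in the pnfs_update_layout() error path:

		case -ENODATA:
			/* The server returned NFS4ERR_LAYOUTUNAVAILABLE */
			pnfs_layout_set_fail_bit(lo,
					pnfs_iomode_to_fail_bit(iomode));
			break;

The fail bit already times out after the usual retry interval, so pNFS
would be retried for the inode periodically instead of on every new I/O
submission.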

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-06-02 10:12:27

by Mkrtchyan, Tigran

Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error


I have re-run the tests. So, indeed, if I wait long enough, the client falls back to IO
through the MDS and issues a new LAYOUTGET after the first successful READ operation.

The capture is available at https://sas.desy.de/index.php/s/StGeijXTGysdHa2

Best regards,
Tigran.


----- Original Message -----
> From: "trondmy" <[email protected]>
> To: "Tigran Mkrtchyan" <[email protected]>, "Olga Kornievskaia" <[email protected]>
> Cc: "linux-nfs" <[email protected]>, "Anna Schumaker" <[email protected]>
> Sent: Wednesday, 1 June, 2022 17:50:53
> Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during LAYOUTUNAVAILABLE error

> On Wed, 2022-06-01 at 09:27 +0200, Mkrtchyan, Tigran wrote:
>> Hi Olga,
>>
>> ----- Original Message -----
>> > From: "Olga Kornievskaia" <[email protected]>
>> > To: "trondmy" <[email protected]>
>> > Cc: "Anna Schumaker" <[email protected]>, "linux-nfs"
>> > <[email protected]>
>> > Sent: Tuesday, 31 May, 2022 18:03:34
>> > Subject: Re: [PATCH] pNFS: fix IO thread starvation problem during
>> > LAYOUTUNAVAILABLE error
>>
>>
>
> <snip>
>
>> > To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off
>> > pNFS for the inode not permanently but only for a period of time?
>> >
>> > It looks to me that for LAYOUTTRYLATER the client would face the
>> > same starvation problem in the same situation. I don't see anything
>> > marking the segment failed for that error. I believe the client
>> > returns no layout for that error and falls back to the MDS, but
>> > keeps asking for the layout for a period of time, again building up
>> > the queue of waiters
>>
>> Your assumption doesn't match our observation. For files that are
>> off-line (DS down or file on tape) we return LAYOUTTRYLATER. Usually,
>> the client keeps re-trying LAYOUTGET until the file is available
>> again. We use the flexfile layout, as nfs4_file has less predictable
>> behavior. For files that should be served by the MDS we return
>> LAYOUTUNAVAILABLE. Typically, those files are quite small and served
>> with a single READ request, so I haven't observed repetitive
>> LAYOUTGET calls.
>
> Right. If you're only doing READ, and this is a small file, then you
> are unlikely to see any repetition of the LAYOUTGET calls like Olga
> describes, because the client page cache will take care of serialising
> the initial I/O (by means of the folio lock/page lock) and will cache
> the results so that no further I/O is needed.
>
> The main problems with NFS4ERR_LAYOUTUNAVAILABLE in the current pNFS
> implementation will occur when reading large files and/or when writing
> to the file. In those cases, we will send a LAYOUTGET each time we need
> to schedule more I/O, because we don't cache the negative result of the
> previous attempt.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]

