LinuxLists.cc - handling ERR_SERVERFAULT on RESTOREFH

2020-04-28 18:15:37

Subject: handling ERR_SERVERFAULT on RESTOREFH

Hi folk,

Looking for guidance on what folks think. A client is sending a LINK
operation to the server. This compound after the LINK has RESTOREFH
and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But LINK is
done successfully. Client still fails the system call with EIO. We
have a hardline and "ln" saying hardlink failed.

Should the client not fail the system call in this case? The fact that
we couldn't get up-to-date attributes don't seem like the reason to
fail the system call?

Thank you.

2020-04-28 18:25:38

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 2:14 PM Olga Kornievskaia <[email protected]> wrote:
>
> Hi folk,
>
> Looking for guidance on what folks think. A client is sending a LINK
> operation to the server. This compound after the LINK has RESTOREFH
> and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But LINK is
> done successfully. Client still fails the system call with EIO. We
> have a hardline and "ln" saying hardlink failed.

correction: have a hardlink.

> Should the client not fail the system call in this case? The fact that
> we couldn't get up-to-date attributes don't seem like the reason to
> fail the system call?
>
> Thank you.

2020-04-28 18:47:52

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

Hi Olga,

On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> Hi folk,
>
> Looking for guidance on what folks think. A client is sending a LINK
> operation to the server. This compound after the LINK has RESTOREFH
> and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But LINK is
> done successfully. Client still fails the system call with EIO. We
> have a hardline and "ln" saying hardlink failed.
>
> Should the client not fail the system call in this case? The fact
> that
> we couldn't get up-to-date attributes don't seem like the reason to
> fail the system call?
>
> Thank you.

I don't really see this as worth fixing on the client. It is very
clearly a server bug.

Thanks
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2020-04-28 20:42:34

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <[email protected]> wrote:
>
> Hi Olga,
>
> On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > Hi folk,
> >
> > Looking for guidance on what folks think. A client is sending a LINK
> > operation to the server. This compound after the LINK has RESTOREFH
> > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But LINK is
> > done successfully. Client still fails the system call with EIO. We
> > have a hardline and "ln" saying hardlink failed.
> >
> > Should the client not fail the system call in this case? The fact
> > that
> > we couldn't get up-to-date attributes don't seem like the reason to
> > fail the system call?
> >
> > Thank you.
>
> I don't really see this as worth fixing on the client. It is very
> clearly a server bug.

Why is that a server bug? A server can legitimately have an issue
trying to execute an operation (RESTOREFH) and legitimately returning
an error.

NFS client also ignores errors of the returning GETATTR after the
RESTOREFH. So I'm not sure why we are then not ignoring errors (or
some errors) of the RESTOREFH.

> Thanks
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2020-04-28 21:34:07

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> [email protected]> wrote:
> > Hi Olga,
> >
> > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > Hi folk,
> > >
> > > Looking for guidance on what folks think. A client is sending a
> > > LINK
> > > operation to the server. This compound after the LINK has
> > > RESTOREFH
> > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
> > > LINK is
> > > done successfully. Client still fails the system call with EIO.
> > > We
> > > have a hardline and "ln" saying hardlink failed.
> > >
> > > Should the client not fail the system call in this case? The fact
> > > that
> > > we couldn't get up-to-date attributes don't seem like the reason
> > > to
> > > fail the system call?
> > >
> > > Thank you.
> >
> > I don't really see this as worth fixing on the client. It is very
> > clearly a server bug.
>
> Why is that a server bug? A server can legitimately have an issue
> trying to execute an operation (RESTOREFH) and legitimately returning
> an error.

If it is happening consistently on the server, then it is a bug, and it
gets reported by the client in the same way we always report
NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.

> NFS client also ignores errors of the returning GETATTR after the
> RESTOREFH. So I'm not sure why we are then not ignoring errors (or
> some errors) of the RESTOREFH.

We do need to check the value of RESTOREFH in order to figure out if we
can continue reading the XDR buffer to decode the file attributes. We
want to read those file attributes because we do expect the change
attribute, the ctime and the nlinks values to all change as a result of
the operation.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2020-04-28 23:07:10

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <[email protected]> wrote:
>
> On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > [email protected]> wrote:
> > > Hi Olga,
> > >
> > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > Hi folk,
> > > >
> > > > Looking for guidance on what folks think. A client is sending a
> > > > LINK
> > > > operation to the server. This compound after the LINK has
> > > > RESTOREFH
> > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
> > > > LINK is
> > > > done successfully. Client still fails the system call with EIO.
> > > > We
> > > > have a hardline and "ln" saying hardlink failed.
> > > >
> > > > Should the client not fail the system call in this case? The fact
> > > > that
> > > > we couldn't get up-to-date attributes don't seem like the reason
> > > > to
> > > > fail the system call?
> > > >
> > > > Thank you.
> > >
> > > I don't really see this as worth fixing on the client. It is very
> > > clearly a server bug.
> >
> > Why is that a server bug? A server can legitimately have an issue
> > trying to execute an operation (RESTOREFH) and legitimately returning
> > an error.
>
> If it is happening consistently on the server, then it is a bug, and it
> gets reported by the client in the same way we always report
> NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.

Yes but the client doesn't retry so it can't assess if it's
consistently happening or not. It can be a transient error (or ENOMEM)
that's later resolved.

> > NFS client also ignores errors of the returning GETATTR after the
> > RESTOREFH. So I'm not sure why we are then not ignoring errors (or
> > some errors) of the RESTOREFH.
>
> We do need to check the value of RESTOREFH in order to figure out if we
> can continue reading the XDR buffer to decode the file attributes. We
> want to read those file attributes because we do expect the change
> attribute, the ctime and the nlinks values to all change as a result of
> the operation.

I have nothing against decoding the error and using it in a decision
to keep decoding. But the client doesn't have to propagate the
RESTOREFH error to the application?

In all other non-idempotent operations that have other operations (ie
GETATTR) following them, the client ignores the errors. Btw I just
noticed that on OPEN compound, since we ignore decode error from the
GETATTR, it would continue decoding LAYOUTGET...

CREATE has problem if the following GETFH will return EDELAY. Client
doesn't deal with retrying a part of the compound. It retries the
whole compound. It leads to an error (since non-idempotent operation
is retried). But I guess that's a 2nd issue (or a 3rd if we could the
decoding layoutget)....

All this is under the umbrella of how to handle errors on
non-idempotent operations in a compound....

>
> Cheers
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2020-04-28 23:42:42

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> [email protected]> wrote:
> > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > > Hi Olga,
> > > >
> > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > Hi folk,
> > > > >
> > > > > Looking for guidance on what folks think. A client is sending
> > > > > a
> > > > > LINK
> > > > > operation to the server. This compound after the LINK has
> > > > > RESTOREFH
> > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
> > > > > LINK is
> > > > > done successfully. Client still fails the system call with
> > > > > EIO.
> > > > > We
> > > > > have a hardline and "ln" saying hardlink failed.
> > > > >
> > > > > Should the client not fail the system call in this case? The
> > > > > fact
> > > > > that
> > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > reason
> > > > > to
> > > > > fail the system call?
> > > > >
> > > > > Thank you.
> > > >
> > > > I don't really see this as worth fixing on the client. It is
> > > > very
> > > > clearly a server bug.
> > >
> > > Why is that a server bug? A server can legitimately have an issue
> > > trying to execute an operation (RESTOREFH) and legitimately
> > > returning
> > > an error.
> >
> > If it is happening consistently on the server, then it is a bug,
> > and it
> > gets reported by the client in the same way we always report
> > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
>
> Yes but the client doesn't retry so it can't assess if it's
> consistently happening or not. It can be a transient error (or
> ENOMEM)
> that's later resolved.

If the server wants to signal a transient error, it should send
NFS4ERR_DELAY.

> > > NFS client also ignores errors of the returning GETATTR after the
> > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > (or
> > > some errors) of the RESTOREFH.
> >
> > We do need to check the value of RESTOREFH in order to figure out
> > if we
> > can continue reading the XDR buffer to decode the file attributes.
> > We
> > want to read those file attributes because we do expect the change
> > attribute, the ctime and the nlinks values to all change as a
> > result of
> > the operation.
>
> I have nothing against decoding the error and using it in a decision
> to keep decoding. But the client doesn't have to propagate the
> RESTOREFH error to the application?
>
> In all other non-idempotent operations that have other operations (ie
> GETATTR) following them, the client ignores the errors. Btw I just
> noticed that on OPEN compound, since we ignore decode error from the
> GETATTR, it would continue decoding LAYOUTGET...
>
> CREATE has problem if the following GETFH will return EDELAY. Client
> doesn't deal with retrying a part of the compound. It retries the
> whole compound. It leads to an error (since non-idempotent operation
> is retried). But I guess that's a 2nd issue (or a 3rd if we could the
> decoding layoutget)....
>
> All this is under the umbrella of how to handle errors on
> non-idempotent operations in a compound....

There is no point in trying to handle errors that make no sense. If the
server has a bug, then let's expose it instead of trying to hide it in
the sofa cushions.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2020-04-28 23:56:54

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, 2020-04-28 at 19:41 -0400, Trond Myklebust wrote:
> On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > [email protected]> wrote:
> > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > > Hi Olga,
> > > > >
> > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > > Hi folk,
> > > > > >
> > > > > > Looking for guidance on what folks think. A client is
> > > > > > sending
> > > > > > a
> > > > > > LINK
> > > > > > operation to the server. This compound after the LINK has
> > > > > > RESTOREFH
> > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH.
> > > > > > But
> > > > > > LINK is
> > > > > > done successfully. Client still fails the system call with
> > > > > > EIO.
> > > > > > We
> > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > >
> > > > > > Should the client not fail the system call in this case?
> > > > > > The
> > > > > > fact
> > > > > > that
> > > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > > reason
> > > > > > to
> > > > > > fail the system call?
> > > > > >
> > > > > > Thank you.
> > > > >
> > > > > I don't really see this as worth fixing on the client. It is
> > > > > very
> > > > > clearly a server bug.
> > > >
> > > > Why is that a server bug? A server can legitimately have an
> > > > issue
> > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > returning
> > > > an error.
> > >
> > > If it is happening consistently on the server, then it is a bug,
> > > and it
> > > gets reported by the client in the same way we always report
> > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> >
> > Yes but the client doesn't retry so it can't assess if it's
> > consistently happening or not. It can be a transient error (or
> > ENOMEM)
> > that's later resolved.
>
> If the server wants to signal a transient error, it should send
> NFS4ERR_DELAY.
>
> > > > NFS client also ignores errors of the returning GETATTR after
> > > > the
> > > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > > (or
> > > > some errors) of the RESTOREFH.
> > >
> > > We do need to check the value of RESTOREFH in order to figure out
> > > if we
> > > can continue reading the XDR buffer to decode the file
> > > attributes.
> > > We
> > > want to read those file attributes because we do expect the
> > > change
> > > attribute, the ctime and the nlinks values to all change as a
> > > result of
> > > the operation.
> >
> > I have nothing against decoding the error and using it in a
> > decision
> > to keep decoding. But the client doesn't have to propagate the
> > RESTOREFH error to the application?
> >
> > In all other non-idempotent operations that have other operations
> > (ie
> > GETATTR) following them, the client ignores the errors. Btw I just
> > noticed that on OPEN compound, since we ignore decode error from
> > the
> > GETATTR, it would continue decoding LAYOUTGET...
> >
> > CREATE has problem if the following GETFH will return EDELAY.
> > Client
> > doesn't deal with retrying a part of the compound. It retries the
> > whole compound. It leads to an error (since non-idempotent
> > operation
> > is retried). But I guess that's a 2nd issue (or a 3rd if we could
> > the
> > decoding layoutget)....
> >
> > All this is under the umbrella of how to handle errors on
> > non-idempotent operations in a compound....
>
> There is no point in trying to handle errors that make no sense. If
> the
> server has a bug, then let's expose it instead of trying to hide it
> in
> the sofa cushions.

It basically boils down to this:
* I do not want to have to maintain a list of server bugs in the
client.
* If I make a client change, I don't want to have to consider whether
or not it causes a regression against some server that only 10
people are still using, and that never got a fix for a bug because
"the clients handle it".

We should fix client bugs when it is clear that they are client bugs.
We should push back hard on server bugs because the client is already a
complex enough beast.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2020-04-29 02:10:00

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust <[email protected]> wrote:
>
> On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > [email protected]> wrote:
> > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > > Hi Olga,
> > > > >
> > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > > Hi folk,
> > > > > >
> > > > > > Looking for guidance on what folks think. A client is sending
> > > > > > a
> > > > > > LINK
> > > > > > operation to the server. This compound after the LINK has
> > > > > > RESTOREFH
> > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
> > > > > > LINK is
> > > > > > done successfully. Client still fails the system call with
> > > > > > EIO.
> > > > > > We
> > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > >
> > > > > > Should the client not fail the system call in this case? The
> > > > > > fact
> > > > > > that
> > > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > > reason
> > > > > > to
> > > > > > fail the system call?
> > > > > >
> > > > > > Thank you.
> > > > >
> > > > > I don't really see this as worth fixing on the client. It is
> > > > > very
> > > > > clearly a server bug.
> > > >
> > > > Why is that a server bug? A server can legitimately have an issue
> > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > returning
> > > > an error.
> > >
> > > If it is happening consistently on the server, then it is a bug,
> > > and it
> > > gets reported by the client in the same way we always report
> > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> >
> > Yes but the client doesn't retry so it can't assess if it's
> > consistently happening or not. It can be a transient error (or
> > ENOMEM)
> > that's later resolved.
>
> If the server wants to signal a transient error, it should send
> NFS4ERR_DELAY.

ERR_DELAY not an allowed error for the RESTOREFH. But let's say, the
server does return it, then client is not following the spec because
if it'll get this error, it will retry the whole compound (causing a
different error of redoing a non-idempotent operation). The spec says
client is responsible for handling partially completed compound. The
client should only retry the failed operations in a compound, I don't
see that client does that.

> > > > NFS client also ignores errors of the returning GETATTR after the
> > > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > > (or
> > > > some errors) of the RESTOREFH.
> > >
> > > We do need to check the value of RESTOREFH in order to figure out
> > > if we
> > > can continue reading the XDR buffer to decode the file attributes.
> > > We
> > > want to read those file attributes because we do expect the change
> > > attribute, the ctime and the nlinks values to all change as a
> > > result of
> > > the operation.
> >
> > I have nothing against decoding the error and using it in a decision
> > to keep decoding. But the client doesn't have to propagate the
> > RESTOREFH error to the application?
> >
> > In all other non-idempotent operations that have other operations (ie
> > GETATTR) following them, the client ignores the errors. Btw I just
> > noticed that on OPEN compound, since we ignore decode error from the
> > GETATTR, it would continue decoding LAYOUTGET...
> >
> > CREATE has problem if the following GETFH will return EDELAY. Client
> > doesn't deal with retrying a part of the compound. It retries the
> > whole compound. It leads to an error (since non-idempotent operation
> > is retried). But I guess that's a 2nd issue (or a 3rd if we could the
> > decoding layoutget)....
> >
> > All this is under the umbrella of how to handle errors on
> > non-idempotent operations in a compound....
>
> There is no point in trying to handle errors that make no sense. If the
> server has a bug, then let's expose it instead of trying to hide it in
> the sofa cushions.

EDELAY on GETFH is a reasonable error for the server to return.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2020-04-29 02:13:12

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 7:56 PM Trond Myklebust <[email protected]> wrote:
>
> On Tue, 2020-04-28 at 19:41 -0400, Trond Myklebust wrote:
> > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > > [email protected]> wrote:
> > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > > [email protected]> wrote:
> > > > > > Hi Olga,
> > > > > >
> > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > > > Hi folk,
> > > > > > >
> > > > > > > Looking for guidance on what folks think. A client is
> > > > > > > sending
> > > > > > > a
> > > > > > > LINK
> > > > > > > operation to the server. This compound after the LINK has
> > > > > > > RESTOREFH
> > > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH.
> > > > > > > But
> > > > > > > LINK is
> > > > > > > done successfully. Client still fails the system call with
> > > > > > > EIO.
> > > > > > > We
> > > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > > >
> > > > > > > Should the client not fail the system call in this case?
> > > > > > > The
> > > > > > > fact
> > > > > > > that
> > > > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > > > reason
> > > > > > > to
> > > > > > > fail the system call?
> > > > > > >
> > > > > > > Thank you.
> > > > > >
> > > > > > I don't really see this as worth fixing on the client. It is
> > > > > > very
> > > > > > clearly a server bug.
> > > > >
> > > > > Why is that a server bug? A server can legitimately have an
> > > > > issue
> > > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > > returning
> > > > > an error.
> > > >
> > > > If it is happening consistently on the server, then it is a bug,
> > > > and it
> > > > gets reported by the client in the same way we always report
> > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> > >
> > > Yes but the client doesn't retry so it can't assess if it's
> > > consistently happening or not. It can be a transient error (or
> > > ENOMEM)
> > > that's later resolved.
> >
> > If the server wants to signal a transient error, it should send
> > NFS4ERR_DELAY.
> >
> > > > > NFS client also ignores errors of the returning GETATTR after
> > > > > the
> > > > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > > > (or
> > > > > some errors) of the RESTOREFH.
> > > >
> > > > We do need to check the value of RESTOREFH in order to figure out
> > > > if we
> > > > can continue reading the XDR buffer to decode the file
> > > > attributes.
> > > > We
> > > > want to read those file attributes because we do expect the
> > > > change
> > > > attribute, the ctime and the nlinks values to all change as a
> > > > result of
> > > > the operation.
> > >
> > > I have nothing against decoding the error and using it in a
> > > decision
> > > to keep decoding. But the client doesn't have to propagate the
> > > RESTOREFH error to the application?
> > >
> > > In all other non-idempotent operations that have other operations
> > > (ie
> > > GETATTR) following them, the client ignores the errors. Btw I just
> > > noticed that on OPEN compound, since we ignore decode error from
> > > the
> > > GETATTR, it would continue decoding LAYOUTGET...
> > >
> > > CREATE has problem if the following GETFH will return EDELAY.
> > > Client
> > > doesn't deal with retrying a part of the compound. It retries the
> > > whole compound. It leads to an error (since non-idempotent
> > > operation
> > > is retried). But I guess that's a 2nd issue (or a 3rd if we could
> > > the
> > > decoding layoutget)....
> > >
> > > All this is under the umbrella of how to handle errors on
> > > non-idempotent operations in a compound....
> >
> > There is no point in trying to handle errors that make no sense. If
> > the
> > server has a bug, then let's expose it instead of trying to hide it
> > in
> > the sofa cushions.
>
> It basically boils down to this:
> * I do not want to have to maintain a list of server bugs in the
> client.

I also believe that client shouldn't be coded to a broken server. But
in some of those cases, the client is not spec compliant, how is that
a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
to make of it. I think it's more of a spec failure to address. It
seems that server isn't allowed to fail after executing a
non-idempotent operation but that's a hard requirement. I still think
that client's best set of action is to ignore errors on RESTOREFH.

> * If I make a client change, I don't want to have to consider whether
> or not it causes a regression against some server that only 10
> people are still using, and that never got a fix for a bug because
> "the clients handle it".
>
> We should fix client bugs when it is clear that they are client bugs.
> We should push back hard on server bugs because the client is already a
> complex enough beast.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2020-04-29 15:01:03

by Tom Talpey

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On 4/28/2020 10:06 PM, Olga Kornievskaia wrote:
> On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust <[email protected]> wrote:
>>
>> On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
>>> On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
>>> [email protected]> wrote:
>>>> On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
>>>>> On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
>>>>> [email protected]> wrote:
>>>>>> Hi Olga,
>>>>>>
>>>>>> On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
>>>>>>> Hi folk,
>>>>>>>
>>>>>>> Looking for guidance on what folks think. A client is sending
>>>>>>> a
>>>>>>> LINK
>>>>>>> operation to the server. This compound after the LINK has
>>>>>>> RESTOREFH
>>>>>>> and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
>>>>>>> LINK is
>>>>>>> done successfully. Client still fails the system call with
>>>>>>> EIO.
>>>>>>> We
>>>>>>> have a hardline and "ln" saying hardlink failed.
>>>>>>>
>>>>>>> Should the client not fail the system call in this case? The
>>>>>>> fact
>>>>>>> that
>>>>>>> we couldn't get up-to-date attributes don't seem like the
>>>>>>> reason
>>>>>>> to
>>>>>>> fail the system call?
>>>>>>>
>>>>>>> Thank you.
>>>>>>
>>>>>> I don't really see this as worth fixing on the client. It is
>>>>>> very
>>>>>> clearly a server bug.
>>>>>
>>>>> Why is that a server bug? A server can legitimately have an issue
>>>>> trying to execute an operation (RESTOREFH) and legitimately
>>>>> returning
>>>>> an error.
>>>>
>>>> If it is happening consistently on the server, then it is a bug,
>>>> and it
>>>> gets reported by the client in the same way we always report
>>>> NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
>>>
>>> Yes but the client doesn't retry so it can't assess if it's
>>> consistently happening or not. It can be a transient error (or
>>> ENOMEM)
>>> that's later resolved.
>>
>> If the server wants to signal a transient error, it should send
>> NFS4ERR_DELAY.
>
> ERR_DELAY not an allowed error for the RESTOREFH. But let's say, the
> server does return it, then client is not following the spec because
> if it'll get this error, it will retry the whole compound (causing a
> different error of redoing a non-idempotent operation). The spec says
> client is responsible for handling partially completed compound. The
> client should only retry the failed operations in a compound, I don't
> see that client does that.
>
>>>>> NFS client also ignores errors of the returning GETATTR after the
>>>>> RESTOREFH. So I'm not sure why we are then not ignoring errors
>>>>> (or
>>>>> some errors) of the RESTOREFH.
>>>>
>>>> We do need to check the value of RESTOREFH in order to figure out
>>>> if we
>>>> can continue reading the XDR buffer to decode the file attributes.
>>>> We
>>>> want to read those file attributes because we do expect the change
>>>> attribute, the ctime and the nlinks values to all change as a
>>>> result of
>>>> the operation.
>>>
>>> I have nothing against decoding the error and using it in a decision
>>> to keep decoding. But the client doesn't have to propagate the
>>> RESTOREFH error to the application?
>>>
>>> In all other non-idempotent operations that have other operations (ie
>>> GETATTR) following them, the client ignores the errors. Btw I just
>>> noticed that on OPEN compound, since we ignore decode error from the
>>> GETATTR, it would continue decoding LAYOUTGET...
>>>
>>> CREATE has problem if the following GETFH will return EDELAY. Client
>>> doesn't deal with retrying a part of the compound. It retries the
>>> whole compound. It leads to an error (since non-idempotent operation
>>> is retried). But I guess that's a 2nd issue (or a 3rd if we could the
>>> decoding layoutget)....
>>>
>>> All this is under the umbrella of how to handle errors on
>>> non-idempotent operations in a compound....
>>
>> There is no point in trying to handle errors that make no sense. If the
>> server has a bug, then let's expose it instead of trying to hide it in
>> the sofa cushions.
>
> EDELAY on GETFH is a reasonable error for the server to return.

I don't disagree that this is a broken server behavior. But from the
protocol perspective, I want to make two observations.

1) The post-operation attributes are not atomic, therefore an attribute
failure does not imply the operation was unsuccessful.

2) The application did not necessarily request the attributes, this
was inserted by the client, right? So again, their success or failure
is not actually relevant to the application.

Tom.

2020-04-29 15:49:17

by J. Bruce Fields

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Tue, Apr 28, 2020 at 10:12:29PM -0400, Olga Kornievskaia wrote:
> I also believe that client shouldn't be coded to a broken server. But
> in some of those cases, the client is not spec compliant, how is that
> a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
> to make of it. I think it's more of a spec failure to address. It
> seems that server isn't allowed to fail after executing a
> non-idempotent operation but that's a hard requirement. I still think
> that client's best set of action is to ignore errors on RESTOREFH.

Maybe. But how is a server hitting SERVERFAULT on RESTOREFH, anyway?
That's pretty weird.

--b.

2020-04-29 15:50:11

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Wed, 2020-04-29 at 10:50 -0400, Tom Talpey wrote:
> On 4/28/2020 10:06 PM, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust <
> > [email protected]> wrote:
> > > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > > > [email protected]> wrote:
> > > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > > > [email protected]> wrote:
> > > > > > > Hi Olga,
> > > > > > >
> > > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia
> > > > > > > wrote:
> > > > > > > > Hi folk,
> > > > > > > >
> > > > > > > > Looking for guidance on what folks think. A client is
> > > > > > > > sending
> > > > > > > > a
> > > > > > > > LINK
> > > > > > > > operation to the server. This compound after the LINK
> > > > > > > > has
> > > > > > > > RESTOREFH
> > > > > > > > and GETATTR. Server returns SERVER_FAULT to on
> > > > > > > > RESTOREFH. But
> > > > > > > > LINK is
> > > > > > > > done successfully. Client still fails the system call
> > > > > > > > with
> > > > > > > > EIO.
> > > > > > > > We
> > > > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > > > >
> > > > > > > > Should the client not fail the system call in this
> > > > > > > > case? The
> > > > > > > > fact
> > > > > > > > that
> > > > > > > > we couldn't get up-to-date attributes don't seem like
> > > > > > > > the
> > > > > > > > reason
> > > > > > > > to
> > > > > > > > fail the system call?
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > >
> > > > > > > I don't really see this as worth fixing on the client. It
> > > > > > > is
> > > > > > > very
> > > > > > > clearly a server bug.
> > > > > >
> > > > > > Why is that a server bug? A server can legitimately have an
> > > > > > issue
> > > > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > > > returning
> > > > > > an error.
> > > > >
> > > > > If it is happening consistently on the server, then it is a
> > > > > bug,
> > > > > and it
> > > > > gets reported by the client in the same way we always report
> > > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> > > >
> > > > Yes but the client doesn't retry so it can't assess if it's
> > > > consistently happening or not. It can be a transient error (or
> > > > ENOMEM)
> > > > that's later resolved.
> > >
> > > If the server wants to signal a transient error, it should send
> > > NFS4ERR_DELAY.
> >
> > ERR_DELAY not an allowed error for the RESTOREFH. But let's say,
> > the
> > server does return it, then client is not following the spec
> > because
> > if it'll get this error, it will retry the whole compound (causing
> > a
> > different error of redoing a non-idempotent operation). The spec
> > says
> > client is responsible for handling partially completed compound.
> > The
> > client should only retry the failed operations in a compound, I
> > don't
> > see that client does that.
> >
> > > > > > NFS client also ignores errors of the returning GETATTR
> > > > > > after the
> > > > > > RESTOREFH. So I'm not sure why we are then not ignoring
> > > > > > errors
> > > > > > (or
> > > > > > some errors) of the RESTOREFH.
> > > > >
> > > > > We do need to check the value of RESTOREFH in order to figure
> > > > > out
> > > > > if we
> > > > > can continue reading the XDR buffer to decode the file
> > > > > attributes.
> > > > > We
> > > > > want to read those file attributes because we do expect the
> > > > > change
> > > > > attribute, the ctime and the nlinks values to all change as a
> > > > > result of
> > > > > the operation.
> > > >
> > > > I have nothing against decoding the error and using it in a
> > > > decision
> > > > to keep decoding. But the client doesn't have to propagate the
> > > > RESTOREFH error to the application?
> > > >
> > > > In all other non-idempotent operations that have other
> > > > operations (ie
> > > > GETATTR) following them, the client ignores the errors. Btw I
> > > > just
> > > > noticed that on OPEN compound, since we ignore decode error
> > > > from the
> > > > GETATTR, it would continue decoding LAYOUTGET...
> > > >
> > > > CREATE has problem if the following GETFH will return EDELAY.
> > > > Client
> > > > doesn't deal with retrying a part of the compound. It retries
> > > > the
> > > > whole compound. It leads to an error (since non-idempotent
> > > > operation
> > > > is retried). But I guess that's a 2nd issue (or a 3rd if we
> > > > could the
> > > > decoding layoutget)....
> > > >
> > > > All this is under the umbrella of how to handle errors on
> > > > non-idempotent operations in a compound....
> > >
> > > There is no point in trying to handle errors that make no sense.
> > > If the
> > > server has a bug, then let's expose it instead of trying to hide
> > > it in
> > > the sofa cushions.
> >
> > EDELAY on GETFH is a reasonable error for the server to return.
>
> I don't disagree that this is a broken server behavior. But from the
> protocol perspective, I want to make two observations.
>
> 1) The post-operation attributes are not atomic, therefore an
> attribute
> failure does not imply the operation was unsuccessful.
>
> 2) The application did not necessarily request the attributes, this
> was inserted by the client, right? So again, their success or failure
> is not actually relevant to the application.
>

Understood, however if a server vendor doesn't want to take
responsibility for their product bug, then I don't want the Linux
client to have to own their problem in perpetuity.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2020-04-29 16:23:14

by Olga Kornievskaia

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Wed, Apr 29, 2020 at 11:46 AM J. Bruce Fields <[email protected]> wrote:
>
> On Tue, Apr 28, 2020 at 10:12:29PM -0400, Olga Kornievskaia wrote:
> > I also believe that client shouldn't be coded to a broken server. But
> > in some of those cases, the client is not spec compliant, how is that
> > a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
> > to make of it. I think it's more of a spec failure to address. It
> > seems that server isn't allowed to fail after executing a
> > non-idempotent operation but that's a hard requirement. I still think
> > that client's best set of action is to ignore errors on RESTOREFH.
>
> Maybe. But how is a server hitting SERVERFAULT on RESTOREFH, anyway?
> That's pretty weird.

An example error is ENOMEM. A server is doing operations to lookup the
filehandle (due to it being some other place) and needs to allocate
memory. It's possible that resources are currently unavailable. Since
RESTOREFH doesn't allow EDELAY, server can only return SERVERFAULT.
But as I mentioned before, even if EDELAY was allowed, client only
resends the whole compound which is incorrect in case of
non-idempotent operations.

2020-04-29 17:02:45

by J. Bruce Fields

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Wed, Apr 29, 2020 at 12:22:16PM -0400, Olga Kornievskaia wrote:
> On Wed, Apr 29, 2020 at 11:46 AM J. Bruce Fields <[email protected]> wrote:
> >
> > On Tue, Apr 28, 2020 at 10:12:29PM -0400, Olga Kornievskaia wrote:
> > > I also believe that client shouldn't be coded to a broken server. But
> > > in some of those cases, the client is not spec compliant, how is that
> > > a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
> > > to make of it. I think it's more of a spec failure to address. It
> > > seems that server isn't allowed to fail after executing a
> > > non-idempotent operation but that's a hard requirement. I still think
> > > that client's best set of action is to ignore errors on RESTOREFH.
> >
> > Maybe. But how is a server hitting SERVERFAULT on RESTOREFH, anyway?
> > That's pretty weird.
>
> An example error is ENOMEM. A server is doing operations to lookup the
> filehandle (due to it being some other place) and needs to allocate
> memory. It's possible that resources are currently unavailable.

This is a filehandle that's been previously used in the compound. All
the resource use I can think of here (filehandle storage, xdr reply
buffer space...) is very predictable in this compound. If anything was
to cause trouble I'd expect it to be the GETATTR reply.

--b.

2020-05-01 02:06:39

by Tom Talpey

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On 4/29/2020 12:22 PM, Olga Kornievskaia wrote:
> On Wed, Apr 29, 2020 at 11:46 AM J. Bruce Fields <[email protected]> wrote:
>>
>> On Tue, Apr 28, 2020 at 10:12:29PM -0400, Olga Kornievskaia wrote:
>>> I also believe that client shouldn't be coded to a broken server. But
>>> in some of those cases, the client is not spec compliant, how is that
>>> a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
>>> to make of it. I think it's more of a spec failure to address. It
>>> seems that server isn't allowed to fail after executing a
>>> non-idempotent operation but that's a hard requirement. I still think
>>> that client's best set of action is to ignore errors on RESTOREFH.
>>
>> Maybe. But how is a server hitting SERVERFAULT on RESTOREFH, anyway?
>> That's pretty weird.
>
> An example error is ENOMEM. A server is doing operations to lookup the
> filehandle (due to it being some other place) and needs to allocate
> memory. It's possible that resources are currently unavailable. Since
> RESTOREFH doesn't allow EDELAY, server can only return SERVERFAULT.

Why does the server need to do that? Surely it can best know how and
when to reschedule a memory allocation, instead of whining about its
temporary failure to the client.

> But as I mentioned before, even if EDELAY was allowed, client only
> resends the whole compound which is incorrect in case of
> non-idempotent operations.

Indeed, that's a protocol imperative, which the client should obey
by "cracking" the compound to determine what to retry.

Tom.

2020-05-01 04:09:32

by Trond Myklebust

[permalink] [raw]

Subject: Re: handling ERR_SERVERFAULT on RESTOREFH

On Thu, 2020-04-30 at 22:05 -0400, Tom Talpey wrote:
> On 4/29/2020 12:22 PM, Olga Kornievskaia wrote:
> > On Wed, Apr 29, 2020 at 11:46 AM J. Bruce Fields <
> > [email protected]> wrote:
> > > On Tue, Apr 28, 2020 at 10:12:29PM -0400, Olga Kornievskaia
> > > wrote:
> > > > I also believe that client shouldn't be coded to a broken
> > > > server. But
> > > > in some of those cases, the client is not spec compliant, how
> > > > is that
> > > > a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure
> > > > what
> > > > to make of it. I think it's more of a spec failure to address.
> > > > It
> > > > seems that server isn't allowed to fail after executing a
> > > > non-idempotent operation but that's a hard requirement. I still
> > > > think
> > > > that client's best set of action is to ignore errors on
> > > > RESTOREFH.
> > >
> > > Maybe. But how is a server hitting SERVERFAULT on RESTOREFH,
> > > anyway?
> > > That's pretty weird.
> >
> > An example error is ENOMEM. A server is doing operations to lookup
> > the
> > filehandle (due to it being some other place) and needs to allocate
> > memory. It's possible that resources are currently unavailable.
> > Since
> > RESTOREFH doesn't allow EDELAY, server can only return SERVERFAULT.
>
> Why does the server need to do that? Surely it can best know how and
> when to reschedule a memory allocation, instead of whining about its
> temporary failure to the client.
>
> > But as I mentioned before, even if EDELAY was allowed, client only
> > resends the whole compound which is incorrect in case of
> > non-idempotent operations.
>
> Indeed, that's a protocol imperative, which the client should obey
> by "cracking" the compound to determine what to retry.

RFC5661:

15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006)

An error occurred on the server that does not map to any of the
specific legal NFSv4.1 protocol error values. The client should
translate this into an appropriate error. UNIX clients may choose
to
translate this to EIO.

Which I believe is what we do. As I said, this is a server bug that
needs to be fixed on the server and I see no need to change the client
behaviour for now.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]