Hi,
Sorry to bother you here.
I'm using NFS and realize it doesn't support opening a file with "O_DIRECT | O_APPEND".
After checking the source code,
I found it has one function that checks explicitly whether there is a combination flag of "O_APPEND | O_DIRECT".
If so, it will return invalid arguments.
int nfs_check_flags(int flags)
{
    if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
        return -EINVAL;
    return 0;
}
But I don't understand why NFS doesn't support this flag combination.
I'd appreciate it if someone could explain this to me.
Thanks in advance.
Best,
Tao
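To make the effect of the quoted check concrete, here is a minimal user-space sketch. It is not the kernel code path: demo_check_flags() and demo_open() are hypothetical names that only mirror the test quoted above, so the rejected flag combination can be seen without an NFS mount. It assumes Linux with glibc, where O_DIRECT needs _GNU_SOURCE.

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical user-space mirror of the check quoted above:
 * only the combination of both flags is rejected. */
int demo_check_flags(int flags)
{
    const int both = O_APPEND | O_DIRECT;

    if ((flags & both) == both)
        return -EINVAL;
    return 0;
}

/* Hypothetical wrapper: apply the check, then open the file.
 * flags must not contain O_CREAT, since no mode is passed.
 * Returns a file descriptor, or a negative errno value. */
int demo_open(const char *path, int flags)
{
    int err, fd;

    assert(path != NULL);
    err = demo_check_flags(flags);
    if (err)
        return err;
    fd = open(path, flags);
    return fd < 0 ? -errno : fd;
}
```

With this sketch, O_WRONLY | O_APPEND and O_WRONLY | O_DIRECT both pass the check, and only the two flags together yield -EINVAL, which matches the behaviour described in the message.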
On Thu, 2023-11-23 at 18:14 +0000, Tao Lyu wrote:
> Hi,
>
> Sorry to bother you here.
>
> I'm using NFS and realize it doesn't support opening a file with
> "O_DIRECT | O_APPEND".
>
> After checking the source code,
> I found it has one function that checks explicitly whether there is a
> combination flag of "O_APPEND | O_DIRECT".
> If so, it will return invalid arguments.
>
> int nfs_check_flags(int flags)
> {
> if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
> return -EINVAL;
>
> return 0;
> }
>
> But I don't understand why NFS doesn't support this flag combination.
> I'd appreciate it if someone could explain this to me.
Why do you need O_APPEND|O_DIRECT?
In order to implement O_APPEND|O_DIRECT, we would need to add an APPEND
operation, which does not exist in the NFS protocol. The WRITE
operation does not suffice, because it requires you to know the offset
at which you will be writing the data.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
>> Hi,
>>
>> Sorry to bother you here.
>>
>> I'm using NFS and realize it doesn't support opening a file with
>> "O_DIRECT | O_APPEND".
>>
>> After checking the source code,
>> I found it has one function that checks explicitly whether there is a
>> combination flag of "O_APPEND | O_DIRECT".
>> If so, it will return invalid arguments.
>>
>> int nfs_check_flags(int flags)
>> {
>>     if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
>>         return -EINVAL;
>>
>>     return 0;
>> }
>>
>> But I don't understand why NFS doesn't support this flag combination.
>> I'd appreciate it if someone could explain this to me.
>
>
> Why do you need O_APPEND|O_DIRECT?
>
> In order to implement O_APPEND|O_DIRECT, we would need to add an APPEND
> operation, which does not exist in the NFS protocol. The WRITE
> operation does not suffice, because it requires you to know the offset
> at which you will be writing the data.
Hi Trond,
Thank you so much for your reply.
O_APPEND | O_DIRECT can be used to bypass the client cache when multiple threads write data without caring about the order (e.g., logs).
Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
But the key point is that it looks like NFS already supports O_APPEND.
I can successfully open a file with "O_RDWR|O_APPEND".
My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
Thank you in advance for helping me.
Best,
Tao
On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
>
> O_APPEND | O_DIRECT can be used to bypass the client cache for multiple threads writing data without caring of the orders (e.g., logs).
>
> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
> But the key point is that looks like NFS has supported O_APPEND already.
> I can successfully open a file with "O_RDWR|O_APPEND".
>
> My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
Btw, I think an APPEND operation in NFS would be a very good idea, and
I'd love to work with interested parties in the IETF on it. Note that
we (Damien to be specific) plan to add support to Linux to also report
the actual offset an O_APPEND write wrote to through io_uring as we
have various use cases for out of place write data stores for that.
It would be great to also support that programming model over NFS.
> On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
>>
>> O_APPEND | O_DIRECT can be used to bypass the client cache for multiple threads writing data without caring of the orders (e.g., logs).
>>
>> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
>> But the key point is that looks like NFS has supported O_APPEND already.
>> I can successfully open a file with "O_RDWR|O_APPEND".
>>
>> My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
> Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
Hi Christoph,
Yes, it just doesn't work with O_DIRECT.
> Btw, I think an APPEND operation in NFS would be a very good idea, and
> I'd love to work with interested parties in the IETF on it. Not that
> we (Damien to be specific) plan to add support to Linux to also report
> the actual offset an O_APPEND write wrote to through io_uring as we
> have varios use cases for out of place write data stores for that.
> It would be great to also support that programming model over NFS.
> On Nov 27, 2023, at 11:36 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
>>
>> O_APPEND | O_DIRECT can be used to bypass the client cache for multiple threads writing data without caring of the orders (e.g., logs).
>>
>> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
>> But the key point is that looks like NFS has supported O_APPEND already.
>> I can successfully open a file with "O_RDWR|O_APPEND".
>>
>> My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
O_DIRECT is supposed to not depend on any cached information,
including the file size, which the client needs to know to
form an NFS WRITE with the correct offset to ensure it is an
appending write.
File sizes are managed on the server, so the server needs to
know that the client is requesting an appending write so it
knows where to put the payload.
> Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
>
> Btw, I think an APPEND operation in NFS would be a very good idea, and
> I'd love to work with interested parties in the IETF on it.
You can write and submit a personal draft that describes it; it
wouldn't need to be more than a few pages. The hard part of that
would be accumulating use case descriptions.
I think you could create a proof of concept by including a VERIFY
operation in front of the WRITE to ensure the WRITE occurs only
if the offset argument in the WRITE agrees with the file's size
on the server. If the VERIFY fails, the client grabs the updated
file size and tries again.
> Not that
> we (Damien to be specific) plan to add support to Linux to also report
> the actual offset an O_APPEND write wrote to through io_uring as we
> have varios use cases for out of place write data stores for that.
> It would be great to also support that programming model over NFS.
--
Chuck Lever
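The VERIFY-then-WRITE idea in this message can be sketched as a small single-process simulation. Nothing here talks to a real NFS server: struct sim_file, sim_verify_write(), sim_getattr_size() and sim_append() are hypothetical names, and the simulation treats the VERIFY and the WRITE as one step, which a real compound does not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sim_status { SIM_OK, SIM_NOT_SAME, SIM_NOSPC };

/* Simulated server-side file; the last byte is reserved for a NUL. */
struct sim_file {
    char   data[128];
    size_t size;
};

/* Models "VERIFY size == offset, then WRITE at offset". */
enum sim_status sim_verify_write(struct sim_file *f, size_t offset,
                                 const char *buf, size_t len)
{
    assert(f != NULL && buf != NULL);
    if (f->size != offset)
        return SIM_NOT_SAME;              /* the VERIFY step failed */
    if (len > sizeof f->data - 1 - f->size)
        return SIM_NOSPC;
    memcpy(f->data + offset, buf, len);
    f->size += len;
    f->data[f->size] = '\0';
    return SIM_OK;
}

/* Models fetching the current file size from the server. */
size_t sim_getattr_size(const struct sim_file *f)
{
    return f->size;
}

/* Client loop: try with the size the client believes is current,
 * refresh it whenever the VERIFY fails, and give up after max_tries.
 * Returns the offset the data landed at, or -1 on failure. */
long sim_append(struct sim_file *f, size_t cached_size,
                const char *buf, size_t len, int max_tries)
{
    size_t offset = cached_size;

    for (int i = 0; i < max_tries; i++) {
        enum sim_status st = sim_verify_write(f, offset, buf, len);

        if (st == SIM_OK)
            return (long)offset;
        if (st != SIM_NOT_SAME)
            return -1;
        offset = sim_getattr_size(f);     /* stale size: refresh, retry */
    }
    return -1;
}
```

A caller that passes a stale cached_size gets one failed attempt, a refresh, and then a successful write, and the return value tells it where the payload ended up. With many concurrent appenders the number of retries is unbounded in principle, which is why max_tries is a parameter here.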
On Mon, Nov 27, 2023 at 04:50:56PM +0000, Chuck Lever III wrote:
> > Btw, I think an APPEND operation in NFS would be a very good idea, and
> > I'd love to work with interested parties in the IETF on it.
>
> You can write and submit a personal draft that describes it; it
> wouldn't need to be more than a few pages. The hard part of that
> would be accumulating use case descriptions.
>
> I think you could create a proof of concept by including a VERIFY
> operation in front of the WRITE to ensure the WRITE occurs only
> if the offset argument in the WRITE agrees with the file's size
> on the server. If the VERIFY fails, the client grabs the updated
> file size and tries again.
That seems like exactly the wrong way around. The idea behind append
based models for write out of place storage is that you do not care
where it is written - you leave it to the server or storage device to
place it at the current append point. You just need to know where it
got placed after the fact for some of them (not for simple logs,
though).
> On Nov 27, 2023, at 11:55 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Nov 27, 2023 at 04:50:56PM +0000, Chuck Lever III wrote:
>>> Btw, I think an APPEND operation in NFS would be a very good idea, and
>>> I'd love to work with interested parties in the IETF on it.
>>
>> You can write and submit a personal draft that describes it; it
>> wouldn't need to be more than a few pages. The hard part of that
>> would be accumulating use case descriptions.
>>
>> I think you could create a proof of concept by including a VERIFY
>> operation in front of the WRITE to ensure the WRITE occurs only
>> if the offset argument in the WRITE agrees with the file's size
>> on the server. If the VERIFY fails, the client grabs the updated
>> file size and tries again.
>
> That seems like exactly the wrong idea around. The idea behind append
> based models for write out of place storage is that you do not care
> where it is written - you leave it to the server or storage device to
> place it at the current append point. You just need to know where it
> got placed after the fact for some of them (not for simply logs,
> though).
I said "proof of concept" -- obviously you don't want this kind of
racy arrangement as a long-term solution, you just want something
that works with current server implementations for experimentation.
And, if the above WRITE succeeds, the client would know exactly
where the server placed the payload in the file.
--
Chuck Lever
On Mon, 2023-11-27 at 08:36 -0800, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
> >
> > O_APPEND | O_DIRECT can be used to bypass the client cache for
> > multiple threads writing data without caring of the orders (e.g.,
> > logs).
> >
> > Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
> > But the key point is that looks like NFS has supported O_APPEND
> > already.
> > I can successfully open a file with "O_RDWR|O_APPEND".
> >
> > My confusion is why NFS supports O_RDWR and O_APPEND individually
> > but does not support this combination.
>
> Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
>
> Btw, I think an APPEND operation in NFS would be a very good idea,
> and
> I'd love to work with interested parties in the IETF on it. Not that
> we (Damien to be specific) plan to add support to Linux to also
> report
> the actual offset an O_APPEND write wrote to through io_uring as we
> have varios use cases for out of place write data stores for that.
> It would be great to also support that programming model over NFS.
>
Note that APPEND would only really work with O_DIRECT, since it is
anathema to cached I/O to not be able to control the placement of the
data. However it is useful for the case where you want to write logs.
In addition, the model will always break down if someone decides they
want to write a log entry of size > wsize. Once you have to split up
the data, you (obviously) lose the atomicity you need in order to write
a contiguous record.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
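The wsize point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: sim_server_append() stands in for a hypothetical server-side APPEND that places each request at the current end of file, and sim_two_writers() models two clients whose requests happen to alternate at the server. None of these names exist in the NFS protocol or in the Linux client.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simulated log file; the last byte is reserved for a NUL. */
struct sim_log {
    char   data[64];
    size_t size;
};

/* Hypothetical APPEND: the server picks the offset (current end).
 * Returns 0 on success, -1 if the simulated file is full. */
int sim_server_append(struct sim_log *log, const char *buf, size_t len)
{
    assert(log != NULL && buf != NULL);
    if (len > sizeof log->data - 1 - log->size)
        return -1;
    memcpy(log->data + log->size, buf, len);
    log->size += len;
    log->data[log->size] = '\0';
    return 0;
}

/* Two writers each send one record in chunks of at most wsize bytes,
 * and their requests alternate at the server. */
void sim_two_writers(struct sim_log *log, const char *a, const char *b,
                     size_t wsize)
{
    size_t la = strlen(a), lb = strlen(b), sa = 0, sb = 0;

    if (wsize == 0)
        return;
    while (sa < la || sb < lb) {
        if (sa < la) {
            size_t n = la - sa < wsize ? la - sa : wsize;
            if (sim_server_append(log, a + sa, n) != 0)
                return;
            sa += n;
        }
        if (sb < lb) {
            size_t n = lb - sb < wsize ? lb - sb : wsize;
            if (sim_server_append(log, b + sb, n) != 0)
                return;
            sb += n;
        }
    }
}
```

With records "AAAA" and "BBBB", a wsize of 4 leaves each record contiguous, while a wsize of 2 interleaves the halves of the two records, which is the loss of atomicity described above.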
On Mon, Nov 27, 2023 at 05:08:22PM +0000, Trond Myklebust wrote:
> Note that APPEND would only really work with O_DIRECT, since it is
> anathema to cached I/O to not be able to control the placement of the
> data.
Yes.
> In addition, the model will always break down if someone decides they
> want to write a log entry of size > wsize. Once you have to split up
> the data, you (obviously) lose the atomicity you need in order to write
> a contiguous record.
Yes. Note that there is work going on currently on the various Linux
lists to define atomic I/O limits. In the block layer we also already
have a separate limit for the maximum append size.
Hi Trond, Christoph, and Chuck,
I understand it.
Thanks a lot for your explanation.
Best,
Tao
On Mon, Nov 27, 2023 at 8:51 AM Chuck Lever III <[email protected]> wrote:
>
>
> > On Nov 27, 2023, at 11:36 AM, Christoph Hellwig <[email protected]> wrote:
> >
> > On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
> >>
> >> O_APPEND | O_DIRECT can be used to bypass the client cache for multiple threads writing data without caring of the orders (e.g., logs).
> >>
> >> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
> >> But the key point is that looks like NFS has supported O_APPEND already.
> >> I can successfully open a file with "O_RDWR|O_APPEND".
> >>
> >> My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
>
> O_DIRECT is supposed to not depend on any cached information,
> including the file size, which the client needs to know to
> form an NFS WRITE with the correct offset to ensure it is an
> appending write.
>
> File sizes are managed on the server, so the server needs to
> know that the client is requesting an appending write so it
> knows where to put the payload.
>
>
> > Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
> >
> > Btw, I think an APPEND operation in NFS would be a very good idea, and
> > I'd love to work with interested parties in the IETF on it.
It is not easy to deal with w.r.t. RPC retries.
I suppose a NFSv4.2 extension that either requires (or strongly
recommends) persistent sessions might work?
(Persistent sessions should pretty well guarantee an RPC is not
redone on the server.)
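To show why retries are awkward for an append, here is a minimal simulation. It is a sketch, not any real server's design: struct sim_server and sim_append_rpc() are hypothetical, and the one-entry in-memory reply cache only stands in for whatever a session would provide. A persistent session would, as I understand it, also have to keep that state across a server restart, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simulated server with an optional one-entry reply cache. */
struct sim_server {
    char     data[64];        /* last byte is reserved for a NUL */
    size_t   size;
    bool     use_reply_cache;
    bool     have_cached;
    unsigned cached_seq;
    long     cached_offset;
};

/* Hypothetical APPEND request identified by a sequence number.
 * Returns the offset the payload was placed at, or -1 if full. */
long sim_append_rpc(struct sim_server *s, unsigned seq,
                    const char *buf, size_t len)
{
    long at;

    assert(s != NULL && buf != NULL);
    /* A retransmission of a request that was already executed is
     * answered from the cache instead of being executed again. */
    if (s->use_reply_cache && s->have_cached && s->cached_seq == seq)
        return s->cached_offset;
    if (len > sizeof s->data - 1 - s->size)
        return -1;
    at = (long)s->size;
    memcpy(s->data + s->size, buf, len);
    s->size += len;
    s->data[s->size] = '\0';
    s->have_cached = true;
    s->cached_seq = seq;
    s->cached_offset = at;
    return at;
}
```

If the reply to the first request is lost and the client resends the same sequence number, the server without the cache appends the record twice, whereas the server with the cache returns the original offset and leaves the file unchanged.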
>
> You can write and submit a personal draft that describes it; it
> wouldn't need to be more than a few pages. The hard part of that
> would be accumulating use case descriptions.
>
> I think you could create a proof of concept by including a VERIFY
> operation in front of the WRITE to ensure the WRITE occurs only
> if the offset argument in the WRITE agrees with the file's size
> on the server. If the VERIFY fails, the client grabs the updated
> file size and tries again.
This is what the FreeBSD NFSv4 client does.
Since compounds are not atomic, it is not guaranteed to work and
you might get a lot of "tries again" if multiple clients were doing the
appends on the same file concurrently. (The compound includes a
GETATTR for the size before the VERIFY, so trying again is pretty straightforward.)
rick
>
>
> > Not that
> > we (Damien to be specific) plan to add support to Linux to also report
> > the actual offset an O_APPEND write wrote to through io_uring as we
> > have varios use cases for out of place write data stores for that.
> > It would be great to also support that programming model over NFS.
>
> --
> Chuck Lever
>
>
> I said "proof of concept" -- obviously you don't want this kind of
> racy arrangement as a long-term solution, you just want something
> that works with current server implementations for experimentation.
>
> And, if the above WRITE succeeds, the client would know exactly
> where the server placed the payload in the file.
But I'm not sure how this proof of concept helps me to prove anything
except that this method sucks :)
On Mon, Nov 27, 2023 at 05:50:49PM -0800, Rick Macklem wrote:
> > > Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
> > >
> > > Btw, I think an APPEND operation in NFS would be a very good idea, and
> > > I'd love to work with interested parties in the IETF on it.
> It is not easy to deal with w.r.t. RPC retries.
Indeed.
> I suppose a NFSv4.2 extension that either requires (or strongly
> recommends) persistent sessions might work?
> (Persistent sessions should pretty well guarantee an RPC is not
> redone on the server.)
I guess so. That of course actually means we rely on a viable
implementation of persistent sessions. The Linux server doesn't
support them, and I'm not sure which servers actually do.
On Tue, 2023-11-28 at 05:09 -0800, Christoph Hellwig wrote:
> On Mon, Nov 27, 2023 at 05:50:49PM -0800, Rick Macklem wrote:
> > > > Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
> > > >
> > > > Btw, I think an APPEND operation in NFS would be a very good
> > > > idea, and
> > > > I'd love to work with interested parties in the IETF on it.
> > It is not easy to deal with w.r.t. RPC retries.
>
> Indeed.
>
> > I suppose a NFSv4.2 extension that either requires (or strongly
> > recommends) persistent sessions might work?
> > (Persistent sessions should pretty well guarantee an RPC is not
> > redone on the server.)
>
> I guess so. That of course actually means we rely on a viable
> implementation of persistent sessions. The Linux server doesn't
> support them, and I'm not sure which servers actually do.
Nobody is going to implement the overhead of persistent sessions just
in order to add support for APPEND.
The only thing that will be achieved by tying the functionality to
persistent sessions is to ensure that people will completely ignore the
spec.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]