2021-03-31 17:51:16

by David Wysochanski

Subject: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

Trond,

I've been working on getting NFS converted to dhowells' new fscache and
netfs APIs and I've run into a problem with how NFS is designed; it
involves the NFS pagelist.c / pgio API. I'd appreciate it if you
could review and give your thoughts on possible approaches. I've
tried to outline some of the possibilities below. I tried coding
option #3 and ran into some problems, and it has a serialization
limitation. At this point I'm leaning towards option 2, so I'll
probably try that approach if you don't have time for review or don't
have strong thoughts on it.

Thanks.


Problem: The NFS pageio interface does not expose a max read length that
we can easily use inside the netfs clamp_length() function. As a result,
when issue_op() is called for a single netfs subrequest, that subrequest
can be split into multiple NFS subrequests / RPCs inside the guts of the
NFS pageio code. Multiple NFS subrequests lead to multiple completions,
but the netfs API expects a 1:1 mapping between issue_op() and
netfs_subreq_terminated() calls.

Details of the NFS pageio API (see include/linux/nfs_page.h and
fs/nfs/pagelist.c)
Details of the netfs API (see include/linux/netfs.h and fs/netfs/read_helper.c)

The NFS pageio API's 3 main calls are as follows:
1. nfs_pageio_init(): initialize a pageio structure (R/W IO of N pages)
2. nfs_pageio_add_request(): called for each page to add to an IO
   * Calls nfs_pageio_add_request_mirror -> __nfs_pageio_add_request
   * __nfs_pageio_add_request may call nfs_pageio_doio(), which actually
     sends an RPC over the wire if the page cannot be added to the request
     ("coalesced") due to various factors. For more details, see
     nfs_pageio_do_add_request() and all the underlying code it calls, such
     as nfs_coalesce_size() and subsequent pgio->pg_ops->pg_test() calls
3. nfs_pageio_complete(): "complete" the pageio
   * Calls nfs_pageio_complete_mirror -> nfs_pageio_doio()

The NFS pageio API may thus generate multiple over-the-wire RPCs, and
therefore multiple completions, even though at the high level only
one call to nfs_pageio_complete() is made.
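
For reference, the calling pattern looks roughly like this (an abbreviated
sketch from memory, not verbatim fs/nfs code; the page loop is pseudo-code):

    /*
     * Abbreviated sketch of the read path.  nfs_pageio_doio() may fire
     * from step 2 *or* step 3, so one pass through this sequence can
     * produce several RPCs, each with its own completion.
     */
    struct nfs_pageio_descriptor pgio;

    nfs_pageio_init_read(&pgio, inode, false,
                         &nfs_async_read_completion_ops);          /* 1 */
    for_each_page_in_range(page) {                 /* pseudo-code loop */
            struct nfs_page *req = nfs_create_request(ctx, page, 0, PAGE_SIZE);
            nfs_pageio_add_request(&pgio, req);    /* 2: may call nfs_pageio_doio() */
    }
    nfs_pageio_complete(&pgio);                    /* 3: calls nfs_pageio_doio() */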

Option 1: Just use the NFS pageio API as is, and deal with possible multiple
completions.
- Inconsistent with netfs design intent
- Need to keep track of the RPC completion status: for example, if one RPC
completes with success and another with an error, probably call
netfs_subreq_terminated() with the error (a rough sketch of such
bookkeeping follows below).
- There's no way for the caller of the NFS pageio API to know how
many RPCs, and thus completions, may occur. Thus, it's unclear how
one would distinguish between a READ that resulted in a single RPC
over the wire that completed as a short read, and a READ that
resulted in multiple RPCs that would each complete separately
but would eventually all complete.
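
For illustration, the bookkeeping for this option would be shaped
something like the following (hypothetical structure and names, locking
hand-waved):

    /* Hypothetical per-netfs-subrequest bookkeeping for option 1. */
    struct nfs_netfs_subreq_ctx {
            struct netfs_read_subrequest *subreq;
            atomic_t   rpcs_outstanding;  /* one reference per RPC issued */
            atomic64_t transferred;       /* bytes read so far */
            int        error;             /* first error wins; needs real locking */
    };

    /* Called from each RPC's completion handler. */
    static void nfs_netfs_rpc_done(struct nfs_netfs_subreq_ctx *ctx,
                                   int status, size_t count)
    {
            if (status < 0)
                    ctx->error = ctx->error ?: status;
            else
                    atomic64_add(count, &ctx->transferred);

            /* Only the last RPC to finish reports back to netfs. */
            if (atomic_dec_and_test(&ctx->rpcs_outstanding))
                    netfs_subreq_terminated(ctx->subreq,
                                            ctx->error ?: atomic64_read(&ctx->transferred),
                                            false);
    }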

Option 2: Create a more complex 'clamp_length()' function for NFS,
taking into account all the ways the NFS / pNFS code can split a read.
+ Consistent with netfs design intent
+ Multiple "split" requests would be issued in parallel (see the loop
inside netfs_readahead, which repeatedly calls netfs_rreq_submit_slice)
- Looks impossible without refactoring of the NFS pgio API. We need
to prevent nfs_pageio_add_request() from calling nfs_pageio_doio(),
and return some indication that coalescing failed. In addition, it may
run into problems with fallback from DS to MDS, for example (see
commit d9156f9f364897e93bdd98b4ad22138de18f7c24). A minimal sketch of
the clamp_length() hook follows below.
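
Hypothetical sketch of the hook; it honours rsize only and skips all of
the layout-driven splitting, which is exactly the hard part:

    static bool nfs_netfs_clamp_length(struct netfs_read_subrequest *subreq)
    {
            struct nfs_server *server = NFS_SERVER(subreq->rreq->inode);

            /* Never let a netfs subrequest exceed one RPC's worth of data. */
            subreq->len = min_t(size_t, subreq->len, server->rsize);
            return true;
    }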

Option 3: Utilize the NETFS_SREQ_SHORT_READ flag as needed.
+ Consistent with netfs design intent
- Multiple "split" requests would be serialized (see the code
paths inside netfs_subreq_terminated that check for this flag).
- Looks impossible without some refactoring of the NFS pgio API.
* Notes: Terminate the NFS pageio page-based loop at the first call
to nfs_pageio_doio(). When a READ completes, NFS calls
netfs_subreq_terminated() with NETFS_SREQ_SHORT_READ set
and is prepared to have the rest of the subrequest be resubmitted
(see the sketch below). We'd need to somehow fail early, or avoid
entirely, subsequent calls to nfs_pageio_doio() for the original
request, and handle error status only from the first RPC.
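
That is, on completion of the first RPC we'd do roughly the following
(sketch; 'bytes_from_first_rpc' is a placeholder):

    /* Report only what the first RPC produced and let netfs resubmit
     * the remainder of the subrequest back to us. */
    __set_bit(NETFS_SREQ_SHORT_READ, &subreq->flags);
    netfs_subreq_terminated(subreq, bytes_from_first_rpc, false);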

Option 4: Add some final completion routine to be called near the
bottom of nfs_pageio_complete(), passing in at least the
netfs_read_subrequest, and possibly the nfs_pageio_descriptor.
- Inconsistent with netfs design intent
- Would be a new NFS API or a call on top of everything
- Need to handle the "multiple completions with different
status" problem (see #1).


2021-03-31 18:06:16

by Trond Myklebust

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> Trond,
>
> I've been working on getting NFS converted to dhowells new fscache
> and
> netfs APIs and running into a problem with how NFS is designed and it
> involves the NFS pagelist.c / pgio API.  I'd appreciate it if you
> could review and give your thoughts on possible approaches.  I've
> tried to outline some of the possibilities below.  I tried coding
> option #3 and ran into some problems, and it has a serialization
> limitation.  At this point I'm leaning towards option 2, so I'll
> probably try that approach if you don't have time for review or have
> strong thoughts on it.
>

I am not going through another redesign of the NFS code in order to
accommodate another cachefs design. If netfs needs a refactoring or
redesign of the I/O code then it will be immediately NACKed.

Why does netfs need to know these details about the NFS code anyway?

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2021-03-31 18:38:14

by David Wysochanski

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Wed, Mar 31, 2021 at 2:04 PM Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> > Trond,
> >
> > I've been working on getting NFS converted to dhowells new fscache
> > and
> > netfs APIs and running into a problem with how NFS is designed and it
> > involves the NFS pagelist.c / pgio API. I'd appreciate it if you
> > could review and give your thoughts on possible approaches. I've
> > tried to outline some of the possibilities below. I tried coding
> > option #3 and ran into some problems, and it has a serialization
> > limitation. At this point I'm leaning towards option 2, so I'll
> > probably try that approach if you don't have time for review or have
> > strong thoughts on it.
> >
>
> I am not going through another redesign of the NFS code in order to
> accommodate another cachefs design. If netfs needs a refactoring or
> redesign of the I/O code then it will be immediately NACKed.
>
I don't think it will require a redesign. I was thinking more about
adding a flag to nfs_pageio_add_request(), for example, that
would return a different value if coalescing of the page being
added failed. So we'd have (sketch below):
nfs_pageio_init(): called 1 time
nfs_pageio_add_request(): called N times, once for each page, but stop
if coalescing fails
nfs_pageio_complete(): called 1 time
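
From the netfs issue_op() side that would look roughly like this (sketch
only; stopping on coalesce failure is the hypothetical new part):

    nfs_pageio_init_read(&pgio, inode, false, &nfs_async_read_completion_ops);
    for_each_page_in_subrequest(page) {            /* pseudo-code loop */
            req = nfs_create_request(ctx, page, 0, PAGE_SIZE);
            if (!nfs_pageio_add_request(&pgio, req))
                    break;  /* coalesce failed: stop before pgio fires another RPC */
    }
    nfs_pageio_complete(&pgio);                    /* exactly one RPC on the wire */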

> Why does netfs need to know these details about the NFS code anyway?
>
We can probably get by without it, but it would be awkward and probably
not the best approach; I'm not sure.

I tried to explain below with a problem statement, but maybe it was unclear.
The basic design of the netfs API as it pertains to this problem is
(rough sketch below):
* issue_op(): calls into the specific netfs (NFS) to obtain the data from
the server (send one or more RPCs)
* netfs_subreq_terminated(): when the RPC(s) are completed, we need to call
back into the netfs API to say the data is either there or there was an error
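
i.e. on the NFS side the glue is shaped roughly like this (names
hypothetical):

    /* The contract netfs expects: one issue_op() in, exactly one
     * netfs_subreq_terminated() back out for that subrequest. */
    static void nfs_netfs_issue_op(struct netfs_read_subrequest *subreq)
    {
            /* issue RPC(s) covering subreq->start .. subreq->start + subreq->len */
    }

    /* ...and later, from the read completion path, once everything is done: */
    netfs_subreq_terminated(subreq, transferred_or_error, was_async);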

I would note that assuming we can come up with something acceptable to
NFS, it should simplify both nfs_readpage() and nfs_readpages/nfs_readahead.
I hope we can find some common ground so it's not too invasive to
NFS, and maybe there are some similar improvements in NFS that can
be done along with this interface.


2021-03-31 18:52:45

by Trond Myklebust

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Wed, 2021-03-31 at 14:34 -0400, David Wysochanski wrote:
> On Wed, Mar 31, 2021 at 2:04 PM Trond Myklebust <
> [email protected]> wrote:
> >
> > On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> > > Trond,
> > >
> > > I've been working on getting NFS converted to dhowells new
> > > fscache
> > > and
> > > netfs APIs and running into a problem with how NFS is designed
> > > and it
> > > involves the NFS pagelist.c / pgio API.  I'd appreciate it if you
> > > could review and give your thoughts on possible approaches.  I've
> > > tried to outline some of the possibilities below.  I tried coding
> > > option #3 and ran into some problems, and it has a serialization
> > > limitation.  At this point I'm leaning towards option 2, so I'll
> > > probably try that approach if you don't have time for review or
> > > have
> > > strong thoughts on it.
> > >
> >
> > I am not going through another redesign of the NFS code in order to
> > accommodate another cachefs design. If netfs needs a refactoring or
> > redesign of the I/O code then it will be immediately NACKed.
> >
> I don't think it will require a redesign.  I was thinking more about
> adding a flag to nfs_pageio_add_request() for example that
> would return a different value if coalesce of the page being
> added failed.  So we'd have:
> nfs_pageio_init(): called 1 time
> nfs_pageio_add_request(): called N times, one for each page, but stop
> if
> coalesce fails
> nfs_pageio_complete(): called 1 time
>
> > Why does netfs need to know these details about the NFS code
> > anyway?
> >
> We can probably get by without it but it will be awkward and probably
> not the
> best, but I'm not sure.
>
> I tried to explain below with a problem statement but maybe it was
> unclear.
> The basic design of netfs API as it pertains to this problem is:
> * issue_op(): calls into the specific netfs (NFS) to obtain the data
> from server
> (send one or more RPCs)
> * netfs_subreq_terminated(): when RPC(s) are completed, we need to
> call
> netfs API back to say the data is either there or there was an error
>

That's a problem. That means NFS and netfs need intimate knowledge of
each other's design, and I'm not going to allow us to go there again. We
did that with cachefs, and here we are 10 years later doing a full
redesign. That's unacceptable.

If netfs requires these detailed changes to the NFS code, then that's a
red flag that the whole design is broken and needs to be revised.

--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

http://www.hammer.space

2021-03-31 19:10:33

by David Wysochanski

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Wed, Mar 31, 2021 at 2:51 PM Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2021-03-31 at 14:34 -0400, David Wysochanski wrote:
> > On Wed, Mar 31, 2021 at 2:04 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> > > > Trond,
> > > >
> > > > I've been working on getting NFS converted to dhowells new
> > > > fscache
> > > > and
> > > > netfs APIs and running into a problem with how NFS is designed
> > > > and it
> > > > involves the NFS pagelist.c / pgio API. I'd appreciate it if you
> > > > could review and give your thoughts on possible approaches. I've
> > > > tried to outline some of the possibilities below. I tried coding
> > > > option #3 and ran into some problems, and it has a serialization
> > > > limitation. At this point I'm leaning towards option 2, so I'll
> > > > probably try that approach if you don't have time for review or
> > > > have
> > > > strong thoughts on it.
> > > >
> > >
> > > I am not going through another redesign of the NFS code in order to
> > > accommodate another cachefs design. If netfs needs a refactoring or
> > > redesign of the I/O code then it will be immediately NACKed.
> > >
> > I don't think it will require a redesign. I was thinking more about
> > adding a flag to nfs_pageio_add_request() for example that
> > would return a different value if coalesce of the page being
> > added failed. So we'd have:
> > nfs_pageio_init(): called 1 time
> > nfs_pageio_add_request(): called N times, one for each page, but stop
> > if
> > coalesce fails
> > nfs_pageio_complete(): called 1 time
> >
> > > Why does netfs need to know these details about the NFS code
> > > anyway?
> > >
> > We can probably get by without it but it will be awkward and probably
> > not the
> > best, but I'm not sure.
> >
> > I tried to explain below with a problem statement but maybe it was
> > unclear.
> > The basic design of netfs API as it pertains to this problem is:
> > * issue_op(): calls into the specific netfs (NFS) to obtain the data
> > from server
> > (send one or more RPCs)
> > * netfs_subreq_terminated(): when RPC(s) are completed, we need to
> > call
> > netfs API back to say the data is either there or there was an error
> >
>
> That's a problem. That means NFS and netfs need intimate knowledge of
> each other's design, and I'm not going allow us to go there again. We
> did that with cachefs, and here we are 10 years later doing a full
> redesign. That's unacceptable.
>

I don't think it's a full redesign, and my goal all along has been to be
minimally invasive to existing NFS.

I forgot to mention this part of netfs:
* clamp_length(): netfs calls into NFS here, and we can clamp the length
to a specific size (for example 'rsize' for reads); this is what I'm trying
to see whether I can implement fully, but it looks more complicated due to
coalescing failures, etc. If not, then there are other possible approaches
(NETFS_SREQ_SHORT_READ), but I'm not sure they are ideal.


2021-04-01 17:43:32

by David Wysochanski

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Thu, Apr 1, 2021 at 11:13 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 01, 2021 at 02:51:06PM +0100, David Howells wrote:
> > (1) The way cachefiles reads data from the cache is very hacky (calling
> > readpage on the backing filesystem and then installing an interceptor on
> > the waitqueue for the PG_locked page flag on that page, then memcpying
> > the page in a worker thread) - but it was the only way to do it at the
> > time. Unfortunately, it's fragile and it seems just occasionally the
> > wake event is missed.
> >
> > Since then, kiocb has come along. I really want to switch to using this
> > to read/write the cache. It's a lot more robust and also allows async
> > DIO to be performed, also cutting out the memcpy.
> >
> > Changing the fscache IO part of API would make this easier.
>
> I agree with this. The current way that fscache works is grotesque.
> It knows far too much about the inner workings of, well, everything.
>
> > (3) VM changes are coming that affect the filesystem address space
> > operations. THP is already here, though not rolled out into all
> > filesystems yet. Folios are (probably) on their way. These manage with
> > page aggregation. There's a new readahead interface function too.
> >
> > This means, however, that you might get an aggregate page that is
> > partially cached. In addition, the current fscache IO API cannot deal
> > with these. I think only 9p, afs, ceph, cifs, nfs plus orangefs don't
> > support THPs yet. The first five Willy has held off on because fscache
> > is a complication and there's an opportunity to make a single solution
> > that fits all five.
>
> This isn't quite uptodate:
>
> - The new readahead interface went into Linux in June 2020.
> All filesystems were converted from readpages to readahead except
> for the five above that use fscache. It would be nice to remove the
> readpages operation, but I'm trying not to get in anyone's way here,
> and the fscache->netfs transition was already underway.
> - THPs in Linux today are available to precisely one filesystem --
> shmem/tmpfs. There are problems all over the place with using THPs
> for non-in-memory filesystems.
> - My THP work achieved POC status. I got some of the prerequistite
> bits in, but have now stopped working on it (see next point). It works
> pretty darned well on XFS only. I did do some work towards enabling
> it on NFS, but never tested it. There's a per-filesystem enable bit,
> so in theory NFS never needs to be converted. In practice, you're
> going to want to for the performance boost.
> - I'm taking the lessons learned as part of the THP work (it's confusing
> when a struct page may refer to part of a large memory allocation or
> all of a large memory allocation) and introducing a new data type
> (the struct folio) to refer to chunks of memory in the page cache.
> All filesystems are going to have to be converted to the new API,
> so the fewer places that filesystems actually deal with struct page,
> the easier this makes the transition.
> - I don't understand how a folio gets to be partially cached. Cached
> should be tracked on a per-folio basis (like dirty or uptodate), not
> on a per-page basis. The point of the folio work is that managing
> memory in page-sized chunks is now too small for good performance.
>
> > So with the above, there is an opportunity to abstract handling of the VM I/O
> > ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a common
> > library that handles VM I/O ops and translates them to RPC calls, cache reads
> > and cache writes. The thought is that we should be able to push the
> > aggregation of pages into RPC calls there, handle rsize/wsize and allow
> > requests to be sliced up and so that they can be distributed to multiple
> > servers (works for ceph) so that all five filesystems can get the same
> > benefits in one go.
>
> If NFS wants to do its own handling of rsize/wsize, could it? That is,
> if the VM passes it a 2MB page and says "read it", and the server has
> an rsize of 256kB, could NFS split it up and send its own stream of 8
> requests, or does it have to use fscache to do that?
>

Yes, NFS will split up, say, the 2MB VM read into smaller requests (RPCs)
as needed, based on rsize and other factors such as the pNFS layouts.
It does not need the fscache or netfs APIs to do this - it's always handled it.

The issue I'm trying to resolve is how best to plug the current netfs API
into NFS. Previously, the fscache API was page based, so when an RPC
completed with fscache enabled, we'd call an fscache API on a single
page. The current netfs API specifies a 'subrequest', which is a set of pages.
Originally I thought this would not be a problem, because there are NFS
internal APIs which essentially coalesce a series of pages into one request
(these are the pagelist.c and nfs_page.h functions I referred to). But digging
deeper, it seems a bit more challenging, since the splitting of the requests
by NFS happens deeper than I realized, and so it does not look doable,
for example, to tell netfs how large a single NFS request may be (this is
why the "clamp_length()" function was added to the netfs API).

2021-04-01 17:48:04

by David Howells

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

Matthew Wilcox <[email protected]> wrote:

> - I don't understand how a folio gets to be partially cached. Cached
> should be tracked on a per-folio basis (like dirty or uptodate), not
> on a per-page basis. The point of the folio work is that managing
> memory in page-sized chunks is now too small for good performance.

Consider the following scenario:

(1) You read two 256K chunks from an nfs file. These are downloaded from the
server and cached.

(2) The nfs inode is dropped from the cache.

(3) You do a 1M read that includes both 256K chunks.

The VM might create a 1M folio to handle (3) that covers the entire read, but
only half of the data is in the cache; the other half has to be fetched from
the server.

David

2021-04-01 17:57:27

by David Howells

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

Trond Myklebust <[email protected]> wrote:

> > I've been working on getting NFS converted to dhowells new fscache
> > and
> > netfs APIs and running into a problem with how NFS is designed and it
> > involves the NFS pagelist.c / pgio API. I'd appreciate it if you
> > could review and give your thoughts on possible approaches. I've
> > tried to outline some of the possibilities below. I tried coding
> > option #3 and ran into some problems, and it has a serialization
> > limitation. At this point I'm leaning towards option 2, so I'll
> > probably try that approach if you don't have time for review or have
> > strong thoughts on it.
>
> I am not going through another redesign of the NFS code in order to
> accommodate another cachefs design. If netfs needs a refactoring or
> redesign of the I/O code then it will be immediately NACKed.
>
> Why does netfs need to know these details about the NFS code anyway?

There are some issues we have to deal with in fscache - and some
opportunities.

(1) The way cachefiles reads data from the cache is very hacky (calling
readpage on the backing filesystem and then installing an interceptor on
the waitqueue for the PG_locked page flag on that page, then memcpying
the page in a worker thread) - but it was the only way to do it at the
time. Unfortunately, it's fragile and it seems just occasionally the
wake event is missed.

Since then, kiocb has come along. I really want to switch to using this
to read/write the cache. It's a lot more robust and also allows async
DIO to be performed, cutting out the memcpy.

Changing the fscache IO part of API would make this easier.

(2) The way cachefiles finds out whether data is present (using bmap) is not
viable on ext4 or xfs and has to be changed. This means I have to keep
track of the presence of data myself, separately from the backing
filesystem's own metadata.

To reduce the amount of metadata I need to keep track of and store, I
want to increase the cache granularity - that is I will only store, say,
blocks of 256K. But that needs to feed back up to the filesystem so that
it can ask the VM to expand the readahead.
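
As an illustration of the sort of thing I mean, the netfs lib's
expand_readahead() hook would let the cache round a request out to whole
granules, something like (sketch only, names made up):

    #define CACHE_GRANULE_SIZE (256 * 1024)

    /* Round a read request out to 256K cache granules so that the cache
     * only ever has to track whole granules. */
    static void example_expand_readahead(struct netfs_read_request *rreq)
    {
            loff_t start = round_down(rreq->start, CACHE_GRANULE_SIZE);
            loff_t end   = round_up(rreq->start + rreq->len, CACHE_GRANULE_SIZE);

            rreq->start = start;
            rreq->len   = end - start;
    }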

(3) VM changes are coming that affect the filesystem address space
operations. THP is already here, though not rolled out into all
filesystems yet. Folios are (probably) on their way. Both deal with
page aggregation. There's a new readahead interface function too.

This means, however, that you might get an aggregate page that is
partially cached. In addition, the current fscache IO API cannot deal
with these. I think only 9p, afs, ceph, cifs, nfs plus orangefs don't
support THPs yet. Willy has held off on the first five because fscache
is a complication and there's an opportunity to make a single solution
that fits all five.

Also to this end, I'm trying to make it so that fscache doesn't retain
any pointers back into the network filesystem structures, beyond the info
provided to perform a cache op - and that is only required on a transient
basis.

(4) I'd like to be able to encrypt the data stored in the local cache and
Jeff Layton is adding support for fscrypt to ceph. It would be nice if
we could share the solution with all of the aforementioned bunch of five
filesystems by putting it into the common library.

So with the above, there is an opportunity to abstract the handling of the VM
I/O ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a common
library that translates them into RPC calls, cache reads
and cache writes. The thought is that we should be able to push the
aggregation of pages into RPC calls there, handle rsize/wsize, and allow
requests to be sliced up so that they can be distributed to multiple
servers (works for ceph), so that all five filesystems can get the same
benefits in one go.

Btw, I'm also looking at changing the way indexing works, though that should
only minimally alter the nfs code and doesn't require any restructuring.
I've simplified things a lot and I'm hoping to remove a couple of thousand
lines from fscache and cachefiles.

David

2021-04-01 18:04:00

by Matthew Wilcox

Subject: Re: RFC: Approaches to resolve netfs API interface to NFS multiple completions problem

On Thu, Apr 01, 2021 at 02:51:06PM +0100, David Howells wrote:
> (1) The way cachefiles reads data from the cache is very hacky (calling
> readpage on the backing filesystem and then installing an interceptor on
> the waitqueue for the PG_locked page flag on that page, then memcpying
> the page in a worker thread) - but it was the only way to do it at the
> time. Unfortunately, it's fragile and it seems just occasionally the
> wake event is missed.
>
> Since then, kiocb has come along. I really want to switch to using this
> to read/write the cache. It's a lot more robust and also allows async
> DIO to be performed, also cutting out the memcpy.
>
> Changing the fscache IO part of API would make this easier.

I agree with this. The current way that fscache works is grotesque.
It knows far too much about the inner workings of, well, everything.

> (3) VM changes are coming that affect the filesystem address space
> operations. THP is already here, though not rolled out into all
> filesystems yet. Folios are (probably) on their way. These manage with
> page aggregation. There's a new readahead interface function too.
>
> This means, however, that you might get an aggregate page that is
> partially cached. In addition, the current fscache IO API cannot deal
> with these. I think only 9p, afs, ceph, cifs, nfs plus orangefs don't
> support THPs yet. The first five Willy has held off on because fscache
> is a complication and there's an opportunity to make a single solution
> that fits all five.

This isn't quite up to date:

- The new readahead interface went into Linux in June 2020.
All filesystems were converted from readpages to readahead except
for the five above that use fscache. It would be nice to remove the
readpages operation, but I'm trying not to get in anyone's way here,
and the fscache->netfs transition was already underway.
- THPs in Linux today are available to precisely one filesystem --
shmem/tmpfs. There are problems all over the place with using THPs
for non-in-memory filesystems.
- My THP work achieved POC status. I got some of the prerequisite
bits in, but have now stopped working on it (see next point). It works
pretty darned well on XFS only. I did do some work towards enabling
it on NFS, but never tested it. There's a per-filesystem enable bit,
so in theory NFS never needs to be converted. In practice, you're
going to want to convert it for the performance boost.
- I'm taking the lessons learned as part of the THP work (it's confusing
when a struct page may refer to part of a large memory allocation or
all of a large memory allocation) and introducing a new data type
(the struct folio) to refer to chunks of memory in the page cache.
All filesystems are going to have to be converted to the new API,
so the fewer places that filesystems actually deal with struct page,
the easier this makes the transition.
- I don't understand how a folio gets to be partially cached. Cached
should be tracked on a per-folio basis (like dirty or uptodate), not
on a per-page basis. The point of the folio work is that managing
memory in page-sized chunks is now too small for good performance.

> So with the above, there is an opportunity to abstract handling of the VM I/O
> ops for network filesystems - 9p, afs, ceph, cifs and nfs - into a common
> library that handles VM I/O ops and translates them to RPC calls, cache reads
> and cache writes. The thought is that we should be able to push the
> aggregation of pages into RPC calls there, handle rsize/wsize and allow
> requests to be sliced up and so that they can be distributed to multiple
> servers (works for ceph) so that all five filesystems can get the same
> benefits in one go.

If NFS wants to do its own handling of rsize/wsize, could it? That is,
if the VM passes it a 2MB page and says "read it", and the server has
an rsize of 256kB, could NFS split it up and send its own stream of 8
requests, or does it have to use fscache to do that?