The splice read calls nfsd_splice_actor to put the pages containing file
data into the svc_rqst->rq_pages array. It's possible, however, to get a
splice result that only has a partial page at the end, if (e.g.) the
filesystem hands back a short read that doesn't cover the whole page.

nfsd_splice_actor will plop the partial page into its rq_pages array and
return. Then later, when nfsd_splice_actor is called again, the
remainder of the page may end up being filled out. At this point,
nfsd_splice_actor will put the page into the array _again_, corrupting
the reply. If this is done enough times, rq_next_page will overrun the
array and corrupt the trailing fields -- the rq_respages and
rq_next_page pointers themselves.

If we've already added the page to the array in the last pass, don't add
it to the array a second time when dealing with a splice continuation.
This was originally handled properly in nfsd_splice_actor, but commit
91e23b1c3982 removed the check for it, and started universally replacing
pages.
Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
Reported-by: Dario Lesca <[email protected]>
Tested-by: David Critch <[email protected]>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfsd/vfs.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 502e1b7742db..3709ef57d96e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
struct page *last_page;
last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
- for (page += offset / PAGE_SIZE; page <= last_page; page++)
- svc_rqst_replace_page(rqstp, page);
+ for (page += offset / PAGE_SIZE; page <= last_page; page++) {
+ /* Only replace page if we haven't already done so */
+ if (page != *(rqstp->rq_next_page - 1))
+ svc_rqst_replace_page(rqstp, page);
+ }
if (rqstp->rq_res.page_len == 0) // first call
rqstp->rq_res.page_base = offset % PAGE_SIZE;
rqstp->rq_res.page_len += sd->len;
--
2.39.2
There's no good way to handle this gracefully, but if rq_next_page ends
up pointing outside the array, we can at least crash the box before it
scribbles over too much else.
Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/svc.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index fea7ce8fba14..864e62945647 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -845,6 +845,16 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
*/
void svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
{
+ struct page **begin, **end;
+
+ /*
+ * Bounds check: make sure rq_next_page points into the rq_respages
+ * part of the array.
+ */
+ begin = rqstp->rq_pages;
+ end = &rqstp->rq_pages[RPCSVC_MAXPAGES];
+ BUG_ON(rqstp->rq_next_page < begin || rqstp->rq_next_page > end);
+
if (*rqstp->rq_next_page) {
if (!pagevec_space(&rqstp->rq_pvec))
__pagevec_release(&rqstp->rq_pvec);
--
2.39.2
On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
> The splice read calls nfsd_splice_actor to put the pages containing file
> data into the svc_rqst->rq_pages array. It's possible however to get a
> splice result that only has a partial page at the end, if (e.g.) the
> filesystem hands back a short read that doesn't cover the whole page.
>
> nfsd_splice_actor will plop the partial page into its rq_pages array and
> return. Then later, when nfsd_splice_actor is called again, the
> remainder of the page may end up being filled out. At this point,
> nfsd_splice_actor will put the page into the array _again_ corrupting
> the reply. If this is done enough times, rq_next_page will overrun the
> array and corrupt the trailing fields -- the rq_respages and
> rq_next_page pointers themselves.
>
> If we've already added the page to the array in the last pass, don't add
> it to the array a second time when dealing with a splice continuation.
> This was originally handled properly in nfsd_splice_actor, but commit
> 91e23b1c3982 removed the check for it, and started universally replacing
> pages.
>
> Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
> Reported-by: Dario Lesca <[email protected]>
> Tested-by: David Critch <[email protected]>
> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfsd/vfs.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 502e1b7742db..3709ef57d96e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> struct page *last_page;
>
> last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> - for (page += offset / PAGE_SIZE; page <= last_page; page++)
> - svc_rqst_replace_page(rqstp, page);
> + for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> + /* Only replace page if we haven't already done so */
Note that I think that this was probably the real rationale for the
pp[-1] check that 91e23b1c3982 removed. Given that, maybe we should
flesh this comment out a bit more for posterity?
/*
* When we're splicing from a pipe, it's possible that
* we'll get an incomplete page that may be updated on
* a later call. Only splice it into rq_pages once.
*/
> + if (page != *(rqstp->rq_next_page - 1))
> + svc_rqst_replace_page(rqstp, page);
> + }
> if (rqstp->rq_res.page_len == 0) // first call
> rqstp->rq_res.page_base = offset % PAGE_SIZE;
> rqstp->rq_res.page_len += sd->len;
--
Jeff Layton <[email protected]>
> On Mar 17, 2023, at 6:56 AM, Jeff Layton <[email protected]> wrote:
>
> There's no good way to handle this gracefully, but if rq_next_page ends
> up pointing outside the array, we can at least crash the box before it
> scribbles over too much else.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> net/sunrpc/svc.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index fea7ce8fba14..864e62945647 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -845,6 +845,16 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
> */
> void svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
> {
> + struct page **begin, **end;
> +
> + /*
> + * Bounds check: make sure rq_next_page points into the rq_respages
> + * part of the array.
> + */
> + begin = rqstp->rq_pages;
> + end = &rqstp->rq_pages[RPCSVC_MAXPAGES];
> + BUG_ON(rqstp->rq_next_page < begin || rqstp->rq_next_page > end);
Linus has stated clearly that he does not want BUG_ON assertions
if the system is not actually in danger... and this is clearly
the result of a software bug, so a crash will occur anyway.
Can you make this a pr_warn_once() ?
> +
> if (*rqstp->rq_next_page) {
> if (!pagevec_space(&rqstp->rq_pvec))
> __pagevec_release(&rqstp->rq_pvec);
> --
> 2.39.2
>
--
Chuck Lever
On Fri, 2023-03-17 at 13:44 +0000, Chuck Lever III wrote:
>
> > On Mar 17, 2023, at 6:56 AM, Jeff Layton <[email protected]> wrote:
> >
> > There's no good way to handle this gracefully, but if rq_next_page ends
> > up pointing outside the array, we can at least crash the box before it
> > scribbles over too much else.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > net/sunrpc/svc.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index fea7ce8fba14..864e62945647 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -845,6 +845,16 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
> > */
> > void svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
> > {
> > + struct page **begin, **end;
> > +
> > + /*
> > + * Bounds check: make sure rq_next_page points into the rq_respages
> > + * part of the array.
> > + */
> > + begin = rqstp->rq_pages;
> > + end = &rqstp->rq_pages[RPCSVC_MAXPAGES];
> > + BUG_ON(rqstp->rq_next_page < begin || rqstp->rq_next_page > end);
>
> Linus has stated clearly that he does not want BUG_ON assertions
> if the system is not actually in danger... and this is clearly
> the result of a software bug, so a crash will occur anyway.
>
It'll crash, but only after we scribble over some memory.
Actually, it looks like the splice actor can return an error. We could
return -EIO here or something without doing anything if we hit this case
and then let that bubble back up to the read?
> Can you make this a pr_warn_once() ?
>
>
> > +
> > if (*rqstp->rq_next_page) {
> > if (!pagevec_space(&rqstp->rq_pvec))
> > __pagevec_release(&rqstp->rq_pvec);
> > --
> > 2.39.2
> >
>
> --
> Chuck Lever
>
>
--
Jeff Layton <[email protected]>
> On Mar 17, 2023, at 9:52 AM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2023-03-17 at 13:44 +0000, Chuck Lever III wrote:
>>
>>> On Mar 17, 2023, at 6:56 AM, Jeff Layton <[email protected]> wrote:
>>>
>>> There's no good way to handle this gracefully, but if rq_next_page ends
>>> up pointing outside the array, we can at least crash the box before it
>>> scribbles over too much else.
>>>
>>> Signed-off-by: Jeff Layton <[email protected]>
>>> ---
>>> net/sunrpc/svc.c | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>> index fea7ce8fba14..864e62945647 100644
>>> --- a/net/sunrpc/svc.c
>>> +++ b/net/sunrpc/svc.c
>>> @@ -845,6 +845,16 @@ EXPORT_SYMBOL_GPL(svc_set_num_threads);
>>> */
>>> void svc_rqst_replace_page(struct svc_rqst *rqstp, struct page *page)
>>> {
>>> + struct page **begin, **end;
>>> +
>>> + /*
>>> + * Bounds check: make sure rq_next_page points into the rq_respages
>>> + * part of the array.
>>> + */
>>> + begin = rqstp->rq_pages;
>>> + end = &rqstp->rq_pages[RPCSVC_MAXPAGES];
>>> + BUG_ON(rqstp->rq_next_page < begin || rqstp->rq_next_page > end);
>>
>> Linus has stated clearly that he does not want BUG_ON assertions
>> if the system is not actually in danger... and this is clearly
>> the result of a software bug, so a crash will occur anyway.
>>
>
> It'll crash, but only after we scribble over some memory.
>
> Actually, it looks like the splice actor can return an error. We could
> return -EIO here or something without doing anything if we hit this case
> and then let that bubble back up to the read?
Yes, if it's possible to fail just the READ operation, that
would be best. Maybe a emitting a trace event would be better
than a pr_warn.
>> Can you make this a pr_warn_once() ?
>>
>>
>>> +
>>> if (*rqstp->rq_next_page) {
>>> if (!pagevec_space(&rqstp->rq_pvec))
>>> __pagevec_release(&rqstp->rq_pvec);
>>> --
>>> 2.39.2
>>>
>>
>> --
>> Chuck Lever
>>
>>
>
> --
> Jeff Layton <[email protected]>
--
Chuck Lever
> On Mar 17, 2023, at 9:06 AM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
>> The splice read calls nfsd_splice_actor to put the pages containing file
>> data into the svc_rqst->rq_pages array. It's possible however to get a
>> splice result that only has a partial page at the end, if (e.g.) the
>> filesystem hands back a short read that doesn't cover the whole page.
>>
>> nfsd_splice_actor will plop the partial page into its rq_pages array and
>> return. Then later, when nfsd_splice_actor is called again, the
>> remainder of the page may end up being filled out. At this point,
>> nfsd_splice_actor will put the page into the array _again_ corrupting
>> the reply. If this is done enough times, rq_next_page will overrun the
>> array and corrupt the trailing fields -- the rq_respages and
>> rq_next_page pointers themselves.
>>
>> If we've already added the page to the array in the last pass, don't add
>> it to the array a second time when dealing with a splice continuation.
>> This was originally handled properly in nfsd_splice_actor, but commit
>> 91e23b1c3982 removed the check for it, and started universally replacing
>> pages.
>>
>> Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
>> Reported-by: Dario Lesca <[email protected]>
>> Tested-by: David Critch <[email protected]>
>> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
>> Signed-off-by: Jeff Layton <[email protected]>
>> ---
>> fs/nfsd/vfs.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 502e1b7742db..3709ef57d96e 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> struct page *last_page;
>>
>> last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
>> - for (page += offset / PAGE_SIZE; page <= last_page; page++)
>> - svc_rqst_replace_page(rqstp, page);
>> + for (page += offset / PAGE_SIZE; page <= last_page; page++) {
>> + /* Only replace page if we haven't already done so */
>
> Note that I think that this was probably the real rationale for the
> pp[-1] check that 91e23b1c3982 removed. Given that, maybe we should
> flesh this comment out a bit more for posterity?
>
> /*
> * When we're splicing from a pipe, it's possible that
> * we'll get an incomplete page that may be updated on
> * a later call. Only splice it into rq_pages once.
> */
The "real" bug here is that the API contract for pipe splicing
isn't well defined, so I agree that it's very likely the pp[-1]
check was because a splice can call the actor repeatedly for the
same page. No one could remember why that check was there.
To be clear, if the passed-in page matches the current page in
the rqst, we're "extending the current page" rather than avoiding
replacement... maybe:
/*
* Skip page replacement when extending the contents
* of the current page.
*/
In the patch description, would you mention that this case
arises if the READ request is not page-aligned?
If you resend this patch, please Cc: viro@ . Thanks for chasing
this down!
>> + if (page != *(rqstp->rq_next_page - 1))
>> + svc_rqst_replace_page(rqstp, page);
>> + }
>> if (rqstp->rq_res.page_len == 0) // first call
>> rqstp->rq_res.page_base = offset % PAGE_SIZE;
>> rqstp->rq_res.page_len += sd->len;
>
> --
> Jeff Layton <[email protected]>
--
Chuck Lever
On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
>
> > On Mar 17, 2023, at 9:06 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2023-03-17 at 06:56 -0400, Jeff Layton wrote:
> > > The splice read calls nfsd_splice_actor to put the pages containing file
> > > data into the svc_rqst->rq_pages array. It's possible however to get a
> > > splice result that only has a partial page at the end, if (e.g.) the
> > > filesystem hands back a short read that doesn't cover the whole page.
> > >
> > > nfsd_splice_actor will plop the partial page into its rq_pages array and
> > > return. Then later, when nfsd_splice_actor is called again, the
> > > remainder of the page may end up being filled out. At this point,
> > > nfsd_splice_actor will put the page into the array _again_ corrupting
> > > the reply. If this is done enough times, rq_next_page will overrun the
> > > array and corrupt the trailing fields -- the rq_respages and
> > > rq_next_page pointers themselves.
> > >
> > > If we've already added the page to the array in the last pass, don't add
> > > it to the array a second time when dealing with a splice continuation.
> > > This was originally handled properly in nfsd_splice_actor, but commit
> > > 91e23b1c3982 removed the check for it, and started universally replacing
> > > pages.
> > >
> > > Fixes: 91e23b1c3982 ("NFSD: Clean up nfsd_splice_actor()")
> > > Reported-by: Dario Lesca <[email protected]>
> > > Tested-by: David Critch <[email protected]>
> > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2150630
> > > Signed-off-by: Jeff Layton <[email protected]>
> > > ---
> > > fs/nfsd/vfs.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 502e1b7742db..3709ef57d96e 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -941,8 +941,11 @@ nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> > > struct page *last_page;
> > >
> > > last_page = page + (offset + sd->len - 1) / PAGE_SIZE;
> > > - for (page += offset / PAGE_SIZE; page <= last_page; page++)
> > > - svc_rqst_replace_page(rqstp, page);
> > > + for (page += offset / PAGE_SIZE; page <= last_page; page++) {
> > > + /* Only replace page if we haven't already done so */
> >
> > Note that I think that this was probably the real rationale for the
> > pp[-1] check that 91e23b1c3982 removed. Given that, maybe we should
> > flesh this comment out a bit more for posterity?
> >
> > /*
> > * When we're splicing from a pipe, it's possible that
> > * we'll get an incomplete page that may be updated on
> > * a later call. Only splice it into rq_pages once.
> > */
>
> The "real" bug here is that the API contract for pipe splicing
> isn't well defined, so I agree that it's very likely the pp[-1]
> check was because a splice can call the actor repeatedly for the
> same page. No one could remember why that check was there.
>
The whole splice API is a minefield.
> To be clear, if the passed-in page matches the current page in
> the rqst, we're "extending the current page" rather than avoiding
> replacement... maybe:
>
> /*
> * Skip page replacement when extending the contents
> * of the current page.
> */
>
Sure, sounds good.
> In the patch description, would you mention that this case
> arises if the READ request is not page-aligned?
>
Does it though? I'm not sure that page alignment has that much to do
with it. I imagine you can hit this even with aligned I/Os.
My guess is the bigger issue is when your storage is doing sub-page-size
I/Os under the hood. We end up filling part of a page from storage, the
kernel submits what it has to the pipe, and then the next bit comes in
and the page is updated for the next actor call.
> If you resend this patch, please Cc: viro@ . Thanks for chasing
> this down!
>
Will do.
>
> > > + if (page != *(rqstp->rq_next_page - 1))
> > > + svc_rqst_replace_page(rqstp, page);
> > > + }
> > > if (rqstp->rq_res.page_len == 0) // first call
> > > rqstp->rq_res.page_base = offset % PAGE_SIZE;
> > > rqstp->rq_res.page_len += sd->len;
> >
> > --
> > Jeff Layton <[email protected]>
>
> --
> Chuck Lever
>
>
--
Jeff Layton <[email protected]>
> On Mar 17, 2023, at 10:59 AM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
>
>> In the patch description, would you mention that this case
>> arises if the READ request is not page-aligned?
>
> Does it though? I'm not sure that page alignment has that much to do
> with it. I imagine you can hit this even with aligned I/Os.
Maybe, but no-one has actually seen that. The vast majority of
reports of this problem are with unaligned I/O, which POSIX OS
NFS clients (like the Linux NFS client) usually avoid.
I didn't mean to exclude the possibility of hitting this issue
in other ways, but simply observing a common way it is hit.
--
Chuck Lever
On Fri, 2023-03-17 at 15:04 +0000, Chuck Lever III wrote:
>
> > On Mar 17, 2023, at 10:59 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2023-03-17 at 14:16 +0000, Chuck Lever III wrote:
> >
> > > In the patch description, would you mention that this case
> > > arises if the READ request is not page-aligned?
> >
> > Does it though? I'm not sure that page alignment has that much to do
> > with it. I imagine you can hit this even with aligned I/Os.
>
> Maybe, but no-one has actually seen that. The vast majority of
> reports of this problem are with unaligned I/O, which POSIX OS
> NFS clients (like the Linux NFS client) usually avoid.
>
> I didn't mean to exclude the possibility of hitting this issue
> in other ways, but simply observing a common way it is hit.
>
An unaligned read will consume an extra page, so maybe it just makes it
more likely to overrun the array in that case?
--
Jeff Layton <[email protected]>