Sending Slab or tail pages into ->sendpage will cause really strange
delayed oops. Prevent it right in the networking code instead of
requiring drivers to guess the exact conditions where sendpage works.
Based on a patch from Coly Li <[email protected]>.
Signed-off-by: Christoph Hellwig <[email protected]>
---
net/socket.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/socket.c b/net/socket.c
index dbbe8ea7d395da..b4e65688915fe3 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3638,7 +3638,11 @@ EXPORT_SYMBOL(kernel_getpeername);
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
 		    size_t size, int flags)
 {
-	if (sock->ops->sendpage)
+	/* sendpage manipulates the refcount of the page being sent, which
+	 * does not work for Slab pages, or for tails of non-__GFP_COMP
+	 * high order pages.
+	 */
+	if (sock->ops->sendpage && !PageSlab(page) && page_count(page) > 0)
 		return sock->ops->sendpage(sock, page, offset, size, flags);
 
 	return sock_no_sendpage(sock, page, offset, size, flags);
--
2.28.0
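For reference, the two cases the new check rejects can be shown with a short,
purely illustrative kernel sketch (not part of the patch): kmalloc() memory
lives on slab pages, and the tail pages of a higher-order allocation made
without __GFP_COMP keep a refcount of zero, so the get_page()/put_page() done
on the ->sendpage path corrupts them.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/slab.h>

/* Illustration only: the two kinds of pages kernel_sendpage() now bounces
 * to sock_no_sendpage() instead of handing to ->sendpage.
 */
static void sendpage_unsafe_examples(void)
{
	void *buf = kmalloc(512, GFP_KERNEL);
	struct page *head = alloc_pages(GFP_KERNEL, 1);	/* order-1, no __GFP_COMP */

	if (buf) {
		/* kmalloc() memory sits on a slab page; PageSlab() is true. */
		pr_info("PageSlab: %d\n", PageSlab(virt_to_page(buf)));
		kfree(buf);
	}

	if (head) {
		/* The tail page of a non-__GFP_COMP allocation has
		 * page_count() == 0, so taking and dropping a reference on it
		 * from ->sendpage corrupts the refcount.
		 */
		pr_info("tail page_count: %d\n", page_count(head + 1));
		__free_pages(head, 1);
	}
}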
From: Christoph Hellwig <[email protected]>
Date: Wed, 19 Aug 2020 07:19:45 +0200
> Sending Slab or tail pages into ->sendpage will cause really strange
> delayed oops. Prevent it right in the networking code instead of
> requiring drivers to guess the exact conditions where sendpage works.
>
> Based on a patch from Coly Li <[email protected]>.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
Yes this fixes the problem, but it doesn't in any way deal with the
callers who are doing this stuff.
They are all likely using sendpage because they expect that it will
avoid the copy, for performance reasons or whatever.
Now it won't.
At least with Coly's patch set, the set of violators was documented
and they could switch to allocating non-slab pages or calling
sendmsg() or write() instead.
I hear talk about ABIs just doing the right thing, but when their
value is increased performance vs. other interfaces it means that
taking a slow path silently is bad in the long term. And that's
what this proposed patch here does.
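For a slab-backed buffer, the switch to sendmsg() suggested above is a small
change; a hypothetical sketch, with buf, len and sock standing in for whatever
the caller already has:

	/* Sketch: send a kmalloc()ed buffer through the copying sendmsg()
	 * path instead of ->sendpage; slab pages are fine here because the
	 * data is copied into the skb rather than referenced by page.
	 */
	struct kvec vec = {
		.iov_base = buf,
		.iov_len  = len,
	};
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	int ret;

	ret = kernel_sendmsg(sock, &msg, &vec, 1, len);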
On Wed, Aug 19, 2020 at 12:07:09PM -0700, David Miller wrote:
> Yes this fixes the problem, but it doesn't in any way deal with the
> callers who are doing this stuff.
>
> They are all likely using sendpage because they expect that it will
> avoid the copy, for performance reasons or whatever.
>
> Now it won't.
>
> At least with Coly's patch set, the set of violators was documented
> and they could switch to allocating non-slab pages or calling
> sendmsg() or write() instead.
>
> I hear talk about ABIs just doing the right thing, but when their
> value is increased performance vs. other interfaces it means that
> taking a slow path silently is bad in the long term. And that's
> what this proposed patch here does.
If you look at who uses sendpage outside the networking layer itself
you see that it is basically block drivers and file systems. These
have no way to control what memory they get passed and have to deal
with everything someone throws at them.
So for these callers the requirements are in order of importance:
(1) just send the damn page without generating weird OOPSes
(2) do so as fast as possible
(3) do so without requiring pointless boilerplate code
And I think the current interface fails these requirements really badly.
Having a helper that just does the right thing would really help all of
these users, including those currently using raw ->sendpage over
kernel_sendpage. If you don't want kernel_sendpage to just do the
right thing, we could add another helper, e.g.
kernel_sendpage_or_fallback, but that would seem a little pointless
to me.
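A minimal sketch of what such a helper could look like, reusing the check from
the patch above; the name is just the placeholder mentioned here, not an
existing interface:

	int kernel_sendpage_or_fallback(struct socket *sock, struct page *page,
					int offset, size_t size, int flags)
	{
		/* The zero-copy ->sendpage path takes a page reference, which
		 * only works for non-slab pages that already have a non-zero
		 * refcount.
		 */
		if (sock->ops->sendpage && !PageSlab(page) && page_count(page) > 0)
			return sock->ops->sendpage(sock, page, offset, size, flags);

		/* Everything else goes through the copying path. */
		return sock_no_sendpage(sock, page, offset, size, flags);
	}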
From: Christoph Hellwig <[email protected]>
Date: Thu, 20 Aug 2020 06:37:44 +0200
> If you look at who uses sendpage outside the networking layer itself
> you see that it is basically block drivers and file systems. These
> have no way to control what memory they get passed and have to deal
> with everything someone throws at them.
I see nvme doing virt_to_page() on several things when it calls into
kernel_sendpage().
This is the kind of stuff I want cleaned up, and which your patch
will not trap nor address.
In nvme it sometimes seems to check for sendpage validity:
	/* can't zcopy slab pages */
	if (unlikely(PageSlab(page))) {
		ret = sock_no_sendpage(queue->sock, page, offset, len,
					flags);
	} else {
		ret = kernel_sendpage(queue->sock, page, offset, len,
					flags);
	}
Yet elsewhere does not and just blindly calls:
ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
offset_in_page(pdu) + req->offset, len, flags);
This pdu seems to come from a page frag allocation.
That's the target side. On the host side:
ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
left, flags);
No page slab check or anything like that.
I'm hesitant to put in the kernel_sendpage() patch, because it provides a
disincentive to fix up code like this.
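Until such call sites are reworked, the unguarded host-side call quoted above
could at least grow the same PageSlab guard as the first snippet (a sketch
only; the non-__GFP_COMP tail-page case would still want a page_count() check
on top):

	if (unlikely(PageSlab(page)))
		ret = sock_no_sendpage(cmd->queue->sock, page, cmd->offset,
				       left, flags);
	else
		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
				      left, flags);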
On 2020/8/22 05:14, David Miller wrote:
> From: Christoph Hellwig <[email protected]>
> Date: Thu, 20 Aug 2020 06:37:44 +0200
>
>> If you look at who uses sendpage outside the networking layer itself
>> you see that it is basically block drivers and file systems. These
>> have no way to control what memory they get passed and have to deal
>> with everything someone throws at them.
>
> I see nvme doing virt_to_page() on several things when it calls into
> kernel_sendpage().
>
> This is the kind of stuff I want cleaned up, and which your patch
> will not trap nor address.
>
> In nvme it sometimes seems to check for sendpage validity:
>
> 	/* can't zcopy slab pages */
> 	if (unlikely(PageSlab(page))) {
> 		ret = sock_no_sendpage(queue->sock, page, offset, len,
> 					flags);
> 	} else {
> 		ret = kernel_sendpage(queue->sock, page, offset, len,
> 					flags);
> 	}
>
> Yet elsewhere does not and just blindly calls:
>
> ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
> offset_in_page(pdu) + req->offset, len, flags);
>
> This pdu seems to come from a page frag allocation.
>
> That's the target side. On the host side:
>
> ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
> left, flags);
>
> No page slab check or anything like that.
>
> I'm hesitant to put in the kernel_sendpage() patch, because it provides a
> disincentive to fix up code like this.
>
Hi David and Christoph,
It has been quiet for a while; what should we do next about the
kernel_sendpage() related issue?
Will Christoph's series or mine be considered as the proper fix, or
should I wait for some better idea to show up? Either is fine with me,
as long as the problem gets fixed.
Thanks in advance.
Coly Li
On Fri, Sep 18, 2020 at 04:37:24PM +0800, Coly Li wrote:
> Hi David and Christoph,
>
> It has been quiet for a while; what should we do next about the
> kernel_sendpage() related issue?
>
> Will Christoph's series or mine be considered as the proper fix, or
> should I wait for some better idea to show up? Either is fine with me,
> as long as the problem gets fixed.
I think for all the network storage stuff we really need a "send me
out a page" helper, and the nvmet bits that Dave pointed to look to
me like they actually are currently broken.
Given that Dave doesn't want to change the kernel_sendpage semantics
I'll resend with a new helper instead. Any preferences for a name?
safe_sendpage? kernel_sendpage_safe? kernel_send_one_page?
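Whatever the wrapper ends up being called, the page check itself could also be
factored into a small predicate that the wrapper and open-coded call sites
share; a sketch, with a hypothetical name:

	/* Hypothetical helper: true if the page can safely be handed to
	 * ->sendpage, i.e. it is not slab memory and not a zero-refcount
	 * tail page of a non-__GFP_COMP higher-order allocation.
	 */
	static inline bool page_ok_for_sendpage(struct page *page)
	{
		return !PageSlab(page) && page_count(page) >= 1;
	}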