This series was original by a bug fix in nvme-over-tcp driver which only
checked whether a page was allocated from slab allcoator, but forgot to
check its page_count: The page handled by sendpage should be neither a
Slab page nor 0 page_count page.
As Sagi Grimberg suggested, the original fix is refind to a more common
inline routine:
static inline bool sendpage_ok(struct page *page)
{
return (!PageSlab(page) && page_count(page) >= 1);
}
If sendpage_ok() returns true, the checking page can be handled by the
zero copy sendpage method in network layer.
The first patch in this series introduces sendpage_ok() in header file
include/linux/net.h, the second patch fixes the page checking issue in
nvme-over-tcp driver, the third patch adds page_count check by using
sendpage_ok() in do_tcp_sendpages() as Eric Dumazet suggested, and all
rested patches just replace existing open coded checks with the inline
sendpage_ok() routine.
Coly Li
Cc: Chaitanya Kulkarni <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Lee Duncan <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Mikhail Skorzhinskii <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
---
Changelog:
v7: remove outer brackets from the return line of sendpage_ok() as
Eric Dumazet suggested.
v6: fix page check in do_tcp_sendpages(), as Eric Dumazet suggested.
replace other open coded checks with sendpage_ok() in libceph,
iscsi drivers.
v5, include linux/mm.h in include/linux/net.h
v4, change sendpage_ok() as an inline helper, and post it as
separate patch, as Christoph Hellwig suggested.
v3, introduce a more common sendpage_ok() as Sagi Grimberg suggested.
v2, fix typo in patch subject
v1, the initial version.
Coly Li (6):
net: introduce helper sendpage_ok() in include/linux/net.h
nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage()
tcp: use sendpage_ok() to detect misused .sendpage
drbd: code cleanup by using sendpage_ok() to check page for
kernel_sendpage()
scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map()
libceph: use sendpage_ok() in ceph_tcp_sendpage()
drivers/block/drbd/drbd_main.c | 2 +-
drivers/nvme/host/tcp.c | 7 +++----
drivers/scsi/libiscsi_tcp.c | 2 +-
include/linux/net.h | 16 ++++++++++++++++
net/ceph/messenger.c | 2 +-
net/ipv4/tcp.c | 3 ++-
6 files changed, 24 insertions(+), 8 deletions(-)
--
2.26.2
Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
send slab pages. But for pages allocated by __get_free_pages() without
__GFP_COMP, which also have refcount as 0, they are still sent by
kernel_sendpage() to remote end, this is problematic.
The new introduced helper sendpage_ok() checks both PageSlab tag and
page_count counter, and returns true if the checking page is OK to be
sent by kernel_sendpage().
This patch fixes the page checking issue of nvme_tcp_try_send_data()
with sendpage_ok(). If sendpage_ok() returns true, send this page by
kernel_sendpage(), otherwise use sock_no_sendpage to handle this page.
Signed-off-by: Coly Li <[email protected]>
Cc: Chaitanya Kulkarni <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mikhail Skorzhinskii <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
---
drivers/nvme/host/tcp.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 62fbaecdc960..902fe742762b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -912,12 +912,11 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
- /* can't zcopy slab pages */
- if (unlikely(PageSlab(page))) {
- ret = sock_no_sendpage(queue->sock, page, offset, len,
+ if (sendpage_ok(page)) {
+ ret = kernel_sendpage(queue->sock, page, offset, len,
flags);
} else {
- ret = kernel_sendpage(queue->sock, page, offset, len,
+ ret = sock_no_sendpage(queue->sock, page, offset, len,
flags);
}
if (ret <= 0)
--
2.26.2
commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab
objects") adds the checks for Slab pages, but the pages don't have
page_count are still missing from the check.
Network layer's sendpage method is not designed to send page_count 0
pages neither, therefore both PageSlab() and page_count() should be
both checked for the sending page. This is exactly what sendpage_ok()
does.
This patch uses sendpage_ok() in do_tcp_sendpages() to detect misused
.sendpage, to make the code more robust.
Fixes: a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects")
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: [email protected]
---
net/ipv4/tcp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 31f3b858db81..2135ee7c806d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -970,7 +970,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- WARN_ONCE(PageSlab(page), "page must not be a Slab one"))
+ WARN_ONCE(!sendpage_ok(page),
+ "page must not be a Slab one and have page_count > 0"))
return -EINVAL;
/* Wait for a connection to finish. One exception is TCP Fast Open
--
2.26.2
In _drbd_send_page() a page is checked by following code before sending
it by kernel_sendpage(),
(page_count(page) < 1) || PageSlab(page)
If the check is true, this page won't be send by kernel_sendpage() and
handled by sock_no_sendpage().
This kind of check is exactly what macro sendpage_ok() does, which is
introduced into include/linux/net.h to solve a similar send page issue
in nvme-tcp code.
This patch uses macro sendpage_ok() to replace the open coded checks to
page type and refcount in _drbd_send_page(), as a code cleanup.
Signed-off-by: Coly Li <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Sagi Grimberg <[email protected]>
---
drivers/block/drbd/drbd_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index cb687ccdbd96..55dc0c91781e 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1553,7 +1553,7 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (drbd_disable_sendpage || (page_count(page) < 1) || PageSlab(page))
+ if (drbd_disable_sendpage || !sendpage_ok(page))
return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
msg_flags |= MSG_NOSIGNAL;
--
2.26.2
In iscsci driver, iscsi_tcp_segment_map() uses the following code to
check whether the page should or not be handled by sendpage:
if (!recv && page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg)))
The "page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg)" part is to
make sure the page can be sent to network layer's zero copy path. This
part is exactly what sendpage_ok() does.
This patch uses use sendpage_ok() in iscsi_tcp_segment_map() to replace
the original open coded checks.
Signed-off-by: Coly Li <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Cong Wang <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Lee Duncan <[email protected]>
Cc: Chris Leech <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
---
drivers/scsi/libiscsi_tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/scsi/libiscsi_tcp.c b/drivers/scsi/libiscsi_tcp.c
index 6ef93c7af954..31cd8487c16e 100644
--- a/drivers/scsi/libiscsi_tcp.c
+++ b/drivers/scsi/libiscsi_tcp.c
@@ -128,7 +128,7 @@ static void iscsi_tcp_segment_map(struct iscsi_segment *segment, int recv)
* coalescing neighboring slab objects into a single frag which
* triggers one of hardened usercopy checks.
*/
- if (!recv && page_count(sg_page(sg)) >= 1 && !PageSlab(sg_page(sg)))
+ if (!recv && sendpage_ok(sg_page(sg)))
return;
if (recv) {
--
2.26.2
The original problem was from nvme-over-tcp code, who mistakenly uses
kernel_sendpage() to send pages allocated by __get_free_pages() without
__GFP_COMP flag. Such pages don't have refcount (page_count is 0) on
tail pages, sending them by kernel_sendpage() may trigger a kernel panic
from a corrupted kernel heap, because these pages are incorrectly freed
in network stack as page_count 0 pages.
This patch introduces a helper sendpage_ok(), it returns true if the
checking page,
- is not slab page: PageSlab(page) is false.
- has page refcount: page_count(page) is not zero
All drivers who want to send page to remote end by kernel_sendpage()
may use this helper to check whether the page is OK. If the helper does
not return true, the driver should try other non sendpage method (e.g.
sock_no_sendpage()) to handle the page.
Signed-off-by: Coly Li <[email protected]>
Cc: Chaitanya Kulkarni <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mikhail Skorzhinskii <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
---
include/linux/net.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h
index d48ff1180879..05db8690f67e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -21,6 +21,7 @@
#include <linux/rcupdate.h>
#include <linux/once.h>
#include <linux/fs.h>
+#include <linux/mm.h>
#include <linux/sockptr.h>
#include <uapi/linux/net.h>
@@ -286,6 +287,21 @@ do { \
#define net_get_random_once_wait(buf, nbytes) \
get_random_once_wait((buf), (nbytes))
+/*
+ * E.g. XFS meta- & log-data is in slab pages, or bcache meta
+ * data pages, or other high order pages allocated by
+ * __get_free_pages() without __GFP_COMP, which have a page_count
+ * of 0 and/or have PageSlab() set. We cannot use send_page for
+ * those, as that does get_page(); put_page(); and would cause
+ * either a VM_BUG directly, or __page_cache_release a page that
+ * would actually still be referenced by someone, leading to some
+ * obscure delayed Oops somewhere else.
+ */
+static inline bool sendpage_ok(struct page *page)
+{
+ return !PageSlab(page) && page_count(page) >= 1;
+}
+
int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
size_t num, size_t len);
int kernel_sendmsg_locked(struct sock *sk, struct msghdr *msg,
--
2.26.2
In libceph, ceph_tcp_sendpage() does the following checks before handle
the page by network layer's zero copy sendpage method,
if (page_count(page) >= 1 && !PageSlab(page))
This check is exactly what sendpage_ok() does. This patch replace the
open coded checks by sendpage_ok() as a code cleanup.
Signed-off-by: Coly Li <[email protected]>
Cc: Ilya Dryomov <[email protected]>
Cc: Jeff Layton <[email protected]>
---
net/ceph/messenger.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 27d6ab11f9ee..6a349da7f013 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -575,7 +575,7 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
* coalescing neighboring slab objects into a single frag which
* triggers one of hardened usercopy checks.
*/
- if (page_count(page) >= 1 && !PageSlab(page))
+ if (sendpage_ok(page))
sendpage = sock->ops->sendpage;
else
sendpage = sock_no_sendpage;
--
2.26.2
I think we should go for something simple like this instead:
---
From 4867e158ee86ebd801b4c267e8f8a4a762a71343 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Tue, 18 Aug 2020 18:19:23 +0200
Subject: net: bypass ->sendpage for slab pages
Sending Slab or tail pages into ->sendpage will cause really strange
delayed oops. Prevent it right in the networking code instead of
requiring drivers to work around the fact.
Signed-off-by: Christoph Hellwig <[email protected]>
---
net/socket.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/socket.c b/net/socket.c
index dbbe8ea7d395da..fbc82eb96d18ce 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3638,7 +3638,12 @@ EXPORT_SYMBOL(kernel_getpeername);
int kernel_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
- if (sock->ops->sendpage)
+ /*
+ * sendpage does manipulates the refcount of the passed in page, which
+ * does not work for Slab pages, or for tails of non-__GFP_COMP
+ * high order pages.
+ */
+ if (sock->ops->sendpage && !PageSlab(page) && page_count(page) > 0)
return sock->ops->sendpage(sock, page, offset, size, flags);
return sock_no_sendpage(sock, page, offset, size, flags);
--
2.28.0
On 2020/8/19 00:24, Christoph Hellwig wrote:
> I think we should go for something simple like this instead:
This idea is fine to me. Should a warning message be through here? IMHO
the driver still sends an improper page in, fix it in silence is too
kind or over nice to the buggy driver(s).
And maybe the fix in nvme-tcp driver and do_tcp_sendpages() are still
necessary. I am not network expert, this is my opinion for reference.
Coly Li
> ---
> From 4867e158ee86ebd801b4c267e8f8a4a762a71343 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <[email protected]>
> Date: Tue, 18 Aug 2020 18:19:23 +0200
> Subject: net: bypass ->sendpage for slab pages
>
> Sending Slab or tail pages into ->sendpage will cause really strange
> delayed oops. Prevent it right in the networking code instead of
> requiring drivers to work around the fact.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> net/socket.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/socket.c b/net/socket.c
> index dbbe8ea7d395da..fbc82eb96d18ce 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -3638,7 +3638,12 @@ EXPORT_SYMBOL(kernel_getpeername);
> int kernel_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags)
> {
> - if (sock->ops->sendpage)
> + /*
> + * sendpage does manipulates the refcount of the passed in page, which
> + * does not work for Slab pages, or for tails of non-__GFP_COMP
> + * high order pages.
> + */
> + if (sock->ops->sendpage && !PageSlab(page) && page_count(page) > 0)
> return sock->ops->sendpage(sock, page, offset, size, flags);
>
> return sock_no_sendpage(sock, page, offset, size, flags);
>
On Wed, Aug 19, 2020 at 12:33:37AM +0800, Coly Li wrote:
> On 2020/8/19 00:24, Christoph Hellwig wrote:
> > I think we should go for something simple like this instead:
>
> This idea is fine to me. Should a warning message be through here? IMHO
> the driver still sends an improper page in, fix it in silence is too
> kind or over nice to the buggy driver(s).
I don't think a warning is a good idea. An API that does the right
thing underneath and doesn't require boiler plate code in most callers
is the right API.
On 2020/8/19 03:49, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 12:33:37AM +0800, Coly Li wrote:
>> On 2020/8/19 00:24, Christoph Hellwig wrote:
>>> I think we should go for something simple like this instead:
>>
>> This idea is fine to me. Should a warning message be through here? IMHO
>> the driver still sends an improper page in, fix it in silence is too
>> kind or over nice to the buggy driver(s).
>
> I don't think a warning is a good idea. An API that does the right
> thing underneath and doesn't require boiler plate code in most callers
> is the right API.
>
Then I don't have more comment.
Thanks.
Coly Li
On Wed, Aug 19, 2020 at 12:22:05PM +0800, Coly Li wrote:
> On 2020/8/19 03:49, Christoph Hellwig wrote:
> > On Wed, Aug 19, 2020 at 12:33:37AM +0800, Coly Li wrote:
> >> On 2020/8/19 00:24, Christoph Hellwig wrote:
> >>> I think we should go for something simple like this instead:
> >>
> >> This idea is fine to me. Should a warning message be through here? IMHO
> >> the driver still sends an improper page in, fix it in silence is too
> >> kind or over nice to the buggy driver(s).
> >
> > I don't think a warning is a good idea. An API that does the right
> > thing underneath and doesn't require boiler plate code in most callers
> > is the right API.
> >
>
> Then I don't have more comment.
So given the feedback from Dave I suspect we should actually resurrect
this series, sorry for the noise. And in this case I think we do need
the warning in kernel_sendpage.
On 2020/9/23 16:43, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 12:22:05PM +0800, Coly Li wrote:
>> On 2020/8/19 03:49, Christoph Hellwig wrote:
>>> On Wed, Aug 19, 2020 at 12:33:37AM +0800, Coly Li wrote:
>>>> On 2020/8/19 00:24, Christoph Hellwig wrote:
>>>>> I think we should go for something simple like this instead:
>>>>
>>>> This idea is fine to me. Should a warning message be through here? IMHO
>>>> the driver still sends an improper page in, fix it in silence is too
>>>> kind or over nice to the buggy driver(s).
>>>
>>> I don't think a warning is a good idea. An API that does the right
>>> thing underneath and doesn't require boiler plate code in most callers
>>> is the right API.
>>>
>>
>> Then I don't have more comment.
>
> So given the feedback from Dave I suspect we should actually resurrect
> this series, sorry for the noise. And in this case I think we do need
> the warning in kernel_sendpage.
>
Copied, then I will post a v8 series, which adding a warning message in
kernel_sendpage() if non-acceptible paage sent in.
Coly Li