2023-06-20 15:11:45

by David Howells

Subject: [PATCH net-next v3 00/18] splice, net: Switch over users of sendpage() and remove it

Here's the final set of patches towards the removal of sendpage. All the
drivers that use sendpage() get switched over to using sendmsg() with
MSG_SPLICE_PAGES.

skb_splice_from_iter() is given the facility to copy slab data into
fragments - or to coalesce them (in future) with other unspliced buffers in
the target skbuff. This means that the caller can work the same way, no
matter whether MSG_SPLICE_PAGES is supplied or not and no matter whether
the protocol just ignores it. If MSG_SPLICE_PAGES is not supplied or if it
is just ignored, the data will get copied as normal rather than being
spliced.

For the moment, skb_splice_from_iter() is equipped with its own fragment
allocator - one that has percpu pages to allocate from to deal with
parallel callers but that can also drop the percpu lock around calling the
page allocator.
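
To illustrate the pattern that the conversions below converge on, here's a
minimal sketch (the function name is invented for illustration; the calls
are the ones the patches actually use):

        static int example_send_page(struct socket *sock, struct page *page,
                                     size_t offset, size_t len, bool more)
        {
                struct bio_vec bvec;
                struct msghdr msg = {
                        .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
                };

                if (more)
                        msg.msg_flags |= MSG_MORE;

                /* Aim the iterator at the page to transmit; a protocol that
                 * honours MSG_SPLICE_PAGES splices the page in, one that
                 * doesn't just copies the data as usual.
                 */
                bvec_set_page(&bvec, page, len, offset);
                iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

                return sock_sendmsg(sock, &msg);
        }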

The following changes are made:

(1) Introduce an SMP-safe shared fragment allocator and make
skb_splice_from_iter() use it. The allocator is exported so that
ocfs2 can use it.

This now doesn't alter the existing page_frag_cache allocator.

(2) Expose information from the allocator in /proc/. This is useful for
debugging it, but could be dropped.

(3) Make the protocol drivers behave according to MSG_MORE, not
MSG_SENDPAGE_NOTLAST. The latter is restricted to turning on MSG_MORE
in the sendpage() wrappers.

(4) Make siw, ceph/rds, skb_send_sock, dlm, nvme, smc, ocfs2, drbd and
iscsi use sendmsg(), not sendpage and make them specify MSG_MORE
instead of MSG_SENDPAGE_NOTLAST.

ocfs2 now allocates fragments for a couple of cases where it would
otherwise pass in a pointer to shared data that doesn't seem to have
sufficient locking.

(5) Make drbd coalesce its entire message into a single sendmsg().

(6) Kill off sendpage and clean up MSG_SENDPAGE_NOTLAST.

I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3-frag

David

Changes
=======
ver #3)
- Remove duplicate decl of skb_splice_from_iter().
- In tcp_bpf, reset msg_flags on each iteration to clear MSG_MORE.
- In tcp_bpf, set MSG_MORE if there's more data in the sk_msg.
- Split the nvme patch into host and target patches.

ver #2)
- Wrapped some lines at 80.
- Fixed parameter to put_cpu_ptr() to have an '&'.
- Use "unsigned int" rather than "unsigned".
- Removed duplicate word in comment.
- Filled in the commit message on the last patch.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1
Link: https://lore.kernel.org/r/[email protected]/ # v1

David Howells (18):
net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
net: Display info about MSG_SPLICE_PAGES memory handling in proc
tcp_bpf, smc, tls, espintcp: Reduce MSG_SENDPAGE_NOTLAST usage
siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
nvme/target: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES
ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drbd: Send an entire bio in a single sendmsg
iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
net: Kill MSG_SENDPAGE_NOTLAST

Documentation/bpf/map_sockmap.rst | 10 +-
Documentation/filesystems/locking.rst | 2 -
Documentation/filesystems/vfs.rst | 1 -
Documentation/networking/scaling.rst | 4 +-
crypto/af_alg.c | 28 --
crypto/algif_aead.c | 22 +-
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 14 -
drivers/block/drbd/drbd_main.c | 88 +++---
drivers/infiniband/sw/siw/siw_qp_tx.c | 231 +++-------------
.../chelsio/inline_crypto/chtls/chtls.h | 2 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 14 -
.../chelsio/inline_crypto/chtls/chtls_main.c | 1 -
drivers/nvme/host/tcp.c | 46 ++--
drivers/nvme/target/tcp.c | 46 ++--
drivers/scsi/iscsi_tcp.c | 26 +-
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 15 +-
fs/dlm/lowcomms.c | 10 +-
fs/nfsd/vfs.c | 2 +-
fs/ocfs2/cluster/tcp.c | 109 ++++----
include/crypto/if_alg.h | 2 -
include/linux/net.h | 8 -
include/linux/skbuff.h | 2 +
include/linux/socket.h | 4 +-
include/net/inet_common.h | 2 -
include/net/sock.h | 6 -
include/net/tcp.h | 4 -
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/ceph/messenger_v1.c | 58 ++--
net/ceph/messenger_v2.c | 91 ++-----
net/core/skbuff.c | 257 ++++++++++++++++--
net/core/sock.c | 35 +--
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 --
net/ipv4/tcp.c | 43 +--
net/ipv4/tcp_bpf.c | 28 +-
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 15 -
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/kcm/kcmsock.c | 20 --
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rds/tcp_send.c | 74 ++---
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/smc/af_smc.c | 29 --
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 20 +-
net/smc/smc_tx.h | 2 -
net/socket.c | 48 ----
net/tipc/socket.c | 3 -
net/tls/tls.h | 6 -
net/tls/tls_device.c | 24 +-
net/tls/tls_main.c | 9 +-
net/tls/tls_sw.c | 37 +--
net/unix/af_unix.c | 19 --
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
net/xfrm/espintcp.c | 10 +-
.../perf/trace/beauty/include/linux/socket.h | 1 -
tools/perf/trace/beauty/msg_flags.c | 3 -
89 files changed, 544 insertions(+), 1061 deletions(-)



2023-06-20 15:13:00

by David Howells

Subject: [PATCH net-next v3 06/18] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock(). This causes pages to be spliced from the source iterator
if possible.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front also.
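
As a rough, untested sketch (not part of this patch), that improvement
might look something like gathering all the frags into a bvec array and
doing a single sendmsg on it:

        static int send_skb_frags(struct sock *sk, struct sk_buff *skb)
        {
                struct bio_vec bvec[MAX_SKB_FRAGS];
                struct msghdr msg = {
                        .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
                };
                size_t size = 0;
                int i, nr = skb_shinfo(skb)->nr_frags;

                for (i = 0; i < nr; i++) {
                        const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

                        bvec_set_page(&bvec[i], skb_frag_page(frag),
                                      skb_frag_size(frag), skb_frag_off(frag));
                        size += skb_frag_size(frag);
                }

                iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, nr, size);
                return sock_sendmsg(sk->sk_socket, &msg);
        }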

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---

Notes:
ver #2)
- Wrap lines at 80.

net/core/skbuff.c | 50 ++++++++++++++++++++++++++---------------------
1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 36605510a76d..ee2fc8fe31cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2990,32 +2990,32 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
}
EXPORT_SYMBOL_GPL(skb_splice_bits);

-static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size)
+static int sendmsg_locked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;
+ size_t size = msg_data_left(msg);

if (!sock)
return -EINVAL;
- return kernel_sendmsg(sock, msg, vec, num, size);
+
+ if (!sock->ops->sendmsg_locked)
+ return sock_no_sendmsg_locked(sk, msg, size);
+
+ return sock->ops->sendmsg_locked(sk, msg, size);
}

-static int sendpage_unlocked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;

if (!sock)
return -EINVAL;
- return kernel_sendpage(sock, page, offset, size, flags);
+ return sock_sendmsg(sock, msg);
}

-typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size);
-typedef int (*sendpage_func)(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
+typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg);
static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
- int len, sendmsg_func sendmsg, sendpage_func sendpage)
+ int len, sendmsg_func sendmsg)
{
unsigned int orig_len = len;
struct sk_buff *head = skb;
@@ -3035,8 +3035,9 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
memset(&msg, 0, sizeof(msg));
msg.msg_flags = MSG_DONTWAIT;

- ret = INDIRECT_CALL_2(sendmsg, kernel_sendmsg_locked,
- sendmsg_unlocked, sk, &msg, &kv, 1, slen);
+ iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &kv, 1, slen);
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;

@@ -3067,11 +3068,18 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
slen = min_t(size_t, len, skb_frag_size(frag) - offset);

while (slen) {
- ret = INDIRECT_CALL_2(sendpage, kernel_sendpage_locked,
- sendpage_unlocked, sk,
- skb_frag_page(frag),
- skb_frag_off(frag) + offset,
- slen, MSG_DONTWAIT);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
+ };
+
+ bvec_set_page(&bvec, skb_frag_page(frag), slen,
+ skb_frag_off(frag) + offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1,
+ slen);
+
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;

@@ -3108,16 +3116,14 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
int len)
{
- return __skb_send_sock(sk, skb, offset, len, kernel_sendmsg_locked,
- kernel_sendpage_locked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_locked);
}
EXPORT_SYMBOL_GPL(skb_send_sock_locked);

/* Send skb data on a socket. Socket must be unlocked. */
int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
{
- return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked,
- sendpage_unlocked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked);
}

/**


2023-06-20 15:15:50

by David Howells

Subject: [PATCH net-next v3 12/18] smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES

Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
its sendmsg implementation.

Signed-off-by: David Howells <[email protected]>
cc: Karsten Graul <[email protected]>
cc: Wenjia Zhang <[email protected]>
cc: Jan Karcher <[email protected]>
cc: "D. Wythe" <[email protected]>
cc: Tony Lu <[email protected]>
cc: Wen Gu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/smc/af_smc.c | 29 -----------------------------
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 22 +---------------------
net/smc/smc_tx.h | 2 --
5 files changed, 2 insertions(+), 54 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 538e9c6ec8c9..a7f887d91d89 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -3133,34 +3133,6 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
return put_user(answ, (int __user *)arg);
}

-static ssize_t smc_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- struct smc_sock *smc;
- int rc = -EPIPE;
-
- smc = smc_sk(sk);
- lock_sock(sk);
- if (sk->sk_state != SMC_ACTIVE) {
- release_sock(sk);
- goto out;
- }
- release_sock(sk);
- if (smc->use_fallback) {
- rc = kernel_sendpage(smc->clcsock, page, offset,
- size, flags);
- } else {
- lock_sock(sk);
- rc = smc_tx_sendpage(smc, page, offset, size, flags);
- release_sock(sk);
- SMC_STAT_INC(smc, sendpage_cnt);
- }
-
-out:
- return rc;
-}
-
/* Map the affected portions of the rmbe into an spd, note the number of bytes
* to splice in conn->splice_pending, and press 'go'. Delays consumer cursor
* updates till whenever a respective page has been fully processed.
@@ -3232,7 +3204,6 @@ static const struct proto_ops smc_sock_ops = {
.sendmsg = smc_sendmsg,
.recvmsg = smc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = smc_sendpage,
.splice_read = smc_splice_read,
};

diff --git a/net/smc/smc_stats.c b/net/smc/smc_stats.c
index e80e34f7ac15..ca14c0f3a07d 100644
--- a/net/smc/smc_stats.c
+++ b/net/smc/smc_stats.c
@@ -227,7 +227,7 @@ static int smc_nl_fill_stats_tech_data(struct sk_buff *skb,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SENDPAGE_CNT,
- smc_tech->sendpage_cnt,
+ 0,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CORK_CNT,
diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h
index 84b7ecd8c05c..b60fe1eb37ab 100644
--- a/net/smc/smc_stats.h
+++ b/net/smc/smc_stats.h
@@ -71,7 +71,6 @@ struct smc_stats_tech {
u64 clnt_v2_succ_cnt;
u64 srv_v1_succ_cnt;
u64 srv_v2_succ_cnt;
- u64 sendpage_cnt;
u64 urg_data_cnt;
u64 splice_cnt;
u64 cork_cnt;
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 9b9e0a190734..5147207808e5 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -167,8 +167,7 @@ static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
* sndbuf_space is still available. The applications
* should known how/when to uncork it.
*/
- if ((msg->msg_flags & MSG_MORE ||
- smc_tx_is_corked(smc)) &&
+ if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc)) &&
atomic_read(&conn->sndbuf_space))
return true;

@@ -297,25 +296,6 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
return rc;
}

-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags)
-{
- struct msghdr msg = {.msg_flags = flags};
- char *kaddr = kmap(page);
- struct kvec iov;
- int rc;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
- rc = smc_tx_sendmsg(smc, &msg, size);
- kunmap(page);
- return rc;
-}
-
/***************************** sndbuf consumer *******************************/

/* sndbuf consumer: actual data transfer of one target chunk with ISM write */
diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
index 34b578498b1f..a59f370b8b43 100644
--- a/net/smc/smc_tx.h
+++ b/net/smc/smc_tx.h
@@ -31,8 +31,6 @@ void smc_tx_pending(struct smc_connection *conn);
void smc_tx_work(struct work_struct *work);
void smc_tx_init(struct smc_sock *smc);
int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags);
int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
void smc_tx_sndbuf_nonfull(struct smc_sock *smc);
void smc_tx_consumer_update(struct smc_connection *conn, bool force);


2023-06-20 15:17:14

by David Howells

Subject: [PATCH net-next v3 05/18] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.

Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v1.c | 58 ++++++++++++++---------------------------
1 file changed, 19 insertions(+), 39 deletions(-)

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index d664cb1593a7..f082e5c780a3 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -74,37 +74,6 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
return r;
}

-/*
- * @more: either or both of MSG_MORE and MSG_SENDPAGE_NOTLAST
- */
-static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int more)
-{
- ssize_t (*sendpage)(struct socket *sock, struct page *page,
- int offset, size_t size, int flags);
- int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
- int ret;
-
- /*
- * sendpage cannot properly handle pages with page_count == 0,
- * we need to fall back to sendmsg if that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag which
- * triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(page))
- sendpage = sock->ops->sendpage;
- else
- sendpage = sock_no_sendpage;
-
- ret = sendpage(sock, page, offset, size, flags);
- if (ret == -EAGAIN)
- ret = 0;
-
- return ret;
-}
-
static void con_out_kvec_reset(struct ceph_connection *con)
{
BUG_ON(con->v1.out_skip);
@@ -464,7 +433,6 @@ static int write_partial_message_data(struct ceph_connection *con)
struct ceph_msg *msg = con->out_msg;
struct ceph_msg_data_cursor *cursor = &msg->cursor;
bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
u32 crc;

dout("%s %p msg %p\n", __func__, con, msg);
@@ -482,6 +450,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
while (cursor->total_resid) {
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES,
+ };
struct page *page;
size_t page_offset;
size_t length;
@@ -494,9 +466,12 @@ static int write_partial_message_data(struct ceph_connection *con)

page = ceph_msg_data_next(cursor, &page_offset, &length);
if (length == cursor->total_resid)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
- more);
+ msghdr.msg_flags |= MSG_MORE;
+
+ bvec_set_page(&bvec, page, length, page_offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, length);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0) {
if (do_datacrc)
msg->footer.data_crc = cpu_to_le32(crc);
@@ -526,7 +501,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
static int write_partial_skip(struct ceph_connection *con)
{
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_MORE,
+ };
int ret;

dout("%s %p %d left\n", __func__, con, con->v1.out_skip);
@@ -534,9 +512,11 @@ static int write_partial_skip(struct ceph_connection *con)
size_t size = min(con->v1.out_skip, (int)PAGE_SIZE);

if (size == con->v1.out_skip)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, ceph_zero_page, 0, size,
- more);
+ msghdr.msg_flags &= ~MSG_MORE;
+ bvec_set_page(&bvec, ZERO_PAGE(0), size, 0);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0)
goto out;
con->v1.out_skip -= ret;


2023-06-20 15:20:02

by David Howells

Subject: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
some data that's resident in the slab, copy it rather than returning EIO.
This can be made use of by a number of drivers in the kernel, including:
iwarp, ceph/rds, dlm, nvme, ocfs2, drbd. It could also be used by iscsi,
rxrpc, sunrpc, cifs and probably others.

skb_splice_from_iter() is given its own fragment allocator as
page_frag_alloc_align() can't be used because it does no locking to prevent
parallel callers from racing. alloc_skb_frag() uses a separate folio for
each cpu and locks to the cpu whilst allocating, re-enabling cpu migration
around folio allocation.

A whole page could instead be allocated for each fragment to be copied, as
alloc_skb_with_frags() would do, but that would waste a lot of space (most
of the fragments look like they're going to be small).

This allows an entire message that consists of, say, a protocol header or
two, a number of pages of data and a protocol footer to be sent using a
single call to sock_sendmsg().
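
For illustration, something along these lines could then be done in one go
(a sketch only; hdr, ftr, data_page and the length variables are
hypothetical, with hdr and ftr assumed to be kmalloc'd slab memory that
gets copied whilst the payload page is spliced):

        struct bio_vec bv[3];
        struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };
        int ret;

        bvec_set_virt(&bv[0], hdr, hdr_len);            /* slab header: copied */
        bvec_set_page(&bv[1], data_page, data_len, 0);  /* payload page: spliced */
        bvec_set_virt(&bv[2], ftr, ftr_len);            /* slab footer: copied */

        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 3,
                      hdr_len + data_len + ftr_len);
        ret = sock_sendmsg(sock, &msg);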

The callers could be made to copy the data into fragments before calling
sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.

Signed-off-by: David Howells <[email protected]>
cc: Alexander Duyck <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Menglong Dong <[email protected]>
cc: [email protected]
---

Notes:
ver #3)
- Remove duplicate decl of skb_splice_from_iter().

ver #2)
- Fix parameter to put_cpu_ptr() to have an '&'.

include/linux/skbuff.h | 2 +
net/core/skbuff.c | 171 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..5f53bd5d375d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -5037,6 +5037,8 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
#endif
}

+void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
ssize_t maxsize, gfp_t gfp);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fee2b1c105fe..d962c93a429d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6755,6 +6755,145 @@ nodefer: __kfree_skb(skb);
smp_call_function_single_async(cpu, &sd->defer_csd);
}

+struct skb_splice_frag_cache {
+ struct folio *folio;
+ void *virt;
+ unsigned int offset;
+ /* we maintain a pagecount bias, so that we dont dirty cache line
+ * containing page->_refcount every time we allocate a fragment.
+ */
+ unsigned int pagecnt_bias;
+ bool pfmemalloc;
+};
+
+static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
+
+/**
+ * alloc_skb_frag - Allocate a page fragment for using in a socket
+ * @fragsz: The size of fragment required
+ * @gfp: Allocation flags
+ */
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
+{
+ struct skb_splice_frag_cache *cache;
+ struct folio *folio, *spare = NULL;
+ size_t offset, fsize;
+ void *p;
+
+ if (WARN_ON_ONCE(fragsz == 0))
+ fragsz = 1;
+
+ cache = get_cpu_ptr(&skb_splice_frag_cache);
+reload:
+ folio = cache->folio;
+ offset = cache->offset;
+try_again:
+ if (fragsz > offset)
+ goto insufficient_space;
+
+ /* Make the allocation. */
+ cache->pagecnt_bias--;
+ offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
+ cache->offset = offset;
+ p = cache->virt + offset;
+ put_cpu_ptr(&skb_splice_frag_cache);
+ if (spare)
+ folio_put(spare);
+ return p;
+
+insufficient_space:
+ /* See if we can refurbish the current folio. */
+ if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+ goto get_new_folio;
+ if (unlikely(cache->pfmemalloc)) {
+ __folio_put(folio);
+ goto get_new_folio;
+ }
+
+ fsize = folio_size(folio);
+ if (unlikely(fragsz > fsize))
+ goto frag_too_big;
+
+ /* OK, page count is 0, we can safely set it */
+ folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+ /* Reset page count bias and offset to start of new frag */
+ cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = fsize;
+ goto try_again;
+
+get_new_folio:
+ if (!spare) {
+ cache->folio = NULL;
+ put_cpu_ptr(&skb_splice_frag_cache);
+
+#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
+ spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC,
+ PAGE_FRAG_CACHE_MAX_ORDER);
+ if (!spare)
+#endif
+ spare = folio_alloc(gfp, 0);
+ if (!spare)
+ return NULL;
+
+ cache = get_cpu_ptr(&skb_splice_frag_cache);
+ /* We may now be on a different cpu and/or someone else may
+ * have refilled it
+ */
+ cache->pfmemalloc = folio_is_pfmemalloc(spare);
+ if (cache->folio)
+ goto reload;
+ }
+
+ cache->folio = spare;
+ cache->virt = folio_address(spare);
+ folio = spare;
+ spare = NULL;
+
+ /* Even if we own the page, we do not use atomic_set(). This would
+ * break get_page_unless_zero() users.
+ */
+ folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+ /* Reset page count bias and offset to start of new frag */
+ cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = folio_size(folio);
+ goto try_again;
+
+frag_too_big:
+ /* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+ * but the cache isn't big enough to satisfy the request, this may
+ * happen in low memory conditions. We don't release the cache page
+ * because it could make memory pressure worse so we simply return NULL
+ * here.
+ */
+ cache->offset = offset;
+ put_cpu_ptr(&skb_splice_frag_cache);
+ if (spare)
+ folio_put(spare);
+ return NULL;
+}
+EXPORT_SYMBOL(alloc_skb_frag);
+
+/**
+ * copy_skb_frag - Copy data into a page fragment.
+ * @s: The data to copy
+ * @len: The size of the data
+ * @gfp: Allocation flags
+ */
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
+{
+ void *p;
+
+ p = alloc_skb_frag(len, gfp);
+ if (!p)
+ return NULL;
+
+ return memcpy(p, s, len);
+}
+EXPORT_SYMBOL(copy_skb_frag);
+
static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
size_t offset, size_t len)
{
@@ -6808,17 +6947,43 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
break;
}

+ if (space == 0 &&
+ !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
+ pages[0], off)) {
+ iov_iter_revert(iter, len);
+ break;
+ }
+
i = 0;
do {
struct page *page = pages[i++];
size_t part = min_t(size_t, PAGE_SIZE - off, len);
-
- ret = -EIO;
- if (WARN_ON_ONCE(!sendpage_ok(page)))
+ bool put = false;
+
+ if (PageSlab(page)) {
+ const void *p;
+ void *q;
+
+ p = kmap_local_page(page);
+ q = copy_skb_frag(p + off, part, gfp);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(iter, len);
+ ret = -ENOMEM;
+ goto out;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ } else if (WARN_ON_ONCE(!sendpage_ok(page))) {
+ ret = -EIO;
goto out;
+ }

ret = skb_append_pagefrags(skb, page, off, part,
frag_limit);
+ if (put)
+ put_page(page);
if (ret < 0) {
iov_iter_revert(iter, len);
goto out;


2023-06-20 15:20:24

by David Howells

Subject: [PATCH net-next v3 07/18] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.

Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v2.c | 91 +++++++++--------------------------------
1 file changed, 19 insertions(+), 72 deletions(-)

diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index 301a991dc6a6..87ac97073e75 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -117,91 +117,38 @@ static int ceph_tcp_recv(struct ceph_connection *con)
return ret;
}

-static int do_sendmsg(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- int ret;
-
- msg.msg_iter = *it;
- while (iov_iter_count(it)) {
- ret = sock_sendmsg(sock, &msg);
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- WARN_ON(msg_data_left(&msg));
- return 1;
-}
-
-static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- struct bio_vec bv;
- int ret;
-
- if (WARN_ON(!iov_iter_is_bvec(it)))
- return -EINVAL;
-
- while (iov_iter_count(it)) {
- /* iov_iter_iovec() for ITER_BVEC */
- bvec_set_page(&bv, it->bvec->bv_page,
- min(iov_iter_count(it),
- it->bvec->bv_len - it->iov_offset),
- it->bvec->bv_offset + it->iov_offset);
-
- /*
- * sendpage cannot properly handle pages with
- * page_count == 0, we need to fall back to sendmsg if
- * that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag
- * which triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(bv.bv_page)) {
- ret = sock->ops->sendpage(sock, bv.bv_page,
- bv.bv_offset, bv.bv_len,
- CEPH_MSG_FLAGS);
- } else {
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, bv.bv_len);
- ret = sock_sendmsg(sock, &msg);
- }
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- return 1;
-}
-
/*
* Write as much as possible. The socket is expected to be corked,
- * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
+ * so we don't bother with MSG_MORE here.
*
* Return:
- * 1 - done, nothing (else) to write
+ * >0 - done, nothing (else) to write
* 0 - socket is full, need to wait
* <0 - error
*/
static int ceph_tcp_send(struct ceph_connection *con)
{
+ struct msghdr msg = {
+ .msg_iter = con->v2.out_iter,
+ .msg_flags = CEPH_MSG_FLAGS,
+ };
int ret;

+ if (WARN_ON(!iov_iter_is_bvec(&con->v2.out_iter)))
+ return -EINVAL;
+
+ if (con->v2.out_iter_sendpage)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+
dout("%s con %p have %zu try_sendpage %d\n", __func__, con,
iov_iter_count(&con->v2.out_iter), con->v2.out_iter_sendpage);
- if (con->v2.out_iter_sendpage)
- ret = do_try_sendpage(con->sock, &con->v2.out_iter);
- else
- ret = do_sendmsg(con->sock, &con->v2.out_iter);
+
+ ret = sock_sendmsg(con->sock, &msg);
+ if (ret > 0)
+ iov_iter_advance(&con->v2.out_iter, ret);
+ else if (ret == -EAGAIN)
+ ret = 0;
+
dout("%s con %p ret %d left %zu\n", __func__, con, ret,
iov_iter_count(&con->v2.out_iter));
return ret;


2023-06-20 15:20:26

by David Howells

Subject: [PATCH net-next v3 02/18] net: Display info about MSG_SPLICE_PAGES memory handling in proc

Display information about the memory handling that MSG_SPLICE_PAGES does to
copy slab data into page fragments.

For each CPU that has a cached folio, it displays the folio pfn, the offset
pointer within the folio and the size of the folio.

It also displays the number of pages refurbished and the number of pages
replaced.
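
Based on the format strings added below, the file (/proc/pagefrags) would
produce output along these lines (the figures here are purely illustrative):

        refurb=2 repl=15
        [0] 10d240 61440/65536
        [1] 10d280 4096/65536
        ...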

Signed-off-by: David Howells <[email protected]>
cc: Alexander Duyck <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Menglong Dong <[email protected]>
cc: [email protected]
---
net/core/skbuff.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d962c93a429d..36605510a76d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -83,6 +83,7 @@
#include <linux/user_namespace.h>
#include <linux/indirect_call_wrapper.h>
#include <linux/textsearch.h>
+#include <linux/proc_fs.h>

#include "dev.h"
#include "sock_destructor.h"
@@ -6758,6 +6759,7 @@ nodefer: __kfree_skb(skb);
struct skb_splice_frag_cache {
struct folio *folio;
void *virt;
+ unsigned int fsize;
unsigned int offset;
/* we maintain a pagecount bias, so that we dont dirty cache line
* containing page->_refcount every time we allocate a fragment.
@@ -6767,6 +6769,26 @@ struct skb_splice_frag_cache {
};

static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
+static atomic_t skb_splice_frag_replaced, skb_splice_frag_refurbished;
+
+static int skb_splice_show(struct seq_file *m, void *data)
+{
+ int cpu;
+
+ seq_printf(m, "refurb=%u repl=%u\n",
+ atomic_read(&skb_splice_frag_refurbished),
+ atomic_read(&skb_splice_frag_replaced));
+
+ for_each_possible_cpu(cpu) {
+ const struct skb_splice_frag_cache *cache =
+ per_cpu_ptr(&skb_splice_frag_cache, cpu);
+
+ seq_printf(m, "[%u] %lx %u/%u\n",
+ cpu, folio_pfn(cache->folio),
+ cache->offset, cache->fsize);
+ }
+ return 0;
+}

/**
* alloc_skb_frag - Allocate a page fragment for using in a socket
@@ -6803,17 +6825,21 @@ void *alloc_skb_frag(size_t fragsz, gfp_t gfp)

insufficient_space:
/* See if we can refurbish the current folio. */
- if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+ if (!folio)
goto get_new_folio;
+ if (!folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+ goto replace_folio;
if (unlikely(cache->pfmemalloc)) {
__folio_put(folio);
- goto get_new_folio;
+ goto replace_folio;
}

fsize = folio_size(folio);
if (unlikely(fragsz > fsize))
goto frag_too_big;

+ atomic_inc(&skb_splice_frag_refurbished);
+
/* OK, page count is 0, we can safely set it */
folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);

@@ -6822,6 +6848,8 @@ void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
offset = fsize;
goto try_again;

+replace_folio:
+ atomic_inc(&skb_splice_frag_replaced);
get_new_folio:
if (!spare) {
cache->folio = NULL;
@@ -6848,6 +6876,7 @@ void *alloc_skb_frag(size_t fragsz, gfp_t gfp)

cache->folio = spare;
cache->virt = folio_address(spare);
+ cache->fsize = folio_size(spare);
folio = spare;
spare = NULL;

@@ -6858,7 +6887,7 @@ void *alloc_skb_frag(size_t fragsz, gfp_t gfp)

/* Reset page count bias and offset to start of new frag */
cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = folio_size(folio);
+ offset = cache->fsize;
goto try_again;

frag_too_big:
@@ -7007,3 +7036,10 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
return spliced ?: ret;
}
EXPORT_SYMBOL(skb_splice_from_iter);
+
+static int skb_splice_init(void)
+{
+ proc_create_single("pagefrags", S_IFREG | 0444, NULL, &skb_splice_show);
+ return 0;
+}
+late_initcall(skb_splice_init);


2023-06-20 15:20:27

by David Howells

Subject: [PATCH net-next v3 09/18] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage. This allows ->sendpage() to be replaced by something that
can handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Christine Caulfield <[email protected]>
cc: David Teigland <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
fs/dlm/lowcomms.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 3d3802c47b8b..5c12d8cdfc16 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1395,8 +1395,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
/* Send a message */
static int send_to_sock(struct connection *con)
{
- const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
struct writequeue_entry *e;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+ };
int len, offset, ret;

spin_lock_bh(&con->writequeue_lock);
@@ -1412,8 +1415,9 @@ static int send_to_sock(struct connection *con)
WARN_ON_ONCE(len == 0 && e->users == 0);
spin_unlock_bh(&con->writequeue_lock);

- ret = kernel_sendpage(con->sock, e->page, offset, len,
- msg_flags);
+ bvec_set_page(&bvec, e->page, len, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(con->sock, &msg);
trace_dlm_send(con->nodeid, ret);
if (ret == -EAGAIN || ret == 0) {
lock_sock(con->sock->sk);


2023-06-20 15:20:39

by David Howells

Subject: [PATCH net-next v3 03/18] tcp_bpf, smc, tls, espintcp: Reduce MSG_SENDPAGE_NOTLAST usage

As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't
use it further in than the sendpage methods, but rather translate it to
MSG_MORE and use that instead.

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Karsten Graul <[email protected]>
cc: Wenjia Zhang <[email protected]>
cc: Jan Karcher <[email protected]>
cc: "D. Wythe" <[email protected]>
cc: Tony Lu <[email protected]>
cc: Wen Gu <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: Steffen Klassert <[email protected]>
cc: Herbert Xu <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---

Notes:
ver #3)
- In tcp_bpf, reset msg_flags on each iteration to clear MSG_MORE.
- In tcp_bpf, set MSG_MORE if there's more data in the sk_msg.

net/ipv4/tcp_bpf.c | 5 +++--
net/smc/smc_tx.c | 6 ++++--
net/tls/tls_device.c | 4 ++--
net/xfrm/espintcp.c | 10 ++++++----
4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5a84053ac62b..31d6005cea9b 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -88,9 +88,9 @@ static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
int flags, bool uncharge)
{
+ struct msghdr msghdr = {};
bool apply = apply_bytes;
struct scatterlist *sge;
- struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct page *page;
int size, ret = 0;
u32 off;
@@ -107,11 +107,12 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,

tcp_rate_check_app_limited(sk);
retry:
+ msghdr.msg_flags = flags | MSG_SPLICE_PAGES;
has_tx_ulp = tls_sw_has_ctx_tx(sk);
if (has_tx_ulp)
msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;

- if (flags & MSG_SENDPAGE_NOTLAST)
+ if (size < sge->length && msg->sg.start != msg->sg.end)
msghdr.msg_flags |= MSG_MORE;

bvec_set_page(&bvec, page, size, off);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 45128443f1f1..9b9e0a190734 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -168,8 +168,7 @@ static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
* should known how/when to uncork it.
*/
if ((msg->msg_flags & MSG_MORE ||
- smc_tx_is_corked(smc) ||
- msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
+ smc_tx_is_corked(smc)) &&
atomic_read(&conn->sndbuf_space))
return true;

@@ -306,6 +305,9 @@ int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
struct kvec iov;
int rc;

+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
+
iov.iov_base = kaddr + offset;
iov.iov_len = size;
iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b82770f68807..975299d7213b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -449,7 +449,7 @@ static int tls_push_data(struct sock *sk,
return -sk->sk_err;

flags |= MSG_SENDPAGE_DECRYPTED;
- tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+ tls_push_record_flags = flags | MSG_MORE;

timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
if (tls_is_partially_sent_record(tls_ctx)) {
@@ -532,7 +532,7 @@ static int tls_push_data(struct sock *sk,
if (!size) {
last_record:
tls_push_record_flags = flags;
- if (flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE)) {
+ if (flags & MSG_MORE) {
more = true;
break;
}
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 3504925babdb..d3b3f9e720b3 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,13 +205,15 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
static int espintcp_sendskmsg_locked(struct sock *sk,
struct espintcp_msg *emsg, int flags)
{
- struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+ struct msghdr msghdr = {
+ .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE,
+ };
struct sk_msg *skmsg = &emsg->skmsg;
+ bool more = flags & MSG_MORE;
struct scatterlist *sg;
int done = 0;
int ret;

- msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
sg = &skmsg->sg.data[skmsg->sg.start];
do {
struct bio_vec bvec;
@@ -221,8 +223,8 @@ static int espintcp_sendskmsg_locked(struct sock *sk,

emsg->offset = 0;

- if (sg_is_last(sg))
- msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
+ if (sg_is_last(sg) && !more)
+ msghdr.msg_flags &= ~MSG_MORE;

p = sg_page(sg);
retry:


2023-06-20 15:21:18

by David Howells

Subject: [PATCH net-next v3 16/18] iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage. This allows
multiple pages and multipage folios to be passed through.

TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array for the
entire set of pages it's going to transfer, plus two more (backed by page
fragments) to hold the header and trailer, and then call sendmsg once for
the entire message.

Signed-off-by: David Howells <[email protected]>
cc: Lee Duncan <[email protected]>
cc: Chris Leech <[email protected]>
cc: Mike Christie <[email protected]>
cc: Maurizio Lombardi <[email protected]>
cc: "James E.J. Bottomley" <[email protected]>
cc: "Martin K. Petersen" <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Al Viro <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---

Notes:
ver #2)
- Wrap lines at 80.

drivers/scsi/iscsi_tcp.c | 26 +++++++++---------------
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 15 ++++++++------
3 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 9637d4bc2bc9..9ab8555180a3 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -301,35 +301,32 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,

while (!iscsi_tcp_segment_done(tcp_conn, segment, 0, r)) {
struct scatterlist *sg;
+ struct msghdr msg = {};
+ struct bio_vec bv;
unsigned int offset, copy;
- int flags = 0;

r = 0;
offset = segment->copied;
copy = segment->size - offset;

if (segment->total_copied + segment->size < segment->total_size)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;

if (tcp_sw_conn->queue_recv)
- flags |= MSG_DONTWAIT;
+ msg.msg_flags |= MSG_DONTWAIT;

- /* Use sendpage if we can; else fall back to sendmsg */
if (!segment->data) {
+ if (!tcp_conn->iscsi_conn->datadgst_en)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
sg = segment->sg;
offset += segment->sg_offset + sg->offset;
- r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
- copy, flags);
+ bvec_set_page(&bv, sg_page(sg), copy, offset);
} else {
- struct msghdr msg = { .msg_flags = flags };
- struct kvec iov = {
- .iov_base = segment->data + offset,
- .iov_len = copy
- };
-
- r = kernel_sendmsg(sk, &msg, &iov, 1, copy);
+ bvec_set_virt(&bv, segment->data + offset, copy);
}
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, copy);

+ r = sock_sendmsg(sk, &msg);
if (r < 0) {
iscsi_tcp_segment_unmap(segment);
return r;
@@ -746,7 +743,6 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
sock_no_linger(sk);

iscsi_sw_tcp_conn_set_callbacks(conn);
- tcp_sw_conn->sendpage = tcp_sw_conn->sock->ops->sendpage;
/*
* set receive state machine into initial state
*/
@@ -777,8 +773,6 @@ static int iscsi_sw_tcp_conn_set_param(struct iscsi_cls_conn *cls_conn,
return -ENOTCONN;
}
iscsi_set_param(cls_conn, param, buf, buflen);
- tcp_sw_conn->sendpage = conn->datadgst_en ?
- sock_no_sendpage : tcp_sw_conn->sock->ops->sendpage;
mutex_unlock(&tcp_sw_conn->sock_lock);
break;
case ISCSI_PARAM_MAX_R2T:
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 68e14a344904..d6ec08d7eb63 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -48,7 +48,7 @@ struct iscsi_sw_tcp_conn {
uint32_t sendpage_failures_cnt;
uint32_t discontiguous_hdr_cnt;

- ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+ bool can_splice_to_tcp;
};

struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index b14835fcb033..6231fa4ef5c6 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1129,6 +1129,8 @@ int iscsit_fe_sendpage_sg(
struct iscsit_conn *conn)
{
struct scatterlist *sg = cmd->first_data_sg;
+ struct bio_vec bvec;
+ struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
struct kvec iov;
u32 tx_hdr_size, data_len;
u32 offset = cmd->first_data_sg_off;
@@ -1172,17 +1174,18 @@ int iscsit_fe_sendpage_sg(
u32 space = (sg->length - offset);
u32 sub_len = min_t(u32, data_len, space);
send_pg:
- tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ bvec_set_page(&bvec, sg_page(sg), sub_len, sg->offset + offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, sub_len);
+
+ tx_sent = conn->sock->ops->sendmsg(conn->sock, &msghdr,
+ sub_len);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
- pr_err("tcp_sendpage() returned"
- " -EAGAIN\n");
+ pr_err("sendmsg/splice returned -EAGAIN\n");
goto send_pg;
}

- pr_err("tcp_sendpage() failure: %d\n",
- tx_sent);
+ pr_err("sendmsg/splice failure: %d\n", tx_sent);
return -1;
}



2023-06-20 15:21:30

by David Howells

Subject: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

Signed-off-by: David Howells <[email protected]>
Tested-by: Sagi Grimberg <[email protected]>
Acked-by: Willem de Bruijn <[email protected]>
cc: Keith Busch <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Chaitanya Kulkarni <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---

Notes:
ver #2)
- Wrap lines at 80.

ver #3)
- Split nvme/host from nvme/target changes.

drivers/nvme/host/tcp.c | 46 +++++++++++++++++++++--------------------
1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index bf0230442d57..6f31cdbb696a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -997,25 +997,25 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
u32 h2cdata_left = req->h2cdata_left;

while (true) {
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
+ };
struct page *page = nvme_tcp_req_cur_page(req);
size_t offset = nvme_tcp_req_cur_offset(req);
size_t len = nvme_tcp_req_cur_length(req);
bool last = nvme_tcp_pdu_last_send(req, len);
int req_data_sent = req->data_sent;
- int ret, flags = MSG_DONTWAIT;
+ int ret;

if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;
else
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;

- if (sendpage_ok(page)) {
- ret = kernel_sendpage(queue->sock, page, offset, len,
- flags);
- } else {
- ret = sock_no_sendpage(queue->sock, page, offset, len,
- flags);
- }
+ bvec_set_page(&bvec, page, len, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (ret <= 0)
return ret;

@@ -1054,22 +1054,24 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
bool inline_data = nvme_tcp_has_inline_data(req);
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) + hdgst - req->offset;
- int flags = MSG_DONTWAIT;
int ret;

if (inline_data || nvme_tcp_queue_more(queue))
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;
else
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;

if (queue->hdr_digest && !req->offset)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

- ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len, flags);
+ bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;

@@ -1093,6 +1095,8 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_data_pdu *pdu = nvme_tcp_req_data_pdu(req);
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE, };
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) - req->offset + hdgst;
int ret;
@@ -1101,13 +1105,11 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

if (!req->h2cdata_left)
- ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len,
- MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
- else
- ret = sock_no_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len,
- MSG_DONTWAIT | MSG_MORE);
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+
+ bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;



2023-06-20 15:21:42

by David Howells

Subject: [PATCH net-next v3 14/18] drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Use sendmsg() conditionally with MSG_SPLICE_PAGES in _drbd_send_page()
rather than calling sendpage() or _drbd_no_send_page().

Signed-off-by: David Howells <[email protected]>
cc: Philipp Reisner <[email protected]>
cc: Lars Ellenberg <[email protected]>
cc: "Christoph Böhmwalder" <[email protected]>
cc: Jens Axboe <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---

Notes:
ver #2)
- Wrap lines at 80.

drivers/block/drbd/drbd_main.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 83987e7a5ef2..8a01a18a2550 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1540,7 +1540,8 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
int offset, size_t size, unsigned msg_flags)
{
struct socket *socket = peer_device->connection->data.socket;
- int len = size;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = msg_flags, };
int err = -EIO;

/* e.g. XFS meta- & log-data is in slab pages, which have a
@@ -1549,33 +1550,35 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (drbd_disable_sendpage || !sendpage_ok(page))
- return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
+ if (!drbd_disable_sendpage && sendpage_ok(page))
+ msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
+
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

- msg_flags |= MSG_NOSIGNAL;
drbd_update_congested(peer_device->connection);
do {
int sent;

- sent = socket->ops->sendpage(socket, page, offset, len, msg_flags);
+ sent = sock_sendmsg(socket, &msg);
if (sent <= 0) {
if (sent == -EAGAIN) {
if (we_should_drop_the_connection(peer_device->connection, socket))
break;
continue;
}
- drbd_warn(peer_device->device, "%s: size=%d len=%d sent=%d\n",
- __func__, (int)size, len, sent);
+ drbd_warn(peer_device->device, "%s: size=%d len=%zu sent=%d\n",
+ __func__, (int)size, msg_data_left(&msg),
+ sent);
if (sent < 0)
err = sent;
break;
}
- len -= sent;
- offset += sent;
- } while (len > 0 /* THINK && device->cstate >= C_CONNECTED*/);
+ } while (msg_data_left(&msg)
+ /* THINK && device->cstate >= C_CONNECTED*/);
clear_bit(NET_CONGESTED, &peer_device->connection->flags);

- if (len == 0) {
+ if (!msg_data_left(&msg)) {
err = 0;
peer_device->device->send_cnt += size >> 9;
}


2023-06-20 15:21:43

by David Howells

Subject: [PATCH net-next v3 13/18] ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Fix ocfs2 to use the page fragment allocator rather than kzalloc in order
to allocate the buffers for the handshake message and keepalive request and
reply messages.

Switch from using sendpage() to using sendmsg() + MSG_SPLICE_PAGES so that
sendpage can be phased out.

Signed-off-by: David Howells <[email protected]>

cc: Mark Fasheh <[email protected]>
cc: Joel Becker <[email protected]>
cc: Joseph Qi <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
---

Notes:
ver #2)
- Wrap lines at 80.

fs/ocfs2/cluster/tcp.c | 109 +++++++++++++++++++++++------------------
1 file changed, 60 insertions(+), 49 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index aecbd712a00c..969d347ed9fa 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -110,9 +110,6 @@ static struct work_struct o2net_listen_work;
static struct o2hb_callback_func o2net_hb_up, o2net_hb_down;
#define O2NET_HB_PRI 0x1

-static struct o2net_handshake *o2net_hand;
-static struct o2net_msg *o2net_keep_req, *o2net_keep_resp;
-
static int o2net_sys_err_translations[O2NET_ERR_MAX] =
{[O2NET_ERR_NONE] = 0,
[O2NET_ERR_NO_HNDLR] = -ENOPROTOOPT,
@@ -930,19 +927,22 @@ static int o2net_send_tcp_msg(struct socket *sock, struct kvec *vec,
}

static void o2net_sendpage(struct o2net_sock_container *sc,
- void *kmalloced_virt,
- size_t size)
+ void *virt, size_t size)
{
struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+ struct msghdr msg = {};
+ struct bio_vec bv;
ssize_t ret;

+ bvec_set_virt(&bv, virt, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, size);
+
while (1) {
+ msg.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES;
mutex_lock(&sc->sc_send_lock);
- ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
- virt_to_page(kmalloced_virt),
- offset_in_page(kmalloced_virt),
- size, MSG_DONTWAIT);
+ ret = sock_sendmsg(sc->sc_sock, &msg);
mutex_unlock(&sc->sc_send_lock);
+
if (ret == size)
break;
if (ret == (ssize_t)-EAGAIN) {
@@ -1168,6 +1168,7 @@ static int o2net_process_message(struct o2net_sock_container *sc,
struct o2net_msg *hdr)
{
struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+ struct o2net_msg *keep_resp;
int ret = 0, handler_status;
enum o2net_system_error syserr;
struct o2net_msg_handler *nmh = NULL;
@@ -1186,8 +1187,17 @@ static int o2net_process_message(struct o2net_sock_container *sc,
be32_to_cpu(hdr->status));
goto out;
case O2NET_MSG_KEEP_REQ_MAGIC:
- o2net_sendpage(sc, o2net_keep_resp,
- sizeof(*o2net_keep_resp));
+ keep_resp = alloc_skb_frag(sizeof(*keep_resp),
+ GFP_KERNEL);
+ if (!keep_resp) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memset(keep_resp, 0, sizeof(*keep_resp));
+ keep_resp->magic =
+ cpu_to_be16(O2NET_MSG_KEEP_RESP_MAGIC);
+ o2net_sendpage(sc, keep_resp, sizeof(*keep_resp));
+ folio_put(virt_to_folio(keep_resp));
goto out;
case O2NET_MSG_KEEP_RESP_MAGIC:
goto out;
@@ -1439,15 +1449,23 @@ static void o2net_rx_until_empty(struct work_struct *work)
sc_put(sc);
}

-static void o2net_initialize_handshake(void)
+static struct o2net_handshake *o2net_initialize_handshake(void)
{
- o2net_hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(
- O2HB_MAX_WRITE_TIMEOUT_MS);
- o2net_hand->o2net_idle_timeout_ms = cpu_to_be32(o2net_idle_timeout());
- o2net_hand->o2net_keepalive_delay_ms = cpu_to_be32(
- o2net_keepalive_delay());
- o2net_hand->o2net_reconnect_delay_ms = cpu_to_be32(
- o2net_reconnect_delay());
+ struct o2net_handshake *hand;
+
+ hand = alloc_skb_frag(sizeof(*hand), GFP_KERNEL);
+ if (!hand)
+ return NULL;
+
+ memset(hand, 0, sizeof(*hand));
+ hand->protocol_version = cpu_to_be64(O2NET_PROTOCOL_VERSION);
+ hand->connector_id = cpu_to_be64(1);
+ hand->o2hb_heartbeat_timeout_ms =
+ cpu_to_be32(O2HB_MAX_WRITE_TIMEOUT_MS);
+ hand->o2net_idle_timeout_ms = cpu_to_be32(o2net_idle_timeout());
+ hand->o2net_keepalive_delay_ms = cpu_to_be32(o2net_keepalive_delay());
+ hand->o2net_reconnect_delay_ms = cpu_to_be32(o2net_reconnect_delay());
+ return hand;
}

/* ------------------------------------------------------------ */
@@ -1456,16 +1474,22 @@ static void o2net_initialize_handshake(void)
* rx path will see the response and mark the sc valid */
static void o2net_sc_connect_completed(struct work_struct *work)
{
+ struct o2net_handshake *hand;
struct o2net_sock_container *sc =
container_of(work, struct o2net_sock_container,
sc_connect_work);

+ hand = o2net_initialize_handshake();
+ if (!hand)
+ goto out;
+
mlog(ML_MSG, "sc sending handshake with ver %llu id %llx\n",
(unsigned long long)O2NET_PROTOCOL_VERSION,
- (unsigned long long)be64_to_cpu(o2net_hand->connector_id));
+ (unsigned long long)be64_to_cpu(hand->connector_id));

- o2net_initialize_handshake();
- o2net_sendpage(sc, o2net_hand, sizeof(*o2net_hand));
+ o2net_sendpage(sc, hand, sizeof(*hand));
+ folio_put(virt_to_folio(hand));
+out:
sc_put(sc);
}

@@ -1475,8 +1499,15 @@ static void o2net_sc_send_keep_req(struct work_struct *work)
struct o2net_sock_container *sc =
container_of(work, struct o2net_sock_container,
sc_keepalive_work.work);
+ struct o2net_msg *keep_req;

- o2net_sendpage(sc, o2net_keep_req, sizeof(*o2net_keep_req));
+ keep_req = alloc_skb_frag(sizeof(*keep_req), GFP_KERNEL);
+ if (keep_req) {
+ memset(keep_req, 0, sizeof(*keep_req));
+ keep_req->magic = cpu_to_be16(O2NET_MSG_KEEP_REQ_MAGIC);
+ o2net_sendpage(sc, keep_req, sizeof(*keep_req));
+ folio_put(virt_to_folio(keep_req));
+ }
sc_put(sc);
}

@@ -1780,6 +1811,7 @@ static int o2net_accept_one(struct socket *sock, int *more)
struct socket *new_sock = NULL;
struct o2nm_node *node = NULL;
struct o2nm_node *local_node = NULL;
+ struct o2net_handshake *hand;
struct o2net_sock_container *sc = NULL;
struct o2net_node *nn;
unsigned int nofs_flag;
@@ -1882,8 +1914,11 @@ static int o2net_accept_one(struct socket *sock, int *more)
o2net_register_callbacks(sc->sc_sock->sk, sc);
o2net_sc_queue_work(sc, &sc->sc_rx_work);

- o2net_initialize_handshake();
- o2net_sendpage(sc, o2net_hand, sizeof(*o2net_hand));
+ hand = o2net_initialize_handshake();
+ if (hand) {
+ o2net_sendpage(sc, hand, sizeof(*hand));
+ folio_put(virt_to_folio(hand));
+ }

out:
if (new_sock)
@@ -2090,21 +2125,8 @@ int o2net_init(void)
unsigned long i;

o2quo_init();
-
o2net_debugfs_init();

- o2net_hand = kzalloc(sizeof(struct o2net_handshake), GFP_KERNEL);
- o2net_keep_req = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL);
- o2net_keep_resp = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL);
- if (!o2net_hand || !o2net_keep_req || !o2net_keep_resp)
- goto out;
-
- o2net_hand->protocol_version = cpu_to_be64(O2NET_PROTOCOL_VERSION);
- o2net_hand->connector_id = cpu_to_be64(1);
-
- o2net_keep_req->magic = cpu_to_be16(O2NET_MSG_KEEP_REQ_MAGIC);
- o2net_keep_resp->magic = cpu_to_be16(O2NET_MSG_KEEP_RESP_MAGIC);
-
for (i = 0; i < ARRAY_SIZE(o2net_nodes); i++) {
struct o2net_node *nn = o2net_nn_from_num(i);

@@ -2122,21 +2144,10 @@ int o2net_init(void)
}

return 0;
-
-out:
- kfree(o2net_hand);
- kfree(o2net_keep_req);
- kfree(o2net_keep_resp);
- o2net_debugfs_exit();
- o2quo_exit();
- return -ENOMEM;
}

void o2net_exit(void)
{
o2quo_exit();
- kfree(o2net_hand);
- kfree(o2net_keep_req);
- kfree(o2net_keep_resp);
o2net_debugfs_exit();
}


2023-06-21 10:23:34

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

One comment:

format for title in nvme-tcp is:

"nvme-tcp: ..." for host patches, and
"nvmet-tcp: ..." for target patches.

But this can be fixed up when applying the patch set.

Other than that, for both nvme patches:
Reviewed-by: Sagi Grimberg <[email protected]>

What tree will this be going from btw?

On 6/20/23 17:53, David Howells wrote:
> When transmitting data, call down into TCP using a single sendmsg with
> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
> performing several sendmsg and sendpage calls to transmit header, data
> pages and trailer.
>
> Signed-off-by: David Howells <[email protected]>
> Tested-by: Sagi Grimberg <[email protected]>
> Acked-by: Willem de Bruijn <[email protected]>
> cc: Keith Busch <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Christoph Hellwig <[email protected]>
> cc: Chaitanya Kulkarni <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>
> Notes:
> ver #2)
> - Wrap lines at 80.
>
> ver #3)
> - Split nvme/host from nvme/target changes.
>
> drivers/nvme/host/tcp.c | 46 +++++++++++++++++++++--------------------
> 1 file changed, 24 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index bf0230442d57..6f31cdbb696a 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -997,25 +997,25 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
> u32 h2cdata_left = req->h2cdata_left;
>
> while (true) {
> + struct bio_vec bvec;
> + struct msghdr msg = {
> + .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
> + };
> struct page *page = nvme_tcp_req_cur_page(req);
> size_t offset = nvme_tcp_req_cur_offset(req);
> size_t len = nvme_tcp_req_cur_length(req);
> bool last = nvme_tcp_pdu_last_send(req, len);
> int req_data_sent = req->data_sent;
> - int ret, flags = MSG_DONTWAIT;
> + int ret;
>
> if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> - flags |= MSG_EOR;
> + msg.msg_flags |= MSG_EOR;
> else
> - flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> + msg.msg_flags |= MSG_MORE;
>
> - if (sendpage_ok(page)) {
> - ret = kernel_sendpage(queue->sock, page, offset, len,
> - flags);
> - } else {
> - ret = sock_no_sendpage(queue->sock, page, offset, len,
> - flags);
> - }
> + bvec_set_page(&bvec, page, len, offset);
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + ret = sock_sendmsg(queue->sock, &msg);
> if (ret <= 0)
> return ret;
>
> @@ -1054,22 +1054,24 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
> {
> struct nvme_tcp_queue *queue = req->queue;
> struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
> + struct bio_vec bvec;
> + struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
> bool inline_data = nvme_tcp_has_inline_data(req);
> u8 hdgst = nvme_tcp_hdgst_len(queue);
> int len = sizeof(*pdu) + hdgst - req->offset;
> - int flags = MSG_DONTWAIT;
> int ret;
>
> if (inline_data || nvme_tcp_queue_more(queue))
> - flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> + msg.msg_flags |= MSG_MORE;
> else
> - flags |= MSG_EOR;
> + msg.msg_flags |= MSG_EOR;
>
> if (queue->hdr_digest && !req->offset)
> nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>
> - ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
> - offset_in_page(pdu) + req->offset, len, flags);
> + bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + ret = sock_sendmsg(queue->sock, &msg);
> if (unlikely(ret <= 0))
> return ret;
>
> @@ -1093,6 +1095,8 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
> {
> struct nvme_tcp_queue *queue = req->queue;
> struct nvme_tcp_data_pdu *pdu = nvme_tcp_req_data_pdu(req);
> + struct bio_vec bvec;
> + struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE, };
> u8 hdgst = nvme_tcp_hdgst_len(queue);
> int len = sizeof(*pdu) - req->offset + hdgst;
> int ret;
> @@ -1101,13 +1105,11 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
> nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
>
> if (!req->h2cdata_left)
> - ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
> - offset_in_page(pdu) + req->offset, len,
> - MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
> - else
> - ret = sock_no_sendpage(queue->sock, virt_to_page(pdu),
> - offset_in_page(pdu) + req->offset, len,
> - MSG_DONTWAIT | MSG_MORE);
> + msg.msg_flags |= MSG_SPLICE_PAGES;
> +
> + bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + ret = sock_sendmsg(queue->sock, &msg);
> if (unlikely(ret <= 0))
> return ret;
>
>

2023-06-21 12:41:58

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

Sagi Grimberg <[email protected]> wrote:

> What tree will this be going from btw?

It's aimed at net-next, as mentioned in the subject line.

Thanks,
David


2023-06-21 14:34:24

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage


>> What tree will this be going from btw?
>
> It's aimed at net-next, as mentioned in the subject line.

ok, thanks.

2023-06-22 19:04:54

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Tue, 20 Jun 2023 15:53:20 +0100 David Howells wrote:
> If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> some data that's resident in the slab, copy it rather than returning EIO.

How did that happen? I thought MSG_SPLICE_PAGES comes from former
sendpage users and sendpage can't operate on slab pages.

> This can be made use of by a number of drivers in the kernel, including:
> iwarp, ceph/rds, dlm, nvme, ocfs2, drbd. It could also be used by iscsi,
> rxrpc, sunrpc, cifs and probably others.
>
> skb_splice_from_iter() is given its own fragment allocator as
> page_frag_alloc_align() can't be used because it does no locking to prevent
> parallel callers from racing.

The locking is to local_bh_disable(). Does the milliont^w new frag
allocator have any additional benefits?

> alloc_skb_frag() uses a separate folio for
> each cpu and locks to the cpu whilst allocating, reenabling cpu migration
> around folio allocation.

2023-06-22 19:07:26

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Thu, Jun 22, 2023 at 11:12 AM Jakub Kicinski <[email protected]> wrote:
>
> On Tue, 20 Jun 2023 15:53:20 +0100 David Howells wrote:
> > If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> > some data that's resident in the slab, copy it rather than returning EIO.
>
> How did that happen? I thought MSG_SPLICE_PAGES comes from former
> sendpage users and sendpage can't operate on slab pages.
>
> > This can be made use of by a number of drivers in the kernel, including:
> > iwarp, ceph/rds, dlm, nvme, ocfs2, drbd. It could also be used by iscsi,
> > rxrpc, sunrpc, cifs and probably others.
> >
> > skb_splice_from_iter() is given its own fragment allocator as
> > page_frag_alloc_align() can't be used because it does no locking to prevent
> > parallel callers from racing.
>
> The locking is to local_bh_disable(). Does the milliont^w new frag
> allocator have any additional benefits?

Actually I would be concerned about it causing confusion, since it is
called out as being how we do it for sockets but we already have
several different socket-based setups using skb_page_frag_refill() and
the like.

2023-06-22 20:58:22

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Jakub Kicinski <[email protected]> wrote:

> > If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> > some data that's resident in the slab, copy it rather than returning EIO.
>
> How did that happen? I thought MSG_SPLICE_PAGES comes from former
> sendpage users and sendpage can't operate on slab pages.

Some of my patches, take the siw one for example, now aggregate all the bits
that make up a message into a single sendmsg() call, including any protocol
header and trailer in the same bio_vec[] as the payload where before it would
have to do, say, sendmsg+sendpage+sendpage+...+sendpage+sendmsg.
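
Roughly, the shape of that is something like this (a sketch only; the
hdr/trailer buffers, the payload page and the socket pointer are made-up
names, not taken from the siw patch itself):

struct bio_vec bv[3];
struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
int ret;

bvec_set_virt(&bv[0], hdr, hdr_len);		/* protocol header (slab/stack) */
bvec_set_page(&bv[1], data_page, data_len, data_off); /* payload page */
bvec_set_virt(&bv[2], trailer, trailer_len);	/* protocol trailer */

iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 3,
	      hdr_len + data_len + trailer_len);
ret = sock_sendmsg(sock, &msg);			/* one call for the whole message */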

I'm trying to make it so that I make the minimum number of sendmsg calls
(ie. 1 where possible) and the loop that processes the data is inside of that.
This offers the opportunity, at least in the future, to append slab data to an
already-existing private fragment in the skbuff.

> The locking is to local_bh_disable(). Does the milliont^w new frag
> allocator have any additional benefits?

It is shareable because it does locking. Multiple sockets of multiple
protocols can share the pages it has reserved. It drops the lock around calls
to the page allocator so that GFP_KERNEL/GFP_NOFS can be used with it.

Without this, the page fragment allocator would need to be per-socket, I
think, or be done further up the stack where the higher level drivers would
have to have a fragment bucket per whatever unit they use to deal with the
lack of locking.

Doing it here makes cleanup simpler since I just transfer my ref on the
fragment to the skbuff frag list and it will automatically be cleaned up with
the skbuff.

Willy suggested that I just allocate a page for each thing I want to copy, but
I would rather not do that for, say, an 8-byte bit of protocol data.

David


2023-06-22 21:15:59

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Thu, 22 Jun 2023 20:40:43 +0100 David Howells wrote:
> > How did that happen? I thought MSG_SPLICE_PAGES comes from former
> > sendpage users and sendpage can't operate on slab pages.
>
> Some of my patches, take the siw one for example, now aggregate all the bits
> that make up a message into a single sendmsg() call, including any protocol
> header and trailer in the same bio_vec[] as the payload where before it would
> have to do, say, sendmsg+sendpage+sendpage+...+sendpage+sendmsg.

Maybe it's just me but I'd prefer to keep the clear rule that splice
operates on pages not slab objects. SIW is the software / fake
implementation of RDMA, right? You couldn't have picked a less
important user :(

Paolo indicated that he'll take a look tomorrow, we'll see what he
thinks.

> I'm trying to make it so that I make the minimum number of sendmsg calls
> (ie. 1 where possible) and the loop that processes the data is inside of that.

The in-kernel users can be fixed to not use slab, and user space can't
feed us slab objects.

> This offers the opportunity, at least in the future, to append slab data to an
> already-existing private fragment in the skbuff.

Maybe we can get Eric to comment. The ability to identify "frag type"
seems cool indeed, but I haven't thought about using it to attach
slab objects.

> > The locking is to local_bh_disable(). Does the milliont^w new frag
> > allocator have any additional benefits?
>
> It is shareable because it does locking. Multiple sockets of multiple
> protocols can share the pages it has reserved. It drops the lock around calls
> to the page allocator so that GFP_KERNEL/GFP_NOFS can be used with it.
>
> Without this, the page fragment allocator would need to be per-socket, I
> think, or be done further up the stack where the higher level drivers would
> have to have a fragment bucket per whatever unit they use to deal with the
> lack of locking.

There's also the per task frag which can be used under normal conditions
(sk_use_task_frag).

> Doing it here makes cleanup simpler since I just transfer my ref on the
> fragment to the skbuff frag list and it will automatically be cleaned up with
> the skbuff.
>
> Willy suggested that I just allocate a page for each thing I want to copy, but
> I would rather not do that for, say, an 8-byte bit of protocol data.

TBH my intuition would also be get a full page and let the callers who
care about performance fix themselves. Assuming we want to let slab
objects in in the first place.

2023-06-22 23:34:49

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Jakub Kicinski <[email protected]> wrote:

> Maybe it's just me but I'd prefer to keep the clear rule that splice
> operates on pages not slab objects.

sendpage isn't only being used for splice(). Or were you referring to
splicing pages into socket buffers more generally?

> SIW is the software / fake implementation of RDMA, right? You couldn't have
> picked a less important user :(

ISCSI and sunrpc could both make use of this, as could ceph and others. I
have patches for sunrpc to make it condense into a single bio_vec[] and
sendmsg() in the server code (ie. nfsd) but for the moment, Chuck wanted me to
just do the xdr payload.

> > This offers the opportunity, at least in the future, to append slab data
> > to an already-existing private fragment in the skbuff.
>
> Maybe we can get Eric to comment. The ability to identify "frag type"
> seems cool indeed, but I haven't thought about using it to attach
> slab objects.

Unfortunately, you can't attach slab objects. Their lifetime isn't controlled
by put_page() or folio_put(). kmalloc()/kfree() doesn't refcount them -
they're recycled immediately. Hence why I was copying them. (Well, you
could attach, but then you need a callback mechanism).


What I'm trying to do is make it so that the process of calling sock_sendmsg()
with MSG_SPLICE_PAGES looks exactly the same as without: You fill in a
bio_vec[] pointing to your protocol header, the payload and the trailer,
pointing as appropriate to bits of slab, static, stack data or ref'able pages,
and call sendmsg and then the data will get copied or spliced as appropriate
to the page type, whether the MSG_SPLICE_PAGES flag is supplied and whether
the flag is supported.

There are a couple of things I'd like to avoid: (1) having to call
sock_sendmsg() more than once per message and (2) having sendmsg allocate more
space and make a copy of data that you had to copy into a frag before calling
sendmsg.

David


2023-06-23 02:58:06

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Thu, 22 Jun 2023 23:54:31 +0100 David Howells wrote:
> > Maybe it's just me but I'd prefer to keep the clear rule that splice
> > operates on pages not slab objects.
>
> sendpage isn't only being used for splice(). Or were you referring to
> splicing pages into socket buffers more generally?

Yes, sorry, any sort of "zero-copy attachment of data onto a socket
queue".

> > SIW is the software / fake implementation of RDMA, right? You couldn't have
> > picked a less important user :(
>
> ISCSI and sunrpc could both make use of this, as could ceph and others. I
> have patches for sunrpc to make it condense into a single bio_vec[] and
> sendmsg() in the server code (ie. nfsd) but for the moment, Chuck wanted me to
> just do the xdr payload.

But to be clear (and I'm not implying that it's not a strong enough
reason) - the only benefit from letting someone pass headers in a slab
object is that the code already uses kmalloc(), right? IOW it could be
changed to use frags without much of a LoC bloat?

> > Maybe we can get Eric to comment. The ability to identify "frag type"
> > seems cool indeed, but I haven't thought about using it to attach
> > slab objects.
>
> Unfortunately, you can't attach slab objects. Their lifetime isn't controlled
> by put_page() or folio_put(). kmalloc()/kfree() doesn't refcount them -
> they're recycled immediately. Hence why I was copying them. (Well, you
> could attach, but then you need a callback mechanism).

Right, right, I thought you were saying that _in the future_ we may try
to attach the slab objects as frags (and presumably copy when someone
tries to ref them). Maybe I over-interpreted.

> What I'm trying to do is make it so that the process of calling sock_sendmsg()
> with MSG_SPLICE_PAGES looks exactly the same as without: You fill in a
> bio_vec[] pointing to your protocol header, the payload and the trailer,
> pointing as appropriate to bits of slab, static, stack data or ref'able pages,
> and call sendmsg and then the data will get copied or spliced as appropriate
> to the page type, whether the MSG_SPLICE_PAGES flag is supplied and whether
> the flag is supported.
>
> There are a couple of things I'd like to avoid: (1) having to call
> sock_sendmsg() more than once per message and (2) having sendmsg allocate more
> space and make a copy of data that you had to copy into a frag before calling
> sendmsg.

If we're not planning to attach the slab objects as frags, then surely
doing kmalloc() + free() in the caller, and then allocating a frag and
copying the data over in the skb / socket code is also inefficient.
Fixing the caller gives all the benefits you want, and then some.

Granted some form of alloc_skb_frag() needs to be added so that callers
don't curse us, I'd start with something based on sk_page_frag().
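
Something along these lines, say (a sketch of the sk_page_frag()-based
approach; "data"/"len" and the final frag-attach step are placeholders):

struct page_frag *pfrag = sk_page_frag(sk);

if (!sk_page_frag_refill(sk, pfrag))
	return -ENOMEM;
memcpy(page_address(pfrag->page) + pfrag->offset, data, len);
/* ... attach pfrag->page / pfrag->offset / len to the skb as a frag ... */
pfrag->offset += len;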

Or we could pull the copying out into an intermediate helper which
first replaces all slab objects in the iovec with page frags and then
calls sock_sendmsg()? Maybe that's stupid...

Let's hear what others think. If we can't reach instant agreement --
can you strategically separate out the minimal set of changes required
to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
6.5.

2023-06-23 08:27:37

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 02/18] net: Display info about MSG_SPLICE_PAGES memory handling in proc

On Tue, 2023-06-20 at 15:53 +0100, David Howells wrote:
> Display information about the memory handling MSG_SPLICE_PAGES does to copy
> slabbed data into page fragments.
>
> For each CPU that has a cached folio, it displays the folio pfn, the offset
> pointer within the folio and the size of the folio.
>
> It also displays the number of pages refurbished and the number of pages
> replaced.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Alexander Duyck <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: David Ahern <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: Menglong Dong <[email protected]>
> cc: [email protected]
> ---
> net/core/skbuff.c | 42 +++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 39 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index d962c93a429d..36605510a76d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -83,6 +83,7 @@
> #include <linux/user_namespace.h>
> #include <linux/indirect_call_wrapper.h>
> #include <linux/textsearch.h>
> +#include <linux/proc_fs.h>
>
> #include "dev.h"
> #include "sock_destructor.h"
> @@ -6758,6 +6759,7 @@ nodefer: __kfree_skb(skb);
> struct skb_splice_frag_cache {
> struct folio *folio;
> void *virt;
> + unsigned int fsize;
> unsigned int offset;
> /* we maintain a pagecount bias, so that we dont dirty cache line
> * containing page->_refcount every time we allocate a fragment.
> @@ -6767,6 +6769,26 @@ struct skb_splice_frag_cache {
> };
>
> static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
> +static atomic_t skb_splice_frag_replaced, skb_splice_frag_refurbished;

(in case we don't agree to restrict this series to just remove
MSG_SENDPAGE_NOTLAST)

Have you considered percpu counters instead of the above atomics?

I think the increments are in not-so-unlikely code paths, and the
contention there could possibly hurt performance.
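
For instance (sketch only, reusing the counter names from the patch):

static DEFINE_PER_CPU(unsigned long, skb_splice_frag_replaced);
static DEFINE_PER_CPU(unsigned long, skb_splice_frag_refurbished);

/* hot path: no shared cacheline bouncing */
this_cpu_inc(skb_splice_frag_refurbished);

/* proc output: sum over all CPUs */
unsigned long sum = 0;
int cpu;

for_each_possible_cpu(cpu)
	sum += per_cpu(skb_splice_frag_refurbished, cpu);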

Thanks,

Paolo



2023-06-23 08:49:20

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Hi,

First thing first, I'm sorry for the delayed feedback. I was traveling.

On Tue, 2023-06-20 at 15:53 +0100, David Howells wrote:
> If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> some data that's resident in the slab, copy it rather than returning EIO.
> This can be made use of by a number of drivers in the kernel, including:
> iwarp, ceph/rds, dlm, nvme, ocfs2, drbd. It could also be used by iscsi,
> rxrpc, sunrpc, cifs and probably others.
>
> skb_splice_from_iter() is given its own fragment allocator as
> page_frag_alloc_align() can't be used because it does no locking to prevent
> parallel callers from racing. alloc_skb_frag() uses a separate folio for
> each cpu and locks to the cpu whilst allocating, reenabling cpu migration
> around folio allocation.
>
> This could allocate a whole page instead for each fragment to be copied, as
> alloc_skb_with_frags() would do instead, but that would waste a lot of
> space (most of the fragments look like they're going to be small).
>
> This allows an entire message that consists of, say, a protocol header or
> two, a number of pages of data and a protocol footer to be sent using a
> single call to sock_sendmsg().
>
> The callers could be made to copy the data into fragments before calling
> sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Alexander Duyck <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: David Ahern <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: Menglong Dong <[email protected]>
> cc: [email protected]
> ---
>
> Notes:
> ver #3)
> - Remove duplicate decl of skb_splice_from_iter().
>
> ver #2)
> - Fix parameter to put_cpu_ptr() to have an '&'.
>
> include/linux/skbuff.h | 2 +
> net/core/skbuff.c | 171 ++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 170 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..5f53bd5d375d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -5037,6 +5037,8 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
> #endif
> }
>
> +void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
> +void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
> ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
> ssize_t maxsize, gfp_t gfp);
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index fee2b1c105fe..d962c93a429d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6755,6 +6755,145 @@ nodefer: __kfree_skb(skb);
> smp_call_function_single_async(cpu, &sd->defer_csd);
> }
>
> +struct skb_splice_frag_cache {
> + struct folio *folio;
> + void *virt;
> + unsigned int offset;
> + /* we maintain a pagecount bias, so that we dont dirty cache line
> + * containing page->_refcount every time we allocate a fragment.
> + */
> + unsigned int pagecnt_bias;
> + bool pfmemalloc;
> +};
> +
> +static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
> +
> +/**
> + * alloc_skb_frag - Allocate a page fragment for using in a socket
> + * @fragsz: The size of fragment required
> + * @gfp: Allocation flags
> + */
> +void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
> +{
> + struct skb_splice_frag_cache *cache;
> + struct folio *folio, *spare = NULL;
> + size_t offset, fsize;
> + void *p;
> +
> + if (WARN_ON_ONCE(fragsz == 0))
> + fragsz = 1;
> +
> + cache = get_cpu_ptr(&skb_splice_frag_cache);
> +reload:
> + folio = cache->folio;
> + offset = cache->offset;
> +try_again:
> + if (fragsz > offset)
> + goto insufficient_space;
> +
> + /* Make the allocation. */
> + cache->pagecnt_bias--;
> + offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
> + cache->offset = offset;
> + p = cache->virt + offset;
> + put_cpu_ptr(&skb_splice_frag_cache);
> + if (spare)
> + folio_put(spare);
> + return p;
> +
> +insufficient_space:
> + /* See if we can refurbish the current folio. */
> + if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
> + goto get_new_folio;
> + if (unlikely(cache->pfmemalloc)) {
> + __folio_put(folio);
> + goto get_new_folio;
> + }
> +
> + fsize = folio_size(folio);
> + if (unlikely(fragsz > fsize))
> + goto frag_too_big;
> +
> + /* OK, page count is 0, we can safely set it */
> + folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
> +
> + /* Reset page count bias and offset to start of new frag */
> + cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> + offset = fsize;
> + goto try_again;

IMHO this function uses a bit too much labels and would be more easy to
read, e.g. moving the above chunk of code in conditional branch.

Even without such change, I think the above 'goto try_again;'
introduces an unneeded conditional, as at this point we know 'fragsz <=
fsize'.

> +
> +get_new_folio:
> + if (!spare) {
> + cache->folio = NULL;
> + put_cpu_ptr(&skb_splice_frag_cache);
> +
> +#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
> + spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
> + __GFP_NOMEMALLOC,
> + PAGE_FRAG_CACHE_MAX_ORDER);
> + if (!spare)
> +#endif
> + spare = folio_alloc(gfp, 0);
> + if (!spare)
> + return NULL;
> +
> + cache = get_cpu_ptr(&skb_splice_frag_cache);
> + /* We may now be on a different cpu and/or someone else may
> + * have refilled it
> + */
> + cache->pfmemalloc = folio_is_pfmemalloc(spare);
> + if (cache->folio)
> + goto reload;

I think there is some problem with the above.

If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
while the spare one is, it looks like the wrong policy will be used.
And should be even worse if folio was pfmemalloc-ed while spare is not.

I think moving 'cache->pfmemalloc' initialization...

> + }
> +

... here should fix the above.

> + cache->folio = spare;
> + cache->virt = folio_address(spare);
> + folio = spare;
> + spare = NULL;
> +
> + /* Even if we own the page, we do not use atomic_set(). This would
> + * break get_page_unless_zero() users.
> + */
> + folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
> +
> + /* Reset page count bias and offset to start of new frag */
> + cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> + offset = folio_size(folio);
> + goto try_again;

What if fragsz > PAGE_SIZE, we are consistently unable to allocate an
high order page, but order-0, pfmemalloc-ed page allocation is
successful? It looks like this could become an unbounded loop?

> +
> +frag_too_big:
> + /* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
> + * but the cache isn't big enough to satisfy the request, this may
> + * happen in low memory conditions. We don't release the cache page
> + * because it could make memory pressure worse so we simply return NULL
> + * here.
> + */
> + cache->offset = offset;
> + put_cpu_ptr(&skb_splice_frag_cache);
> + if (spare)
> + folio_put(spare);
> + return NULL;
> +}
> +EXPORT_SYMBOL(alloc_skb_frag);
> +
> +/**
> + * copy_skb_frag - Copy data into a page fragment.
> + * @s: The data to copy
> + * @len: The size of the data
> + * @gfp: Allocation flags
> + */
> +void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
> +{
> + void *p;
> +
> + p = alloc_skb_frag(len, gfp);
> + if (!p)
> + return NULL;
> +
> + return memcpy(p, s, len);
> +}
> +EXPORT_SYMBOL(copy_skb_frag);
> +
> static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
> size_t offset, size_t len)
> {
> @@ -6808,17 +6947,43 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
> break;
> }
>
> + if (space == 0 &&
> + !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
> + pages[0], off)) {
> + iov_iter_revert(iter, len);
> + break;
> + }
> +
> i = 0;
> do {
> struct page *page = pages[i++];
> size_t part = min_t(size_t, PAGE_SIZE - off, len);
> -
> - ret = -EIO;
> - if (WARN_ON_ONCE(!sendpage_ok(page)))
> + bool put = false;
> +
> + if (PageSlab(page)) {

I'm a bit concerned from the above. If I read correctly, tcp 0-copy
will go through that for every page, even if the expected use-case is
always !PageSlab(page). compound_head() could be costly if the head
page is not hot on cache and I'm not sure if that could be the case for
tcp 0-copy. The bottom line is that I fear a possible regression here.

Given all the above, and the late stage of the current devel cycle,
would you consider slicing down this series to just kill
MSG_SENDPAGE_NOTLAST, as Jakub suggested?

Thanks!

Paolo


2023-06-23 09:20:24

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Jakub Kicinski <[email protected]> wrote:

> If we can't reach instant agreement --
> can you strategically separate out the minimal set of changes required
> to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
> 6.5.

Paolo Abeni <[email protected]> wrote:

> Given all the above, and the late stage of the current devel cycle,
> would you consider slicing down this series to just kill
> MSG_SENDPAGE_NOTLAST, as Jakub suggested?

I could do that.

There is also another alternative. I could just push the sendpage wrappers up
the stack into the higher-level callers. Basically this:

int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
struct bio_vec bvec;
struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };

if (flags & MSG_SENDPAGE_NOTLAST)
msg.msg_flags |= MSG_MORE;

bvec_set_page(&bvec, page, size, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
return udp_sendmsg(sk, &msg, size);
}

and kill off sendpage and MSG_SENDPAGE_NOTLAST.

David


2023-06-23 09:30:24

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Paolo Abeni <[email protected]> wrote:

> IMHO this function uses a bit too much labels and would be more easy to
> read, e.g. moving the above chunk of code in conditional branch.

Maybe. I was trying to put the fast path up at the top without the slow path
bits in it, but I can put the "insufficient_space" bit there.

> Even without such change, I think the above 'goto try_again;'
> introduces an unneeded conditional, as at this point we know 'fragsz <=
> fsize'.

Good point.

> > + cache->pfmemalloc = folio_is_pfmemalloc(spare);
> > + if (cache->folio)
> > + goto reload;
>
> I think there is some problem with the above.
>
> If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
> while the spare one is, it looks like the wrong policy will be used.
> And should be even worse if folio was pfmemalloc-ed while spare is not.
>
> I think moving 'cache->pfmemalloc' initialization...
>
> > + }
> > +
>
> ... here should fix the above.

Yeah. We might have raced with someone else or been moved to another cpu and
there might now be a folio we can allocate from.

> > + /* Reset page count bias and offset to start of new frag */
> > + cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> > + offset = folio_size(folio);
> > + goto try_again;
>
> What if fragsz > PAGE_SIZE, we are consistently unable to allocate an
> high order page, but order-0, pfmemalloc-ed page allocation is
> successful? It looks like this could become an unbounded loop?

It shouldn't. It should go:

try_again:
if (fragsz > offset)
goto insufficient_space;
insufficient_space:
/* See if we can refurbish the current folio. */
...
fsize = folio_size(folio);
if (unlikely(fragsz > fsize))
goto frag_too_big;
frag_too_big:
...
return NULL;

Though for safety's sake, it would make sense to put in a size check in the
case we fail to allocate a larger-order folio.
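
Say, something like this after falling back to an order-0 folio (just a
sketch, not the final code):

spare = folio_alloc(gfp, 0);
if (!spare)
	return NULL;
if (fragsz > folio_size(spare)) {
	/* The request can never be satisfied from an order-0 folio. */
	folio_put(spare);
	return NULL;
}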

> > do {
> > struct page *page = pages[i++];
> > size_t part = min_t(size_t, PAGE_SIZE - off, len);
> > -
> > - ret = -EIO;
> > - if (WARN_ON_ONCE(!sendpage_ok(page)))
> > + bool put = false;
> > +
> > + if (PageSlab(page)) {
>
> I'm a bit concerned from the above. If I read correctly, tcp 0-copy

Well, splice()-to-tcp will; MSG_ZEROCOPY is unaffected.

> will go through that for every page, even if the expected use-case is
> always !PageSlab(page). compound_head() could be costly if the head
> page is not hot on cache and I'm not sure if that could be the case for
> tcp 0-copy. The bottom line is that I fear a possible regression here.

I can put the PageSlab() check inside the sendpage_ok() so the page flag is
only checked once. But PageSlab() doesn't check the headpage, only the page
it is given. sendpage_ok() is more the problem as it also calls
page_count(). I could drop the check.

David


2023-06-23 09:44:49

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Fri, 2023-06-23 at 10:06 +0100, David Howells wrote:
> Paolo Abeni <[email protected]> wrote:
>
> > IMHO this function uses a bit too much labels and would be more easy to
> > read, e.g. moving the above chunk of code in conditional branch.
>
> Maybe. I was trying to put the fast path up at the top without the slow path
> bits in it, but I can put the "insufficient_space" bit there.

I *think* you could move the insufficient_space handling into a separate helper;
that should achieve your goal with fewer labels and hopefully no
additional complexity.

>
> > Even without such change, I think the above 'goto try_again;'
> > introduces an unneeded conditional, as at this point we know 'fragsz <=
> > fsize'.
>
> Good point.
>
> > > + cache->pfmemalloc = folio_is_pfmemalloc(spare);
> > > + if (cache->folio)
> > > + goto reload;
> >
> > I think there is some problem with the above.
> >
> > If cache->folio is != NULL, and cache->folio was not pfmemalloc-ed
> > while the spare one is, it looks like the wrong policy will be used.
> > And should be even worse if folio was pfmemalloc-ed while spare is not.
> >
> > I think moving 'cache->pfmemalloc' initialization...
> >
> > > + }
> > > +
> >
> > ... here should fix the above.
>
> Yeah. We might have raced with someone else or been moved to another cpu and
> there might now be a folio we can allocate from.
>
> > > + /* Reset page count bias and offset to start of new frag */
> > > + cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
> > > + offset = folio_size(folio);
> > > + goto try_again;
> >
> > What if fragsz > PAGE_SIZE, we are consistently unable to allocate an
> > high order page, but order-0, pfmemalloc-ed page allocation is
> > successful? It looks like this could become an unbounded loop?
>
> It shouldn't. It should go:
>
> try_again:
> if (fragsz > offset)
> goto insufficient_space;
> insufficient_space:
> /* See if we can refurbish the current folio. */
> ...

I think the critical path is with pfmemalloc-ed pages:

if (unlikely(cache->pfmemalloc)) {
__folio_put(folio);
goto get_new_folio;
}

just before the following.

> fsize = folio_size(folio);
> if (unlikely(fragsz > fsize))
> goto frag_too_big;
> frag_too_big:
> ...
> return NULL;
>
> Though for safety's sake, it would make sense to put in a size check in the
> case we fail to allocate a larger-order folio.
>
> > > do {
> > > struct page *page = pages[i++];
> > > size_t part = min_t(size_t, PAGE_SIZE - off, len);
> > > -
> > > - ret = -EIO;
> > > - if (WARN_ON_ONCE(!sendpage_ok(page)))
> > > + bool put = false;
> > > +
> > > + if (PageSlab(page)) {
> >
> > I'm a bit concerned from the above. If I read correctly, tcp 0-copy
>
> Well, splice()-to-tcp will; MSG_ZEROCOPY is unaffected.

Ah right! I got lost in some 'if' branch.

> > will go through that for every page, even if the expected use-case is
> > always !PageSlab(page). compound_head() could be costly if the head
> > page is not hot on cache and I'm not sure if that could be the case for
> > tcp 0-copy. The bottom line is that I fear a possible regression here.
>
> I can put the PageSlab() check inside the sendpage_ok() so the page flag is
> only checked once.  

Perhaps I'm lost again, but AFAICS:

__PAGEFLAG(Slab, slab, PF_NO_TAIL)

// ...
#define __PAGEFLAG(uname, lname, policy) \
TESTPAGEFLAG(uname, lname, policy) \
// ...

#define TESTPAGEFLAG(uname, lname, policy) \
static __always_inline bool folio_test_##lname(struct folio *folio) \
{ return test_bit(PG_##lname, folio_flags(folio, FOLIO_##policy));} \
static __always_inline int Page##uname(struct page *page) \
{ return test_bit(PG_##lname, &policy(page, 0)->flags); }
// ... 'policy' is PF_NO_TAIL here

#define PF_NO_TAIL(page, enforce) ({ \
VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \
PF_POISONED_CHECK(compound_head(page)); })

It looks at compound_head in the end ?!?

> But PageSlab() doesn't check the headpage, only the page
> it is given. sendpage_ok() is more the problem as it also calls
> page_count(). I could drop the check.

Once the head page is hot on cache due to the previous check, it should
be cheap?

Cheers,

Paolo


2023-06-23 09:50:42

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 02/18] net: Display info about MSG_SPLICE_PAGES memory handling in proc

Paolo Abeni <[email protected]> wrote:

> Have you considered percpu counters instead of the above atomics?

Makes sense, since I've got per-cpu structs anyway.

David


2023-06-23 10:05:11

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Fri, 2023-06-23 at 10:08 +0100, David Howells wrote:
> Jakub Kicinski <[email protected]> wrote:
>
> > If we can't reach instant agreement --
> > can you strategically separate out the minimal set of changes required
> > to just kill MSG_SENDPAGE_NOTLAST. IMHO it's worth getting that into
> > 6.5.
>
> Paolo Abeni <[email protected]> wrote:
>
> > Given all the above, and the late stage of the current devel cycle,
> > would you consider slicing down this series to just kill
> > MSG_SENDPAGE_NOTLAST, as Jakub suggested?
>
> I could do that.
>
> There is also another alternative. I could just push the sendpage wrappers up
> the stack into the higher-level callers. Basically this:
>
> int udp_sendpage(struct sock *sk, struct page *page, int offset,
> size_t size, int flags)
> {
> struct bio_vec bvec;
> struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };
>
> if (flags & MSG_SENDPAGE_NOTLAST)
> msg.msg_flags |= MSG_MORE;
>
> bvec_set_page(&bvec, page, size, offset);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
> return udp_sendmsg(sk, &msg, size);
> }
>
> and kill off sendpage and MSG_SENDPAGE_NOTLAST.

I'm unsure I follow the above ?!? I *thought* sendpage could be killed
even without patch 1/18 and 2/18, leaving some patches in this series
unmodified, and mangling those explicitly leveraging 1/18 to use
multiple sendmsg()s with different flags?

I haven't tried to code the above, but my wild guess/hope is that the
delta should be doable - ideally less than the other option.

Introducing slab support should still be possible later, with hopefully
less work.

Cheers,

Paolo



2023-06-23 10:18:49

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Paolo Abeni <[email protected]> wrote:

> > Maybe. I was trying to put the fast path up at the top without the slow path
> > bits in it, but I can put the "insufficient_space" bit there.
>
> I *think* you could move the insufficient_space handling into a separate helper;
> that should achieve your goal with fewer labels and hopefully no
> additional complexity.

It would require moving things in and out of variables more, but that's
probably fine in the slow path. The code I derived this from seems to do its
best only to touch memory when it absolutely has to. But doing what you
suggest would certainly make this more readable, I think.

> > > What if fragsz > PAGE_SIZE, we are consistently unable to allocate an
> > > high order page, but order-0, pfmemalloc-ed page allocation is
> > > successful? It looks like this could become an unbounded loop?
> >
> > It shouldn't. It should go:
> >
> > try_again:
> > if (fragsz > offset)
> > goto insufficient_space;
> > insufficient_space:
> > /* See if we can refurbish the current folio. */
> > ...
>
> I think the critical path is with pfmemalloc-ed pages:
>
> if (unlikely(cache->pfmemalloc)) {
> __folio_put(folio);
> goto get_new_folio;
> }

I see what you're getting at. I was thinking that you meant that the critical
bit was that we got into a loop because we never managed to allocate a folio
big enough.

Inserting a check in the event that we fail to allocate an order-3 folio would
take care of that, I think. After that point, we have a spare folio of
sufficient capacity, even if the folio currently in residence is marked
pfmemalloc.

> > > will go through that for every page, even if the expected use-case is
> > > always !PageSlab(page). compound_head() could be costly if the head
> > > page is not hot on cache and I'm not sure if that could be the case for
> > > tcp 0-copy. The bottom line is that I fear a possible regression here.
> >
> > I can put the PageSlab() check inside the sendpage_ok() so the page flag is
> > only checked once.  
>
> Perhaps I'm lost again, but AFAICS:
>
> __PAGEFLAG(Slab, slab, PF_NO_TAIL)
> ...
> PF_POISONED_CHECK(compound_head(page)); })
>
> It looks at compound_head in the end ?!?

Fair point. Those macros are somewhat hard to read. Hopefully, all the
compound_head() calls will go away soon-ish.


2023-06-23 10:24:10

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

Paolo Abeni <[email protected]> wrote:

> I'm unsure I follow the above ?!? I *thought* sendpage could be killed
> even without patch 1/18 and 2/18, leaving some patches in this series
> unmodified, and mangling those explicitly leveraging 1/18 to use
> multiple sendmsg()s with different flags?

That's what I meant.

With the example, I was showing the minimum needed replacement for a call to
sendpage.

David


2023-06-23 10:34:40

by Paolo Abeni

[permalink] [raw]
Subject: Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

On Fri, 2023-06-23 at 11:06 +0100, David Howells wrote:
> Paolo Abeni <[email protected]> wrote:
>
> > I'm unsure I follow the above ?!? I *thought* sendpage could be killed
> > even without patch 1/18 and 2/18, leaving some patches in this series
> > unmodified, and mangling those explicitly leveraging 1/18 to use
> > multiple sendmsg()s with different flags?
>
> That's what I meant.
>
> With the example, I was showing the minimum needed replacement for a call to
> sendpage.

Thanks, LGTM!

Thanks,

Paolo


2023-06-29 15:06:14

by Aurelien Aptel

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

Hi David,

David Howells <[email protected]> writes:
> When transmitting data, call down into TCP using a single sendmsg with
> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
> performing several sendmsg and sendpage calls to transmit header, data
> pages and trailer.

This series makes my kernel crash.

From the current net-next main branch:

commit 9ae440b8fdd6772b6c007fa3d3766530a09c9045 (HEAD)
Merge: b545a13ca9b2 b848b26c6672
Author: Jakub Kicinski <[email protected]>
Date: Sat Jun 24 15:50:21 2023 -0700

Merge branch 'splice-net-switch-over-users-of-sendpage-and-remove-it'


Steps to reproduce:

* connect a remote nvme null block device (nvmet) with 1 IO queue to keep
things simple
* open /dev/nvme0n1 with O_RDWR|O_DIRECT|O_SYNC
* write() a 8k buffer or 4k buffer

Trace:

[ 311.766163] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 311.768136] #PF: supervisor read access in kernel mode
[ 311.769327] #PF: error_code(0x0000) - not-present page
[ 311.770393] PGD 148988067 P4D 0
[ 311.771074] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 311.771978] CPU: 0 PID: 180 Comm: kworker/0:1H Not tainted 6.4.0-rc7+ #27
[ 311.773380] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[ 311.774808] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[ 311.775547] RIP: 0010:skb_splice_from_iter+0xf1/0x370
[ 311.776176] Code: 8b 45 88 4d 89 fa 4d 89 e7 45 89 ec 44 89 e3 41 83
c4 01 83 fb 07 0f 87 56 02 00 00 48 8b 5c dd 90 41 bd 00 10 00 00 49 29
c5 <48> 8b 53 08 4d 39 f5 4d 0f 47 ee f6 c2 01 0f 85 c7 01 00 00 66 90
[ 311.778472] RSP: 0018:ff633e24c0747b08 EFLAGS: 00010206
[ 311.779115] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[ 311.780007] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff633e24c0747d30
[ 311.780861] RBP: ff633e24c0747bb0 R08: ff633e24c0747d40 R09: 000000006db29140
[ 311.781748] R10: ff3001bd00a22800 R11: 0000000008000000 R12: 0000000000000001
[ 311.782631] R13: 0000000000001000 R14: 0000000000001000 R15: 0000000000000000
[ 311.783506] FS: 0000000000000000(0000) GS:ff3001be77800000(0000) knlGS:0000000000000000
[ 311.784494] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 311.785197] CR2: 0000000000000008 CR3: 0000000107f5c001 CR4: 0000000000771ef0
[ 311.786076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 311.786948] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 311.787822] PKRU: 55555554
[ 311.788165] Call Trace:
[ 311.788480] <TASK>
[ 311.788756] ? show_regs+0x6e/0x80
[ 311.789189] ? __die+0x29/0x70
[ 311.789577] ? page_fault_oops+0x154/0x4a0
[ 311.790097] ? ip_output+0x7c/0x110
[ 311.790541] ? __sys_socketpair+0x1b4/0x280
[ 311.791065] ? __pfx_ip_finish_output+0x10/0x10
[ 311.791640] ? do_user_addr_fault+0x360/0x770
[ 311.792184] ? exc_page_fault+0x7d/0x190
[ 311.792677] ? asm_exc_page_fault+0x2b/0x30
[ 311.793198] ? skb_splice_from_iter+0xf1/0x370
[ 311.793748] ? skb_splice_from_iter+0xb7/0x370
[ 311.794312] ? __sk_mem_schedule+0x34/0x50
[ 311.794824] tcp_sendmsg_locked+0x3a6/0xdd0
[ 311.795344] ? tcp_push+0x10c/0x120
[ 311.795789] tcp_sendmsg+0x31/0x50
[ 311.796213] inet_sendmsg+0x47/0x80
[ 311.796655] sock_sendmsg+0x99/0xb0
[ 311.797095] ? inet_sendmsg+0x47/0x80
[ 311.797557] nvme_tcp_try_send_data+0x149/0x490 [nvme_tcp]
[ 311.798242] ? kvm_clock_get_cycles+0xd/0x20
[ 311.799181] nvme_tcp_try_send+0x1b7/0x300 [nvme_tcp]
[ 311.800133] nvme_tcp_io_work+0x40/0xc0 [nvme_tcp]
[ 311.801044] process_one_work+0x21c/0x430
[ 311.801847] worker_thread+0x54/0x3e0
[ 311.802611] ? __pfx_worker_thread+0x10/0x10
[ 311.803433] kthread+0xf8/0x130
[ 311.804116] ? __pfx_kthread+0x10/0x10
[ 311.804865] ret_from_fork+0x29/0x50
[ 311.805596] </TASK>
[ 311.806165] Modules linked in: mlx5_ib ib_uverbs ib_core nvme_tcp
mlx5_core mlxfw psample pci_hyperv_intf rpcsec_gss_krb5 nfsv3
auth_rpcgss nfs_acl nfsv4 nfs lockd grace fscache netfs nvme_fabrics
nvme_core nvme_common intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common nfit kvm_intel kvm rapl input_leds
serio_raw sunrpc binfmt_misc qemu_fw_cfg sch_fq_codel dm_multipath
scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon
efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear
hid_generic usbhid hid qxl drm_ttm_helper ttm crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel drm_kms_helper sha512_ssse3
syscopyarea sysfillrect sysimgblt aesni_intel crypto_simd i2c_i801 ahci
cryptd psmouse drm virtio_net i2c_smbus libahci lpc_ich net_failover
xhci_pci virtio_blk failover xhci_pci_renesas [last unloaded: ib_core]
[ 311.818698] CR2: 0000000000000008
[ 311.819437] ---[ end trace 0000000000000000 ]---

Cheers,

2023-06-29 15:23:45

by Aurelien Aptel

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

Sagi Grimberg <[email protected]> writes:
> Most likely this also reproduces with blktests?
> https://github.com/osandov/blktests
>
> simple way to check is to run:
> nvme_trtype=tcp ./check nvme
>
> This runs nvme tests over nvme-tcp.

Yes, it crashes similarly during test 10:

root@ktest:~/blktests# nvme_trtype=tcp ./check nvme
nvme/002 (create many subsystems and test discovery) [not run]
nvme_trtype=tcp is not supported in this test
nvme/003 (test if we're sending keep-alives to a discovery controller) [passed]
runtime ... 10.389s
nvme/004 (test nvme and nvmet UUID NS descriptors) [passed]
runtime ... 1.264s
nvme/005 (reset local loopback target) [passed]
runtime ... 1.403s
nvme/006 (create an NVMeOF target with a block device-backed ns) [passed]
runtime ... 0.129s
nvme/007 (create an NVMeOF target with a file-backed ns) [passed]
runtime ... 0.083s
nvme/008 (create an NVMeOF host with a block device-backed ns) [passed]
runtime ... 1.239s
nvme/009 (create an NVMeOF host with a file-backed ns) [passed]
runtime ... 1.229s
nvme/010 (run data verification fio job on NVMeOF block device-backed
ns)

Same trace, null ptr deref at skb_splice_from_iter+0x10a/0x370

Cheers,

2023-06-29 15:29:32

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage


> Hi David,
>
> David Howells <[email protected]> writes:
>> When transmitting data, call down into TCP using a single sendmsg with
>> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
>> performing several sendmsg and sendpage calls to transmit header, data
>> pages and trailer.
>
> This series makes my kernel crash.
>
> From the current net-next main branch:
>
> commit 9ae440b8fdd6772b6c007fa3d3766530a09c9045 (HEAD)
> Merge: b545a13ca9b2 b848b26c6672
> Author: Jakub Kicinski <[email protected]>
> Date: Sat Jun 24 15:50:21 2023 -0700
>
> Merge branch 'splice-net-switch-over-users-of-sendpage-and-remove-it'
>
>
> Steps to reproduce:
>
> * connect a remote nvme null block device (nvmet) with 1 IO queue to keep
> things simple
> * open /dev/nvme0n1 with O_RDWR|O_DIRECT|O_SYNC
> * write() a 8k buffer or 4k buffer

Most likely this also reproduces with blktests?
https://github.com/osandov/blktests

simple way to check is to run:
nvme_trtype=tcp ./check nvme

This runs nvme tests over nvme-tcp.

No need for network, disk or anything. It runs
both nvme and nvmet over the lo device..

2023-06-29 21:55:09

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

Sagi Grimberg <[email protected]> wrote:

> simple way to check is to run:
> nvme_trtype=tcp ./check nvme

It says a lot of:

nvme/002 (create many subsystems and test discovery) [not run]
nvme is not available
nvme_trtype=tcp is not supported in this test
nvme/003 (test if we're sending keep-alives to a discovery controller) [not run]
nvme is not available
nvme/004 (test nvme and nvmet UUID NS descriptors) [not run]
nvme is not available
nvme/005 (reset local loopback target) [not run]
nvme is not available
...

I have the following NVMe config:

# NVME Support
CONFIG_NVME_COMMON=y
CONFIG_NVME_CORE=y
CONFIG_BLK_DEV_NVME=y
CONFIG_NVME_MULTIPATH=y
# CONFIG_NVME_VERBOSE_ERRORS is not set
# CONFIG_NVME_HWMON is not set
CONFIG_NVME_FABRICS=y
# CONFIG_NVME_RDMA is not set
# CONFIG_NVME_FC is not set
CONFIG_NVME_TCP=y
CONFIG_NVME_AUTH=y
CONFIG_NVME_TARGET=y
CONFIG_NVME_TARGET_PASSTHRU=y
CONFIG_NVME_TARGET_LOOP=y
# CONFIG_NVME_TARGET_RDMA is not set
# CONFIG_NVME_TARGET_FC is not set
CONFIG_NVME_TARGET_TCP=y
CONFIG_NVME_TARGET_AUTH=y
# end of NVME Support
CONFIG_RTC_NVMEM=y
CONFIG_NVMEM=y
# CONFIG_NVMEM_SYSFS is not set
# CONFIG_NVMEM_LAYOUT_SL28_VPD is not set
# CONFIG_NVMEM_LAYOUT_ONIE_TLV is not set
# CONFIG_NVMEM_RMEM is not set

Can you tell me what I'm missing?

David


2023-06-29 22:10:13

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage


> Sagi Grimberg <[email protected]> wrote:
>
>> simple way to check is to run:
>> nvme_trtype=tcp ./check nvme
>
> It says a lot of:
>
> nvme/002 (create many subsystems and test discovery) [not run]
> nvme is not available
> nvme_trtype=tcp is not supported in this test
> nvme/003 (test if we're sending keep-alives to a discovery controller) [not run]
> nvme is not available
> nvme/004 (test nvme and nvmet UUID NS descriptors) [not run]
> nvme is not available
> nvme/005 (reset local loopback target) [not run]
> nvme is not available
> ...
>
> I have the following NVMe config:
>
> # NVME Support
> CONFIG_NVME_COMMON=y
> CONFIG_NVME_CORE=y
> CONFIG_BLK_DEV_NVME=y
> CONFIG_NVME_MULTIPATH=y
> # CONFIG_NVME_VERBOSE_ERRORS is not set
> # CONFIG_NVME_HWMON is not set
> CONFIG_NVME_FABRICS=y
> # CONFIG_NVME_RDMA is not set
> # CONFIG_NVME_FC is not set
> CONFIG_NVME_TCP=y
> CONFIG_NVME_AUTH=y
> CONFIG_NVME_TARGET=y
> CONFIG_NVME_TARGET_PASSTHRU=y
> CONFIG_NVME_TARGET_LOOP=y
> # CONFIG_NVME_TARGET_RDMA is not set
> # CONFIG_NVME_TARGET_FC is not set
> CONFIG_NVME_TARGET_TCP=y
> CONFIG_NVME_TARGET_AUTH=y
> # end of NVME Support
> CONFIG_RTC_NVMEM=y
> CONFIG_NVMEM=y
> # CONFIG_NVMEM_SYSFS is not set
> # CONFIG_NVMEM_LAYOUT_SL28_VPD is not set
> # CONFIG_NVMEM_LAYOUT_ONIE_TLV is not set
> # CONFIG_NVMEM_RMEM is not set
>
> Can you tell me what I'm missing?

install nvme-cli.

nvme/010 is enough to reproduce I think:
nvme_trtype=tcp ./check nvme/010

2023-06-29 22:16:01

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

Meh:

if (!sendpage_ok(page))
- msg.msg_flags &= ~MSG_SPLICE_PAGES,
+ msg.msg_flags &= ~MSG_SPLICE_PAGES;

bvec_set_page(&bvec, page, len, offset);

David
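
To see why this crashed: the stray comma makes bvec_set_page() part of the
if-body, so it only ran when sendpage_ok() failed; in the common case the
bvec handed down to sendmsg() was never initialised, which fits the null
dereference reported in skb_splice_from_iter(). A minimal userspace sketch
of the same parse, with placeholder helpers cond_ok() and init_step()
standing in for the driver's functions:

#include <stdio.h>

static int cond_ok(void) { return 1; }       /* stands in for sendpage_ok() */
static void init_step(int *v) { *v = 42; }   /* stands in for bvec_set_page() */

int main(void)
{
	int flags = 0xff, value = -1;

	if (!cond_ok())
		flags &= ~0x10,     /* comma operator: the call below joins the if-body */

	init_step(&value);          /* skipped whenever cond_ok() returns true */

	printf("flags=%#x value=%d\n", flags, value);    /* prints value=-1 */
	return 0;
}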


2023-06-29 23:57:39

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

On Thu, 29 Jun 2023 22:34:59 +0100 David Howells wrote:
> if (!sendpage_ok(page))
> - msg.msg_flags &= ~MSG_SPLICE_PAGES,
> + msg.msg_flags &= ~MSG_SPLICE_PAGES;

🤦

Let me CC llvm@ in case someone there is willing to make
the compiler warn about this.

2023-06-30 16:31:24

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

On Fri, 30 Jun 2023 09:10:43 -0700 Nathan Chancellor wrote:
> > Let me CC llvm@ in case someone there is willing to make
> > the compiler warn about this.
>
> Turns out clang already has a warning for this, -Wcomma:
>
> drivers/nvme/host/tcp.c:1017:38: error: possible misuse of comma operator here [-Werror,-Wcomma]
> 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> | ^
> drivers/nvme/host/tcp.c:1017:4: note: cast expression to void to silence warning
> 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> | (void)( )
> 1 error generated.
>
> Let me do some wider build testing to see if it is viable to turn this
> on for the whole kernel because it seems worth it, at least in this
> case. There are a lot of cases where a warning won't be emitted (see the
> original upstream review for a list: https://reviews.llvm.org/D3976) but
> something is better than nothing, right? :)

Ah, neat. Misleading indentation is another possible angle, I reckon,
but not sure if that's enabled/possible to enable for the entire kernel
either :( We test-build with W=1 in networking, FWIW, so W=1 would be
enough for us.

2023-06-30 16:43:58

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

On Thu, Jun 29, 2023 at 04:43:18PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Jun 2023 22:34:59 +0100 David Howells wrote:
> > if (!sendpage_ok(page))
> > - msg.msg_flags &= ~MSG_SPLICE_PAGES,
> > + msg.msg_flags &= ~MSG_SPLICE_PAGES;
>
> ????️
>
> Let me CC llvm@ in case someone there is willing to make
> the compiler warn about this.
>

Turns out clang already has a warning for this, -Wcomma:

drivers/nvme/host/tcp.c:1017:38: error: possible misuse of comma operator here [-Werror,-Wcomma]
1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
| ^
drivers/nvme/host/tcp.c:1017:4: note: cast expression to void to silence warning
1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| (void)( )
1 error generated.

Let me do some wider build testing to see if it is viable to turn this
on for the whole kernel because it seems worth it, at least in this
case. There are a lot of cases where a warning won't be emitted (see the
original upstream review for a list: https://reviews.llvm.org/D3976) but
something is better than nothing, right? :)

Cheers,
Nathan
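
As a quick sketch of what -Wcomma reacts to and of the diagnostic's
suggested fix-it (not kernel code): a compound assignment as the left
operand of a comma triggers the warning, and casting that operand to void
marks the comma as intentional and silences it.

/* compile with: clang -Wcomma -c demo.c */
int demo(int a)
{
	int flags = 0xff;

	if (a)
		flags &= ~0x10,            /* -Wcomma: possible misuse of comma operator */
		a = 0;

	if (a)
		(void)(flags &= ~0x10),    /* cast to void: warning silenced */
		a = 0;

	return flags + a;
}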

2023-06-30 19:31:03

by Nathan Chancellor

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

On Fri, Jun 30, 2023 at 09:14:42AM -0700, Jakub Kicinski wrote:
> On Fri, 30 Jun 2023 09:10:43 -0700 Nathan Chancellor wrote:
> > > Let me CC llvm@ in case someone there is willing to make
> > > the compiler warn about this.
> >
> > Turns out clang already has a warning for this, -Wcomma:
> >
> > drivers/nvme/host/tcp.c:1017:38: error: possible misuse of comma operator here [-Werror,-Wcomma]
> > 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> > | ^
> > drivers/nvme/host/tcp.c:1017:4: note: cast expression to void to silence warning
> > 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > | (void)( )
> > 1 error generated.
> >
> > Let me do some wider build testing to see if it is viable to turn this
> > on for the whole kernel because it seems worth it, at least in this
> > case. There are a lot of cases where a warning won't be emitted (see the
> > original upstream review for a list: https://reviews.llvm.org/D3976) but
> > something is better than nothing, right? :)

Well, that was a pipe dream :/ In ARCH=arm multi_v7_defconfig alone,
there are 289 unique instances of the warning (although a good number
have multiple instances per line, so it is not quite as bad as it seems,
but still bad):

$ rg -- -Wcomma arm-multi_v7_defconfig.log | sort | uniq -c | wc -l
289

https://gist.github.com/nathanchance/907867e0a7adffc877fd39fd08853801

Probably not a good sign for the signal-to-noise ratio; I looked through
a good handful and all the cases I saw were not interesting... Perhaps
the warning could be tuned further to become useful for the kernel but
in its current form, it is definitely a no-go :/

> Ah, neat. Misleading indentation is another possible angle, I reckon,
> but not sure if that's enabled/possible to enable for the entire kernel

Yeah, I was surprised there was no warning for misleading indentation...
it is a part of -Wall for both clang and GCC, so it is on for the
kernel, it just appears not to trigger in this case.

> either :( We test-build with W=1 in networking, FWIW, so W=1 would be
> enough for us.

Unfortunately, even in its current form, it is way too noisy for W=1, as
the qualifier for W=1 is "do not occur too often". Probably could be
placed under W=2 but it still has the problem of wading through every
instance and it is basically a no-op because nobody tests with W=2.

Cheers,
Nathan

2023-07-07 21:10:25

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH net-next v3 10/18] nvme/host: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage

On Fri, Jun 30, 2023 at 12:28 PM Nathan Chancellor <[email protected]> wrote:
>
> On Fri, Jun 30, 2023 at 09:14:42AM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Jun 2023 09:10:43 -0700 Nathan Chancellor wrote:
> > > > Let me CC llvm@ in case someone there is willing to make
> > > > the compiler warn about this.
> > >
> > > Turns out clang already has a warning for this, -Wcomma:
> > >
> > > drivers/nvme/host/tcp.c:1017:38: error: possible misuse of comma operator here [-Werror,-Wcomma]
> > > 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> > > | ^
> > > drivers/nvme/host/tcp.c:1017:4: note: cast expression to void to silence warning
> > > 1017 | msg.msg_flags &= ~MSG_SPLICE_PAGES,
> > > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > | (void)( )
> > > 1 error generated.
> > >
> > > Let me do some wider build testing to see if it is viable to turn this
> > > on for the whole kernel because it seems worth it, at least in this
> > > case. There are a lot of cases where a warning won't be emitted (see the
> > > original upstream review for a list: https://reviews.llvm.org/D3976) but
> > > something is better than nothing, right? :)
>
> Well, that was a pipe dream :/ In ARCH=arm multi_v7_defconfig alone,
> there are 289 unique instances of the warning (although a good number
> have multiple instances per line, so it is not quite as bad as it seems,
> but still bad):
>
> $ rg -- -Wcomma arm-multi_v7_defconfig.log | sort | uniq -c | wc -l
> 289
>
> https://gist.github.com/nathanchance/907867e0a7adffc877fd39fd08853801

It's definitely interesting to take a look at some of these cases.
Some are pretty funny IMO.

>
> Probably not a good sign for the signal-to-noise ratio; I looked through
> a good handful and all the cases I saw were not interesting... Perhaps
> the warning could be tuned further to become useful for the kernel but
> in its current form, it is definitely a no-go :/
>
> > Ah, neat. Misleading indentation is another possible angle, I reckon,
> > but not sure if that's enabled/possible to enable for the entire kernel
>
> Yeah, I was surprised there was no warning for misleading indentation...
> it is a part of -Wall for both clang and GCC, so it is on for the
> kernel, it just appears not to trigger in this case.
>
> > either :( We test-build with W=1 in networking, FWIW, so W=1 would be
> > enough for us.
>
> Unfortunately, even in its current form, it is way too noisy for W=1, as
> the qualifier for W=1 is "do not occur too often". Probably could be
> placed under W=2 but it still has the problem of wading through every
> instance and it is basically a no-op because nobody tests with W=2.
>
> Cheers,
> Nathan
>


--
Thanks,
~Nick Desaulniers