Here's the final set of patches towards the removal of sendpage. All the
drivers that use sendpage() get switched over to using sendmsg() with
MSG_SPLICE_PAGES.
skb_splice_from_iter() is given the facility to copy slab data() into
fragments - or to coalesce them (in future) with other unspliced buffers in
the target skbuff. This means that the caller can work the same way, no
matter whether MSG_SPLICE_PAGES is supplied or not and no matter if the
protocol just ignores it. If MSG_SPLICE_PAGES is not supplied or if it is
just ignored, the data will get copied as normal rather than being spliced.
For the moment, skb_splice_from_iter() is equipped with its own fragment
allocator - one that has percpu pages to allocate from to deal with
parallel callers but that can also drop the percpu lock around calling the
page allocator.
The following changes are made:
(1) Introduce an SMP-safe shared fragment allocator and make
skb_splice_from_iter() use it. The allocator is exported so that
ocfs2 can use it.
This now doesn't alter the existing page_frag_cache allocator.
(2) Expose information from the allocator in /proc/. This is useful for
debugging it, but could be dropped.
(3) Make the protocol drivers behave according to MSG_MORE, not
MSG_SENDPAGE_NOTLAST. The latter is restricted to turning on MSG_MORE
in the sendpage() wrappers.
(4) Make siw, ceph/rds, skb_send_sock, dlm, nvme, smc, ocfs2, drbd and
iscsi use sendmsg(), not sendpage and make them specify MSG_MORE
instead of MSG_SENDPAGE_NOTLAST.
ocfs2 now allocates fragments for a couple of cases where it would
otherwise pass in a pointer to shared data that doesn't seem to
sufficient locking.
(5) Make drbd coalesce its entire message into a single sendmsg().
(6) Kill off sendpage and clean up MSG_SENDPAGE_NOTLAST.
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3-frag
David
Changes
=======
ver #2)
- Wrapped some lines at 80.
- Fixed parameter to put_cpu_ptr() to have an '&'.
- Use "unsigned int" rather than "unsigned".
- Removed duplicate word in comment.
- Filled in the commit message on the last patch.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1
Link: https://lore.kernel.org/r/[email protected]/ # v1
David Howells (17):
net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
net: Display info about MSG_SPLICE_PAGES memory handling in proc
tcp_bpf, smc, tls, espintcp: Reduce MSG_SENDPAGE_NOTLAST usage
siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
nvme: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES
ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drdb: Send an entire bio in a single sendmsg
iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
net: Kill MSG_SENDPAGE_NOTLAST
Documentation/bpf/map_sockmap.rst | 10 +-
Documentation/filesystems/locking.rst | 2 -
Documentation/filesystems/vfs.rst | 1 -
Documentation/networking/scaling.rst | 4 +-
crypto/af_alg.c | 28 --
crypto/algif_aead.c | 22 +-
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 14 -
drivers/block/drbd/drbd_main.c | 88 +++---
drivers/infiniband/sw/siw/siw_qp_tx.c | 231 +++-------------
.../chelsio/inline_crypto/chtls/chtls.h | 2 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 14 -
.../chelsio/inline_crypto/chtls/chtls_main.c | 1 -
drivers/nvme/host/tcp.c | 46 ++--
drivers/nvme/target/tcp.c | 46 ++--
drivers/scsi/iscsi_tcp.c | 26 +-
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 15 +-
fs/dlm/lowcomms.c | 10 +-
fs/nfsd/vfs.c | 2 +-
fs/ocfs2/cluster/tcp.c | 109 ++++----
include/crypto/if_alg.h | 2 -
include/linux/net.h | 8 -
include/linux/skbuff.h | 5 +
include/linux/socket.h | 4 +-
include/net/inet_common.h | 2 -
include/net/sock.h | 6 -
include/net/tcp.h | 4 -
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/ceph/messenger_v1.c | 58 ++--
net/ceph/messenger_v2.c | 91 ++-----
net/core/skbuff.c | 257 ++++++++++++++++--
net/core/sock.c | 35 +--
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 --
net/ipv4/tcp.c | 43 +--
net/ipv4/tcp_bpf.c | 30 +-
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 15 -
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/kcm/kcmsock.c | 20 --
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rds/tcp_send.c | 74 ++---
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/smc/af_smc.c | 29 --
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 20 +-
net/smc/smc_tx.h | 2 -
net/socket.c | 48 ----
net/tipc/socket.c | 3 -
net/tls/tls.h | 6 -
net/tls/tls_device.c | 24 +-
net/tls/tls_main.c | 9 +-
net/tls/tls_sw.c | 37 +--
net/unix/af_unix.c | 19 --
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
net/xfrm/espintcp.c | 10 +-
.../perf/trace/beauty/include/linux/socket.h | 1 -
tools/perf/trace/beauty/msg_flags.c | 3 -
89 files changed, 547 insertions(+), 1063 deletions(-)
As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't
use it further in than the sendpage methods, but rather translate it to
MSG_MORE and use that instead.
Signed-off-by: David Howells <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Karsten Graul <[email protected]>
cc: Wenjia Zhang <[email protected]>
cc: Jan Karcher <[email protected]>
cc: "D. Wythe" <[email protected]>
cc: Tony Lu <[email protected]>
cc: Wen Gu <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: Steffen Klassert <[email protected]>
cc: Herbert Xu <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
net/ipv4/tcp_bpf.c | 3 ---
net/smc/smc_tx.c | 6 ++++--
net/tls/tls_device.c | 4 ++--
net/xfrm/espintcp.c | 10 ++++++----
4 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5a84053ac62b..adcba77b0c50 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -111,9 +111,6 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
if (has_tx_ulp)
msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;
- if (flags & MSG_SENDPAGE_NOTLAST)
- msghdr.msg_flags |= MSG_MORE;
-
bvec_set_page(&bvec, page, size, off);
iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
ret = tcp_sendmsg_locked(sk, &msghdr, size);
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 45128443f1f1..9b9e0a190734 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -168,8 +168,7 @@ static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
* should known how/when to uncork it.
*/
if ((msg->msg_flags & MSG_MORE ||
- smc_tx_is_corked(smc) ||
- msg->msg_flags & MSG_SENDPAGE_NOTLAST) &&
+ smc_tx_is_corked(smc)) &&
atomic_read(&conn->sndbuf_space))
return true;
@@ -306,6 +305,9 @@ int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
struct kvec iov;
int rc;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
+
iov.iov_base = kaddr + offset;
iov.iov_len = size;
iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b82770f68807..975299d7213b 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -449,7 +449,7 @@ static int tls_push_data(struct sock *sk,
return -sk->sk_err;
flags |= MSG_SENDPAGE_DECRYPTED;
- tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST;
+ tls_push_record_flags = flags | MSG_MORE;
timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
if (tls_is_partially_sent_record(tls_ctx)) {
@@ -532,7 +532,7 @@ static int tls_push_data(struct sock *sk,
if (!size) {
last_record:
tls_push_record_flags = flags;
- if (flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE)) {
+ if (flags & MSG_MORE) {
more = true;
break;
}
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 3504925babdb..d3b3f9e720b3 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,13 +205,15 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
static int espintcp_sendskmsg_locked(struct sock *sk,
struct espintcp_msg *emsg, int flags)
{
- struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+ struct msghdr msghdr = {
+ .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE,
+ };
struct sk_msg *skmsg = &emsg->skmsg;
+ bool more = flags & MSG_MORE;
struct scatterlist *sg;
int done = 0;
int ret;
- msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
sg = &skmsg->sg.data[skmsg->sg.start];
do {
struct bio_vec bvec;
@@ -221,8 +223,8 @@ static int espintcp_sendskmsg_locked(struct sock *sk,
emsg->offset = 0;
- if (sg_is_last(sg))
- msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
+ if (sg_is_last(sg) && !more)
+ msghdr.msg_flags &= ~MSG_MORE;
p = sg_page(sg);
retry:
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage. This allows
multiple pages and multipage folios to be passed through.
TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array for the
entire set of pages it's going to transfer plus two for the header and
trailer and page fragments to hold the header and trailer - and then call
sendmsg once for the entire message.
Signed-off-by: David Howells <[email protected]>
cc: Lee Duncan <[email protected]>
cc: Chris Leech <[email protected]>
cc: Mike Christie <[email protected]>
cc: Maurizio Lombardi <[email protected]>
cc: "James E.J. Bottomley" <[email protected]>
cc: "Martin K. Petersen" <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Al Viro <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
Notes:
ver #2)
- Wrap lines at 80.
drivers/scsi/iscsi_tcp.c | 26 +++++++++---------------
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 15 ++++++++------
3 files changed, 20 insertions(+), 23 deletions(-)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 9637d4bc2bc9..9ab8555180a3 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -301,35 +301,32 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
while (!iscsi_tcp_segment_done(tcp_conn, segment, 0, r)) {
struct scatterlist *sg;
+ struct msghdr msg = {};
+ struct bio_vec bv;
unsigned int offset, copy;
- int flags = 0;
r = 0;
offset = segment->copied;
copy = segment->size - offset;
if (segment->total_copied + segment->size < segment->total_size)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;
if (tcp_sw_conn->queue_recv)
- flags |= MSG_DONTWAIT;
+ msg.msg_flags |= MSG_DONTWAIT;
- /* Use sendpage if we can; else fall back to sendmsg */
if (!segment->data) {
+ if (!tcp_conn->iscsi_conn->datadgst_en)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
sg = segment->sg;
offset += segment->sg_offset + sg->offset;
- r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
- copy, flags);
+ bvec_set_page(&bv, sg_page(sg), copy, offset);
} else {
- struct msghdr msg = { .msg_flags = flags };
- struct kvec iov = {
- .iov_base = segment->data + offset,
- .iov_len = copy
- };
-
- r = kernel_sendmsg(sk, &msg, &iov, 1, copy);
+ bvec_set_virt(&bv, segment->data + offset, copy);
}
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, copy);
+ r = sock_sendmsg(sk, &msg);
if (r < 0) {
iscsi_tcp_segment_unmap(segment);
return r;
@@ -746,7 +743,6 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
sock_no_linger(sk);
iscsi_sw_tcp_conn_set_callbacks(conn);
- tcp_sw_conn->sendpage = tcp_sw_conn->sock->ops->sendpage;
/*
* set receive state machine into initial state
*/
@@ -777,8 +773,6 @@ static int iscsi_sw_tcp_conn_set_param(struct iscsi_cls_conn *cls_conn,
return -ENOTCONN;
}
iscsi_set_param(cls_conn, param, buf, buflen);
- tcp_sw_conn->sendpage = conn->datadgst_en ?
- sock_no_sendpage : tcp_sw_conn->sock->ops->sendpage;
mutex_unlock(&tcp_sw_conn->sock_lock);
break;
case ISCSI_PARAM_MAX_R2T:
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 68e14a344904..d6ec08d7eb63 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -48,7 +48,7 @@ struct iscsi_sw_tcp_conn {
uint32_t sendpage_failures_cnt;
uint32_t discontiguous_hdr_cnt;
- ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+ bool can_splice_to_tcp;
};
struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index b14835fcb033..6231fa4ef5c6 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1129,6 +1129,8 @@ int iscsit_fe_sendpage_sg(
struct iscsit_conn *conn)
{
struct scatterlist *sg = cmd->first_data_sg;
+ struct bio_vec bvec;
+ struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
struct kvec iov;
u32 tx_hdr_size, data_len;
u32 offset = cmd->first_data_sg_off;
@@ -1172,17 +1174,18 @@ int iscsit_fe_sendpage_sg(
u32 space = (sg->length - offset);
u32 sub_len = min_t(u32, data_len, space);
send_pg:
- tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ bvec_set_page(&bvec, sg_page(sg), sub_len, sg->offset + offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, sub_len);
+
+ tx_sent = conn->sock->ops->sendmsg(conn->sock, &msghdr,
+ sub_len);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
- pr_err("tcp_sendpage() returned"
- " -EAGAIN\n");
+ pr_err("sendmsg/splice returned -EAGAIN\n");
goto send_pg;
}
- pr_err("tcp_sendpage() failure: %d\n",
- tx_sent);
+ pr_err("sendmsg/splice failure: %d\n", tx_sent);
return -1;
}
Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
its sendmsg implementation.
Signed-off-by: David Howells <[email protected]>
cc: Karsten Graul <[email protected]>
cc: Wenjia Zhang <[email protected]>
cc: Jan Karcher <[email protected]>
cc: "D. Wythe" <[email protected]>
cc: Tony Lu <[email protected]>
cc: Wen Gu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/smc/af_smc.c | 29 -----------------------------
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 22 +---------------------
net/smc/smc_tx.h | 2 --
5 files changed, 2 insertions(+), 54 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 538e9c6ec8c9..a7f887d91d89 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -3133,34 +3133,6 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
return put_user(answ, (int __user *)arg);
}
-static ssize_t smc_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- struct smc_sock *smc;
- int rc = -EPIPE;
-
- smc = smc_sk(sk);
- lock_sock(sk);
- if (sk->sk_state != SMC_ACTIVE) {
- release_sock(sk);
- goto out;
- }
- release_sock(sk);
- if (smc->use_fallback) {
- rc = kernel_sendpage(smc->clcsock, page, offset,
- size, flags);
- } else {
- lock_sock(sk);
- rc = smc_tx_sendpage(smc, page, offset, size, flags);
- release_sock(sk);
- SMC_STAT_INC(smc, sendpage_cnt);
- }
-
-out:
- return rc;
-}
-
/* Map the affected portions of the rmbe into an spd, note the number of bytes
* to splice in conn->splice_pending, and press 'go'. Delays consumer cursor
* updates till whenever a respective page has been fully processed.
@@ -3232,7 +3204,6 @@ static const struct proto_ops smc_sock_ops = {
.sendmsg = smc_sendmsg,
.recvmsg = smc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = smc_sendpage,
.splice_read = smc_splice_read,
};
diff --git a/net/smc/smc_stats.c b/net/smc/smc_stats.c
index e80e34f7ac15..ca14c0f3a07d 100644
--- a/net/smc/smc_stats.c
+++ b/net/smc/smc_stats.c
@@ -227,7 +227,7 @@ static int smc_nl_fill_stats_tech_data(struct sk_buff *skb,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SENDPAGE_CNT,
- smc_tech->sendpage_cnt,
+ 0,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CORK_CNT,
diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h
index 84b7ecd8c05c..b60fe1eb37ab 100644
--- a/net/smc/smc_stats.h
+++ b/net/smc/smc_stats.h
@@ -71,7 +71,6 @@ struct smc_stats_tech {
u64 clnt_v2_succ_cnt;
u64 srv_v1_succ_cnt;
u64 srv_v2_succ_cnt;
- u64 sendpage_cnt;
u64 urg_data_cnt;
u64 splice_cnt;
u64 cork_cnt;
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 9b9e0a190734..5147207808e5 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -167,8 +167,7 @@ static bool smc_tx_should_cork(struct smc_sock *smc, struct msghdr *msg)
* sndbuf_space is still available. The applications
* should known how/when to uncork it.
*/
- if ((msg->msg_flags & MSG_MORE ||
- smc_tx_is_corked(smc)) &&
+ if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc)) &&
atomic_read(&conn->sndbuf_space))
return true;
@@ -297,25 +296,6 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
return rc;
}
-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags)
-{
- struct msghdr msg = {.msg_flags = flags};
- char *kaddr = kmap(page);
- struct kvec iov;
- int rc;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
- rc = smc_tx_sendmsg(smc, &msg, size);
- kunmap(page);
- return rc;
-}
-
/***************************** sndbuf consumer *******************************/
/* sndbuf consumer: actual data transfer of one target chunk with ISM write */
diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
index 34b578498b1f..a59f370b8b43 100644
--- a/net/smc/smc_tx.h
+++ b/net/smc/smc_tx.h
@@ -31,8 +31,6 @@ void smc_tx_pending(struct smc_connection *conn);
void smc_tx_work(struct work_struct *work);
void smc_tx_init(struct smc_sock *smc);
int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags);
int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
void smc_tx_sndbuf_nonfull(struct smc_sock *smc);
void smc_tx_consumer_update(struct smc_connection *conn, bool force);
Use sendmsg() conditionally with MSG_SPLICE_PAGES in _drbd_send_page()
rather than calling sendpage() or _drbd_no_send_page().
Signed-off-by: David Howells <[email protected]>
cc: Philipp Reisner <[email protected]>
cc: Lars Ellenberg <[email protected]>
cc: "Christoph Böhmwalder" <[email protected]>
cc: Jens Axboe <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
Notes:
ver #2)
- Wrap lines at 80.
drivers/block/drbd/drbd_main.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 83987e7a5ef2..8a01a18a2550 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1540,7 +1540,8 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
int offset, size_t size, unsigned msg_flags)
{
struct socket *socket = peer_device->connection->data.socket;
- int len = size;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = msg_flags, };
int err = -EIO;
/* e.g. XFS meta- & log-data is in slab pages, which have a
@@ -1549,33 +1550,35 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (drbd_disable_sendpage || !sendpage_ok(page))
- return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
+ if (!drbd_disable_sendpage && sendpage_ok(page))
+ msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
+
+ bvec_set_page(&bvec, page, offset, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- msg_flags |= MSG_NOSIGNAL;
drbd_update_congested(peer_device->connection);
do {
int sent;
- sent = socket->ops->sendpage(socket, page, offset, len, msg_flags);
+ sent = sock_sendmsg(socket, &msg);
if (sent <= 0) {
if (sent == -EAGAIN) {
if (we_should_drop_the_connection(peer_device->connection, socket))
break;
continue;
}
- drbd_warn(peer_device->device, "%s: size=%d len=%d sent=%d\n",
- __func__, (int)size, len, sent);
+ drbd_warn(peer_device->device, "%s: size=%d len=%zu sent=%d\n",
+ __func__, (int)size, msg_data_left(&msg),
+ sent);
if (sent < 0)
err = sent;
break;
}
- len -= sent;
- offset += sent;
- } while (len > 0 /* THINK && device->cstate >= C_CONNECTED*/);
+ } while (msg_data_left(&msg)
+ /* THINK && device->cstate >= C_CONNECTED*/);
clear_bit(NET_CONGESTED, &peer_device->connection->flags);
- if (len == 0) {
+ if (!msg_data_left(&msg)) {
err = 0;
peer_device->device->send_cnt += size >> 9;
}
If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
some data that's resident in the slab, copy it rather than returning EIO.
This can be made use of by a number of drivers in the kernel, including:
iwarp, ceph/rds, dlm, nvme, ocfs2, drdb. It could also be used by iscsi,
rxrpc, sunrpc, cifs and probably others.
skb_splice_from_iter() is given it's own fragment allocator as
page_frag_alloc_align() can't be used because it does no locking to prevent
parallel callers from racing. alloc_skb_frag() uses a separate folio for
each cpu and locks to the cpu whilst allocating, reenabling cpu migration
around folio allocation.
This could allocate a whole page instead for each fragment to be copied, as
alloc_skb_with_frags() would do instead, but that would waste a lot of
space (most of the fragments look like they're going to be small).
This allows an entire message that consists of, say, a protocol header or
two, a number of pages of data and a protocol footer to be sent using a
single call to sock_sendmsg().
The callers could be made to copy the data into fragments before calling
sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.
Signed-off-by: David Howells <[email protected]>
cc: Alexander Duyck <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: David Ahern <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Menglong Dong <[email protected]>
cc: [email protected]
---
Notes:
ver #2)
- Fix parameter to put_cpu_ptr() to have an '&'.
include/linux/skbuff.h | 5 ++
net/core/skbuff.c | 171 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 173 insertions(+), 3 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 91ed66952580..0ba776cd9be8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -5037,6 +5037,11 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
#endif
}
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
+ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
+ ssize_t maxsize, gfp_t gfp);
+
ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
ssize_t maxsize, gfp_t gfp);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fee2b1c105fe..d962c93a429d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6755,6 +6755,145 @@ nodefer: __kfree_skb(skb);
smp_call_function_single_async(cpu, &sd->defer_csd);
}
+struct skb_splice_frag_cache {
+ struct folio *folio;
+ void *virt;
+ unsigned int offset;
+ /* we maintain a pagecount bias, so that we dont dirty cache line
+ * containing page->_refcount every time we allocate a fragment.
+ */
+ unsigned int pagecnt_bias;
+ bool pfmemalloc;
+};
+
+static DEFINE_PER_CPU(struct skb_splice_frag_cache, skb_splice_frag_cache);
+
+/**
+ * alloc_skb_frag - Allocate a page fragment for using in a socket
+ * @fragsz: The size of fragment required
+ * @gfp: Allocation flags
+ */
+void *alloc_skb_frag(size_t fragsz, gfp_t gfp)
+{
+ struct skb_splice_frag_cache *cache;
+ struct folio *folio, *spare = NULL;
+ size_t offset, fsize;
+ void *p;
+
+ if (WARN_ON_ONCE(fragsz == 0))
+ fragsz = 1;
+
+ cache = get_cpu_ptr(&skb_splice_frag_cache);
+reload:
+ folio = cache->folio;
+ offset = cache->offset;
+try_again:
+ if (fragsz > offset)
+ goto insufficient_space;
+
+ /* Make the allocation. */
+ cache->pagecnt_bias--;
+ offset = ALIGN_DOWN(offset - fragsz, SMP_CACHE_BYTES);
+ cache->offset = offset;
+ p = cache->virt + offset;
+ put_cpu_ptr(&skb_splice_frag_cache);
+ if (spare)
+ folio_put(spare);
+ return p;
+
+insufficient_space:
+ /* See if we can refurbish the current folio. */
+ if (!folio || !folio_ref_sub_and_test(folio, cache->pagecnt_bias))
+ goto get_new_folio;
+ if (unlikely(cache->pfmemalloc)) {
+ __folio_put(folio);
+ goto get_new_folio;
+ }
+
+ fsize = folio_size(folio);
+ if (unlikely(fragsz > fsize))
+ goto frag_too_big;
+
+ /* OK, page count is 0, we can safely set it */
+ folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+ /* Reset page count bias and offset to start of new frag */
+ cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = fsize;
+ goto try_again;
+
+get_new_folio:
+ if (!spare) {
+ cache->folio = NULL;
+ put_cpu_ptr(&skb_splice_frag_cache);
+
+#if PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE
+ spare = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC,
+ PAGE_FRAG_CACHE_MAX_ORDER);
+ if (!spare)
+#endif
+ spare = folio_alloc(gfp, 0);
+ if (!spare)
+ return NULL;
+
+ cache = get_cpu_ptr(&skb_splice_frag_cache);
+ /* We may now be on a different cpu and/or someone else may
+ * have refilled it
+ */
+ cache->pfmemalloc = folio_is_pfmemalloc(spare);
+ if (cache->folio)
+ goto reload;
+ }
+
+ cache->folio = spare;
+ cache->virt = folio_address(spare);
+ folio = spare;
+ spare = NULL;
+
+ /* Even if we own the page, we do not use atomic_set(). This would
+ * break get_page_unless_zero() users.
+ */
+ folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+ /* Reset page count bias and offset to start of new frag */
+ cache->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = folio_size(folio);
+ goto try_again;
+
+frag_too_big:
+ /* The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+ * but the cache isn't big enough to satisfy the request, this may
+ * happen in low memory conditions. We don't release the cache page
+ * because it could make memory pressure worse so we simply return NULL
+ * here.
+ */
+ cache->offset = offset;
+ put_cpu_ptr(&skb_splice_frag_cache);
+ if (spare)
+ folio_put(spare);
+ return NULL;
+}
+EXPORT_SYMBOL(alloc_skb_frag);
+
+/**
+ * copy_skb_frag - Copy data into a page fragment.
+ * @s: The data to copy
+ * @len: The size of the data
+ * @gfp: Allocation flags
+ */
+void *copy_skb_frag(const void *s, size_t len, gfp_t gfp)
+{
+ void *p;
+
+ p = alloc_skb_frag(len, gfp);
+ if (!p)
+ return NULL;
+
+ return memcpy(p, s, len);
+}
+EXPORT_SYMBOL(copy_skb_frag);
+
static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
size_t offset, size_t len)
{
@@ -6808,17 +6947,43 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
break;
}
+ if (space == 0 &&
+ !skb_can_coalesce(skb, skb_shinfo(skb)->nr_frags,
+ pages[0], off)) {
+ iov_iter_revert(iter, len);
+ break;
+ }
+
i = 0;
do {
struct page *page = pages[i++];
size_t part = min_t(size_t, PAGE_SIZE - off, len);
-
- ret = -EIO;
- if (WARN_ON_ONCE(!sendpage_ok(page)))
+ bool put = false;
+
+ if (PageSlab(page)) {
+ const void *p;
+ void *q;
+
+ p = kmap_local_page(page);
+ q = copy_skb_frag(p + off, part, gfp);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(iter, len);
+ ret = -ENOMEM;
+ goto out;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ } else if (WARN_ON_ONCE(!sendpage_ok(page))) {
+ ret = -EIO;
goto out;
+ }
ret = skb_append_pagefrags(skb, page, off, part,
frag_limit);
+ if (put)
+ put_page(page);
if (ret < 0) {
iov_iter_revert(iter, len);
goto out;
David Howells wrote:
> If sendmsg() is passed MSG_SPLICE_PAGES and is given a buffer that contains
> some data that's resident in the slab, copy it rather than returning EIO.
> This can be made use of by a number of drivers in the kernel, including:
> iwarp, ceph/rds, dlm, nvme, ocfs2, drdb. It could also be used by iscsi,
> rxrpc, sunrpc, cifs and probably others.
>
> skb_splice_from_iter() is given it's own fragment allocator as
> page_frag_alloc_align() can't be used because it does no locking to prevent
> parallel callers from racing. alloc_skb_frag() uses a separate folio for
> each cpu and locks to the cpu whilst allocating, reenabling cpu migration
> around folio allocation.
>
> This could allocate a whole page instead for each fragment to be copied, as
> alloc_skb_with_frags() would do instead, but that would waste a lot of
> space (most of the fragments look like they're going to be small).
>
> This allows an entire message that consists of, say, a protocol header or
> two, a number of pages of data and a protocol footer to be sent using a
> single call to sock_sendmsg().
>
> The callers could be made to copy the data into fragments before calling
> sendmsg(), but that then penalises them if MSG_SPLICE_PAGES gets ignored.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Alexander Duyck <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: David Ahern <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: Menglong Dong <[email protected]>
> cc: [email protected]
> ---
>
> Notes:
> ver #2)
> - Fix parameter to put_cpu_ptr() to have an '&'.
>
> include/linux/skbuff.h | 5 ++
> net/core/skbuff.c | 171 ++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 173 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 91ed66952580..0ba776cd9be8 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -5037,6 +5037,11 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb)
> #endif
> }
>
> +void *alloc_skb_frag(size_t fragsz, gfp_t gfp);
> +void *copy_skb_frag(const void *s, size_t len, gfp_t gfp);
> +ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
> + ssize_t maxsize, gfp_t gfp);
> +
> ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
> ssize_t maxsize, gfp_t gfp);
>
duplicate declaration
(no need to respin just for this, imho)