Hi Willy, Dave, et al.,
[NOTE! This patchset is a work in progress and some modules will not
compile with it.]
I've been looking at how to make pipes handle the splicing in of multipage
folios and also looking to see if I could implement a suggestion from Willy
that pipe_buffers could perhaps hold a list of pages (which could make
splicing simpler - an entire splice segment would go in a single
pipe_buffer).
There are a couple of issues here:
(1) Gifting/stealing a multipage folio is really tricky. I think that if
a multipage folio is gifted, the gift flag should be quietly dropped.
Userspace has no control over what splice() and vmsplice() will see in
the pagecache.
(2) The sendpage op expects to be given a single page and various network
protocols just attach that to a socket buffer.
This patchset aims to deal with the second by removing the ->sendpage()
operation and replacing it with sendmsg() and a new internal flag
MSG_SPLICE_PAGES. As sendmsg() takes an I/O iterator, this also affords
the opportunity to pass a slew of pages in one go, rather than one at a
time.
If MSG_SPLICE_PAGES is set, the current implementation requires that the
iterator be ITER_BVEC-type and that the pages can be retained by calling
get_page() on them. Note that I'm accessing the bvec[] directly, but
should really use iov_iter_extract_pages() which would allow an
ITER_XARRAY-type iterator to be used also.
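To make the intended usage concrete, here's a minimal sketch of what a
kernel-side caller might do to splice several pages into a socket in one
call (sock, the pages, lengths and offsets are placeholders; error handling
is omitted and each page must be able to have a ref taken with get_page()):

        struct bio_vec bv[3];
        struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };
        int ret;

        /* Describe the pages to be spliced in a bvec[] array. */
        bvec_set_page(&bv[0], page0, len0, off0);
        bvec_set_page(&bv[1], page1, len1, 0);
        bvec_set_page(&bv[2], page2, len2, 0);

        /* Attach the array as an ITER_BVEC source iterator and send. */
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 3, len0 + len1 + len2);
        ret = sock_sendmsg(sock, &msg);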
The patchset consists of the following parts:
(1) Define the MSG_SPLICE_PAGES flag.
(2) Provide a simple allocator that takes pages and splits pieces off them
on request and returns them with a ref on the page. Unlike with slab
memory, the lifetime of the allocated memory is controlled by the page
refcount. This allows protocol bits to be included in the same bvec[]
as the data.
(3) Implement MSG_SPLICE_PAGES support in TCP.
(4) Make do_tcp_sendpages() just wrap sendmsg() and then fold it into its
various callers.
(5) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
a wrapper around sendmsg().
(6) Implement MSG_SPLICE_PAGES support in AF_UNIX.
(7) Implement MSG_SPLICE_PAGES support in AF_ALG and make
af_alg_sendpage() just a wrapper around sendmsg().
(8) Rename pipe_to_sendpage() to pipe_to_sendmsg() and make it a wrapper
around sendmsg().
(9) Remove sendpage file operation.
(10) Convert siw, ceph, iscsi and tcp_bpf to use sendmsg() instead of
tcp_sendpage().
(11) Make skb_send_sock() use sendmsg().
(12) Remove AF_ALG's hash_sendpage() as hash_sendmsg() seems to just paste
the page pointers in anyway.
(13) Convert ceph, rds, dlm and sunrpc to use sendmsg().
(14) Remove the sendpage socket operation.
This leaves the implementation of MSG_SPLICE_PAGES in AF_TLS, AF_KCM,
AF_SMC and Chelsio-TLS, which I'm going to need help with, and the cleanup
of kernel_sendpage use in AF_KCM, AF_SMC and NVMe over TCP still to be
done.
I'm wondering about how best to proceed further:
- Rather than providing a special allocator, should protocols implementing
MSG_SPLICE_PAGES recognise pages that belong to the slab allocator and
copy the content of those to the skbuff and only directly attach the
source page if it's not a slab page?
- Should MSG_SPLICE_PAGES work with ITER_XARRAY as well as ITER_BVEC?
- Should MSG_SPLICE_PAGES just be a hint and get ignored if the conditions
for using it are not met rather than giving an error?
- Should pages attached to a pipe be pinned (ie. FOLL_PIN) rather than
simply ref'd (ie. FOLL_GET) so that the DIO issue doesn't occur on
spliced pages?
- Similarly, should pages undergoing zerocopy be pinned when attached to
an skbuff rather than being simply ref'd? I have a patch to note in the
bottom two bits of the frag page pointer whether they are pinned, ref'd
or neither (see the sketch below).
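Purely as illustration of that last point, the encoding might look
something like the following (a sketch only; the names are made up and
real code would hide this behind accessor helpers):

        #define FRAG_HOLD_REF   1UL     /* Page is held by a ref (FOLL_GET) */
        #define FRAG_HOLD_PIN   2UL     /* Page is held by a pin (FOLL_PIN) */
        #define FRAG_HOLD_MASK  3UL

        /* struct page pointers are at least word-aligned, so the bottom
         * two bits are free to carry the hold type.
         */
        static inline struct page *frag_hold_page(unsigned long v)
        {
                return (struct page *)(v & ~FRAG_HOLD_MASK);
        }

        static inline unsigned long frag_hold_type(unsigned long v)
        {
                return v & FRAG_HOLD_MASK;
        }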
I have tested AF_UNIX splicing - which, surprisingly, seems nearly twice as
fast - TCP splicing, the siw driver (softIWarp RDMA with nfs and cifs),
sunrpc (with nfsd) and UDP (using a patched rxrpc).
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-sendpage
David
David Howells (28):
net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
tcp: Support MSG_SPLICE_PAGES
tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
tcp_sendmsg
espintcp: Inline do_tcp_sendpages()
tls: Inline do_tcp_sendpages()
siw: Inline do_tcp_sendpages()
tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
ip, udp: Support MSG_SPLICE_PAGES
udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
af_unix: Support MSG_SPLICE_PAGES
crypto: af_alg: Indent the loop in af_alg_sendmsg()
crypto: af_alg: Support MSG_SPLICE_PAGES
crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
Remove file->f_op->sendpage
siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
tcp_bpf: Make tcp_bpf_sendpage() go through
tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
algif: Remove hash_sendpage*()
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
Documentation/networking/scaling.rst | 4 +-
crypto/Kconfig | 1 +
crypto/af_alg.c | 137 +++++--------
crypto/algif_aead.c | 40 ++--
crypto/algif_hash.c | 66 ------
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 22 +-
drivers/infiniband/sw/siw/siw_qp_tx.c | 224 +++++----------------
drivers/target/iscsi/iscsi_target_util.c | 14 +-
fs/dlm/lowcomms.c | 10 +-
fs/splice.c | 42 ++--
include/linux/fs.h | 3 -
include/linux/net.h | 8 -
include/linux/socket.h | 1 +
include/linux/splice.h | 2 +
include/linux/zcopy_alloc.h | 16 ++
include/net/inet_common.h | 2 -
include/net/sock.h | 6 -
include/net/tcp.h | 2 -
include/net/tls.h | 2 +-
mm/Makefile | 2 +-
mm/zcopy_alloc.c | 129 ++++++++++++
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/ceph/messenger_v1.c | 58 ++----
net/ceph/messenger_v2.c | 89 ++-------
net/core/skbuff.c | 49 +++--
net/core/sock.c | 35 +---
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 --
net/ipv4/ip_output.c | 89 ++++++++-
net/ipv4/tcp.c | 244 +++++------------------
net/ipv4/tcp_bpf.c | 72 ++-----
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 54 -----
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rds/tcp_send.c | 80 ++++----
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/socket.c | 74 +------
net/sunrpc/svcsock.c | 70 ++-----
net/sunrpc/xdr.c | 24 ++-
net/tipc/socket.c | 3 -
net/tls/tls_main.c | 24 ++-
net/unix/af_unix.c | 223 +++++++--------------
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
net/xfrm/espintcp.c | 10 +-
75 files changed, 687 insertions(+), 1313 deletions(-)
create mode 100644 include/linux/zcopy_alloc.h
create mode 100644 mm/zcopy_alloc.c
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can.
This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.
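On the protocol side, the expected shape of the check is roughly the
following (a sketch of the pattern used later in this series, not a
definitive implementation; the splice variable is illustrative):

        if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
                /* Only BVEC-type iterators are accepted for now; a protocol
                 * could instead ignore the hint and fall back to copying.
                 */
                if (!iov_iter_is_bvec(&msg->msg_iter))
                        return -EINVAL;
                splice = true;  /* Take page refs later instead of copying */
        }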
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/socket.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..a67d02da3c54 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -327,6 +327,7 @@ struct ucred {
*/
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
+#define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file
descriptor received through
If a network protocol sendmsg() sees MSG_SPLICE_PAGES, it expects that the
iterator is of ITER_BVEC type and that all the pages can have refs taken on
them with get_page() and discarded with put_page(). Bits of network
filesystem protocol data, however, are typically contained in slab memory
for which the cleanup method is kfree(), not put_page(), so this doesn't
work.
Provide a simple allocator, zcopy_alloc(), that allocates a page at a time
per-cpu and sequentially breaks off pieces and hands them out with a ref as
it's asked for them. The caller disposes of the memory it was given by
calling put_page(). When a page is all parcelled out, it is abandoned by
the allocator and another page is obtained. The page will get cleaned up
when the last skbuff fragment is destroyed.
A helper function, zcopy_memdup() is provided to call zcopy_alloc() and
copy the data it is given into it.
[!] I'm not sure this is the best way to do things. A better way might be
to make the network protocol look at the page and copy it if it's a
slab object rather than taking a ref on it.
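As an example of how this is meant to be used, a protocol that wants to
prefix spliced data with a small header might do something like the
following (a sketch; hdr, data_page, data_len and data_off are
placeholders and error unwinding is omitted):

        struct bio_vec bv[2];
        struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };
        int ret;

        /* Copy the header into page-backed memory that carries its own
         * ref and so can be released by put_page() like the data page.
         */
        ret = zcopy_memdup(sizeof(hdr), &hdr, &bv[0], GFP_KERNEL);
        if (ret < 0)
                return ret;

        bvec_set_page(&bv[1], data_page, data_len, data_off);
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 2,
                      sizeof(hdr) + data_len);
        ret = sock_sendmsg(sock, &msg);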
Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
include/linux/zcopy_alloc.h | 16 +++++
mm/Makefile | 2 +-
mm/zcopy_alloc.c | 129 ++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+), 1 deletion(-)
create mode 100644 include/linux/zcopy_alloc.h
create mode 100644 mm/zcopy_alloc.c
diff --git a/include/linux/zcopy_alloc.h b/include/linux/zcopy_alloc.h
new file mode 100644
index 000000000000..8eb205678073
--- /dev/null
+++ b/include/linux/zcopy_alloc.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Defs for the zerocopy filler fragment allocator.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#ifndef _LINUX_ZCOPY_ALLOC_H
+#define _LINUX_ZCOPY_ALLOC_H
+
+struct bio_vec;
+
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp);
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp);
+
+#endif /* _LINUX_ZCOPY_ALLOC_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..3848f43751ee 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
- compaction.o \
+ compaction.o zcopy_alloc.o \
interval_tree.o list_lru.o workingset.o \
debug.o gup.o mmap_lock.o $(mmu-y)
diff --git a/mm/zcopy_alloc.c b/mm/zcopy_alloc.c
new file mode 100644
index 000000000000..7b219392e829
--- /dev/null
+++ b/mm/zcopy_alloc.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Allocator for zerocopy filler fragments
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ *
+ * Provide a facility whereby pieces of bufferage can be allocated for
+ * insertion into bio_vec arrays intended for zerocopying, allowing protocol
+ * stuff to be mixed in with data.
+ *
+ * Unlike objects allocated from the slab, the lifetimes of these pieces of
+ * buffer are governed purely by the refcount of the page in which they reside.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/zcopy_alloc.h>
+#include <linux/bvec.h>
+
+struct zcopy_alloc_info {
+ struct folio *folio; /* Page currently being allocated from */
+ struct folio *spare; /* Spare page */
+ unsigned int used; /* Amount of folio used */
+ spinlock_t lock; /* Allocation lock (needs bh-disable) */
+};
+
+static struct zcopy_alloc_info __percpu *zcopy_alloc_info;
+
+static int __init zcopy_alloc_init(void)
+{
+ zcopy_alloc_info = alloc_percpu(struct zcopy_alloc_info);
+ if (!zcopy_alloc_info)
+ panic("Unable to set up zcopy_alloc allocator\n");
+ return 0;
+}
+subsys_initcall(zcopy_alloc_init);
+
+/**
+ * zcopy_alloc - Allocate some memory for use in zerocopy
+ * @size: The amount of memory (maximum 1/2 page).
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data. Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and pieces are broken off
+ * sequentially and handed out to callers with a ref until it no longer has
+ * enough spare space, at which point the allocator's ref is dropped and a new
+ * folio is allocated. The folio remains in existence until the last ref held
+ * by, say, an sk_buff is discarded and then the folio is returned to the
+ * page allocator.
+ *
+ * Returns 0 on success and -ENOMEM on allocation failure. If successful, the
+ * details of the allocated memory are placed in *@bvec.
+ *
+ * The allocated memory should be disposed of with folio_put().
+ */
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp)
+{
+ struct zcopy_alloc_info *info;
+ struct folio *folio, *spare = NULL;
+ size_t full_size = round_up(size, 8);
+
+ if (WARN_ON_ONCE(full_size > PAGE_SIZE / 2))
+ return -ENOMEM; /* Allocate pages */
+
+try_again:
+ info = get_cpu_ptr(zcopy_alloc_info);
+
+ folio = info->folio;
+ if (folio && folio_size(folio) - info->used < full_size) {
+ folio_put(folio);
+ folio = info->folio = NULL;
+ }
+ if (spare && !info->spare) {
+ info->spare = spare;
+ spare = NULL;
+ }
+ if (!folio && info->spare) {
+ folio = info->folio = info->spare;
+ info->spare = NULL;
+ info->used = 0;
+ }
+ if (folio) {
+ bvec_set_folio(bvec, folio, size, info->used);
+ info->used += full_size;
+ if (info->used < folio_size(folio))
+ folio_get(folio);
+ else
+ info->folio = NULL;
+ }
+
+ put_cpu_ptr(zcopy_alloc_info);
+ if (folio) {
+ if (spare)
+ folio_put(spare);
+ return 0;
+ }
+
+ spare = folio_alloc(gfp, 0);
+ if (!spare)
+ return -ENOMEM;
+ goto try_again;
+}
+EXPORT_SYMBOL(zcopy_alloc);
+
+/**
+ * zcopy_memdup - Allocate some memory for use in zerocopy and fill it
+ * @size: The amount of memory to copy (maximum 1/2 page).
+ * @p: The source data to copy
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ */
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp)
+{
+ void *q;
+
+ if (zcopy_alloc(size, bvec, gfp) < 0)
+ return -ENOMEM;
+
+ q = kmap_local_folio(page_folio(bvec->bv_page), bvec->bv_offset);
+ memcpy(q, p, size);
+ kunmap_local(q);
+ return 0;
+}
+EXPORT_SYMBOL(zcopy_memdup);
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it. This is part of replacing ->sendpage() with a call to
sendmsg() with MSG_SPLICE_PAGES set.
Signed-off-by: David Howells <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ipv4/tcp_bpf.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cf26d65ca389..7f17134637eb 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -72,11 +72,13 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
{
bool apply = apply_bytes;
struct scatterlist *sge;
+ struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct page *page;
int size, ret = 0;
u32 off;
while (1) {
+ struct bio_vec bvec;
bool has_tx_ulp;
sge = sk_msg_elem(msg, msg->sg.start);
@@ -88,16 +90,18 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
tcp_rate_check_app_limited(sk);
retry:
has_tx_ulp = tls_sw_has_ctx_tx(sk);
- if (has_tx_ulp) {
- flags |= MSG_SENDPAGE_NOPOLICY;
- ret = kernel_sendpage_locked(sk,
- page, off, size, flags);
- } else {
- ret = do_tcp_sendpages(sk, page, off, size, flags);
- }
+ if (has_tx_ulp)
+ msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msghdr.msg_flags |= MSG_MORE;
+
+ bvec_set_page(&bvec, page, size, off);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ ret = tcp_sendmsg_locked(sk, &msghdr, size);
if (ret <= 0)
return ret;
+
if (apply)
apply_bytes -= ret;
msg->sg.size -= ret;
@@ -398,7 +402,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
long timeo;
int flags;
- /* Don't let internal do_tcp_sendpages() flags through */
+ /* Don't let internal sendpage flags through */
flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
flags |= MSG_NO_SHARED_FRAGS;
Make TCP's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 53 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..77c0c69208a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
int flags, err, copied = 0;
int mss_now = 0, size_goal, copied_syn = 0;
int process_backlog = 0;
- bool zc = false;
+ int zc = 0;
long timeo;
flags = msg->msg_flags;
@@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (msg->msg_ubuf) {
uarg = msg->msg_ubuf;
net_zcopy_get(uarg);
- zc = sk->sk_route_caps & NETIF_F_SG;
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 1;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg) {
err = -ENOBUFS;
goto out_err;
}
- zc = sk->sk_route_caps & NETIF_F_SG;
- if (!zc)
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 1;
+ else
uarg_to_msgzc(uarg)->zerocopy = 0;
}
+ } else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
+ if (!iov_iter_is_bvec(&msg->msg_iter))
+ return -EINVAL;
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 2;
}
if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1345,7 +1352,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);
- if (!zc) {
+ if (zc == 0) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1397,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
page_ref_inc(pfrag->page);
}
pfrag->offset += copy;
- } else {
+ } else if (zc == 1) {
/* First append to a fragless skb builds initial
* pure zerocopy skb
*/
@@ -1411,6 +1418,46 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (err < 0)
goto do_error;
copy = err;
+ } else if (zc == 2) {
+ /* Splice in data. */
+ const struct bio_vec *bv = msg->msg_iter.bvec;
+ size_t seg = iov_iter_single_seg_count(&msg->msg_iter);
+ size_t off = bv->bv_offset + msg->msg_iter.iov_offset;
+ bool can_coalesce;
+ int i = skb_shinfo(skb)->nr_frags;
+
+ if (copy > seg)
+ copy = seg;
+
+ can_coalesce = skb_can_coalesce(skb, i, bv->bv_page, off);
+ if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+ tcp_mark_push(tp, skb);
+ goto new_segment;
+ }
+ if (tcp_downgrade_zcopy_pure(sk, skb))
+ goto wait_for_space;
+
+ copy = tcp_wmem_schedule(sk, copy);
+ if (!copy)
+ goto wait_for_space;
+
+ if (can_coalesce) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ } else {
+ get_page(bv->bv_page);
+ skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
+ }
+ iov_iter_advance(&msg->msg_iter, copy);
+
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk_wmem_queued_add(sk, copy);
+ sk_mem_charge(sk, copy);
+
}
if (!copied)
Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself. do_tcp_sendpages() can then be
inlined into its callers in subsequent patches.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 160 +++----------------------------------------------
1 file changed, 9 insertions(+), 151 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 77c0c69208a5..7c3acc5673e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,163 +971,21 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}
-static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
- struct page *page, int offset, size_t *size)
-{
- struct sk_buff *skb = tcp_write_queue_tail(sk);
- struct tcp_sock *tp = tcp_sk(sk);
- bool can_coalesce;
- int copy, i;
-
- if (!skb || (copy = size_goal - skb->len) <= 0 ||
- !tcp_skb_can_collapse_to(skb)) {
-new_segment:
- if (!sk_stream_memory_free(sk))
- return NULL;
-
- skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
- tcp_rtx_and_write_queues_empty(sk));
- if (!skb)
- return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
- skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
- tcp_skb_entail(sk, skb);
- copy = size_goal;
- }
-
- if (copy > *size)
- copy = *size;
-
- i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, offset);
- if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
- tcp_mark_push(tp, skb);
- goto new_segment;
- }
- if (tcp_downgrade_zcopy_pure(sk, skb))
- return NULL;
-
- copy = tcp_wmem_schedule(sk, copy);
- if (!copy)
- return NULL;
-
- if (can_coalesce) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
- } else {
- get_page(page);
- skb_fill_page_desc_noacc(skb, i, page, offset, copy);
- }
-
- if (!(flags & MSG_NO_SHARED_FRAGS))
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
- skb->len += copy;
- skb->data_len += copy;
- skb->truesize += copy;
- sk_wmem_queued_add(sk, copy);
- sk_mem_charge(sk, copy);
- WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
- TCP_SKB_CB(skb)->end_seq += copy;
- tcp_skb_pcount_set(skb, 0);
-
- *size = copy;
- return skb;
-}
-
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct tcp_sock *tp = tcp_sk(sk);
- int mss_now, size_goal;
- int err;
- ssize_t copied;
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-
- if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- WARN_ONCE(!sendpage_ok(page),
- "page must not be a Slab one and have page_count > 0"))
- return -EINVAL;
-
- /* Wait for a connection to finish. One exception is TCP Fast Open
- * (passive side) where data is allowed to be sent before a connection
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
- !tcp_passive_fastopen(sk)) {
- err = sk_stream_wait_connect(sk, &timeo);
- if (err != 0)
- goto out_err;
- }
-
- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
- copied = 0;
-
- err = -EPIPE;
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
- while (size > 0) {
- struct sk_buff *skb;
- size_t copy = size;
-
- skb = tcp_build_frag(sk, size_goal, flags, page, offset, &copy);
- if (!skb)
- goto wait_for_space;
-
- if (!copied)
- TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
- copied += copy;
- offset += copy;
- size -= copy;
- if (!size)
- goto out;
-
- if (skb->len < size_goal || (flags & MSG_OOB))
- continue;
-
- if (forced_push(tp)) {
- tcp_mark_push(tp, skb);
- __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
- } else if (skb == tcp_send_head(sk))
- tcp_push_one(sk, mss_now);
- continue;
-
-wait_for_space:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- tcp_push(sk, flags & ~MSG_MORE, mss_now,
- TCP_NAGLE_PUSH, size_goal);
-
- err = sk_stream_wait_memory(sk, &timeo);
- if (err != 0)
- goto do_error;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES,
+ };
- mss_now = tcp_send_mss(sk, &size_goal, flags);
- }
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-out:
- if (copied) {
- tcp_tx_timestamp(sk, sk->sk_tsflags);
- if (!(flags & MSG_SENDPAGE_NOTLAST))
- tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
- }
- return copied;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
-do_error:
- tcp_remove_empty_skb(sk);
- if (copied)
- goto out;
-out_err:
- /* make sure we wake any epoll edge trigger waiter */
- if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
- sk->sk_write_space(sk);
- tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
- }
- return sk_stream_error(sk, flags, err);
+ return tcp_sendmsg_locked(sk, &msg, size);
}
EXPORT_SYMBOL_GPL(do_tcp_sendpages);
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.
Signed-off-by: David Howells <[email protected]>
cc: Steffen Klassert <[email protected]>
cc: Herbert Xu <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/xfrm/espintcp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 872b80188e83..3504925babdb 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,14 +205,16 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
static int espintcp_sendskmsg_locked(struct sock *sk,
struct espintcp_msg *emsg, int flags)
{
+ struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct sk_msg *skmsg = &emsg->skmsg;
struct scatterlist *sg;
int done = 0;
int ret;
- flags |= MSG_SENDPAGE_NOTLAST;
+ msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
sg = &skmsg->sg.data[skmsg->sg.start];
do {
+ struct bio_vec bvec;
size_t size = sg->length - emsg->offset;
int offset = sg->offset + emsg->offset;
struct page *p;
@@ -220,11 +222,13 @@ static int espintcp_sendskmsg_locked(struct sock *sk,
emsg->offset = 0;
if (sg_is_last(sg))
- flags &= ~MSG_SENDPAGE_NOTLAST;
+ msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
p = sg_page(sg);
retry:
- ret = do_tcp_sendpages(sk, p, offset, size, flags);
+ bvec_set_page(&bvec, p, size, offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ ret = tcp_sendmsg_locked(sk, &msghdr, size);
if (ret < 0) {
emsg->offset = offset - sg->offset;
skmsg->sg.start += done;
Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the
source iterator if the flag is set and the iterator is of ITER_BVEC type,
and copying the data in otherwise.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/unix/af_unix.c | 84 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 68 insertions(+), 16 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 347122c3575e..6f3454db9c53 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2151,6 +2151,44 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
}
#endif
+/*
+ * Extract pages from a BVEC-type iterator and add them to the socket buffer.
+ */
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
+ struct iov_iter *iter, ssize_t maxsize)
+{
+ const struct bio_vec *bv = iter->bvec;
+ unsigned long start = iter->iov_offset;
+ unsigned int i;
+ ssize_t ret = 0;
+
+ for (i = 0; i < iter->nr_segs; i++) {
+ size_t off, len;
+
+ len = bv[i].bv_len;
+ if (start >= len) {
+ start -= len;
+ continue;
+ }
+
+ len = min_t(size_t, maxsize, len - start);
+ off = bv[i].bv_offset + start;
+
+ if (skb_append_pagefrags(skb, bv[i].bv_page, off, len) < 0)
+ break;
+
+ ret += len;
+ maxsize -= len;
+ if (maxsize <= 0)
+ break;
+ start = 0;
+ }
+
+ if (ret > 0)
+ iov_iter_advance(iter, ret);
+ return ret;
+}
+
static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
@@ -2194,19 +2232,25 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
while (sent < len) {
size = len - sent;
- /* Keep two messages in the pipe so it schedules better */
- size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ skb = sock_alloc_send_pskb(sk, 0, 0,
+ msg->msg_flags & MSG_DONTWAIT,
+ &err, 0);
+ } else {
+ /* Keep two messages in the pipe so it schedules better */
+ size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
- /* allow fallback to order-0 allocations */
- size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+ /* allow fallback to order-0 allocations */
+ size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
- data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+ data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
- data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+ data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
- skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
- msg->msg_flags & MSG_DONTWAIT, &err,
- get_order(UNIX_SKB_FRAGS_SZ));
+ skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+ msg->msg_flags & MSG_DONTWAIT, &err,
+ get_order(UNIX_SKB_FRAGS_SZ));
+ }
if (!skb)
goto out_err;
@@ -2218,13 +2262,21 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
}
fds_sent = true;
- skb_put(skb, size - data_len);
- skb->data_len = data_len;
- skb->len = size;
- err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
- if (err) {
- kfree_skb(skb);
- goto out_err;
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+ skb->data_len += size;
+ skb->len += size;
+ skb->truesize += size;
+ refcount_add(size, &sk->sk_wmem_alloc);
+ } else {
+ skb_put(skb, size - data_len);
+ skb->data_len = data_len;
+ skb->len = size;
+ err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+ if (err) {
+ kfree_skb(skb);
+ goto out_err;
+ }
}
unix_state_lock(other);
Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/ip_output.c | 89 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 86 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4e4e308c3230..721d7e4343ed 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -977,7 +977,7 @@ static int __ip_append_data(struct sock *sk,
int err;
int offset = 0;
bool zc = false;
- unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
+ unsigned int maxfraglen, fragheaderlen, maxnonfragsize, xlength;
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
@@ -1017,6 +1017,7 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
+ xlength = length;
if ((flags & MSG_ZEROCOPY) && length) {
struct msghdr *msg = from;
@@ -1047,6 +1048,16 @@ static int __ip_append_data(struct sock *sk,
skb_zcopy_set(skb, uarg, &extra_uref);
}
}
+ } else if ((flags & MSG_SPLICE_PAGES) && length) {
+ struct msghdr *msg = from;
+
+ if (!iov_iter_is_bvec(&msg->msg_iter))
+ return -EINVAL;
+ if (inet->hdrincl)
+ return -EPERM;
+ if (!(rt->dst.dev->features & NETIF_F_SG))
+ return -EOPNOTSUPP;
+ xlength = transhdrlen; /* We need an empty buffer to attach stuff to */
}
cork->length += length;
@@ -1074,6 +1085,50 @@ static int __ip_append_data(struct sock *sk,
unsigned int alloclen, alloc_extra;
unsigned int pagedlen;
struct sk_buff *skb_prev;
+
+ if (unlikely(flags & MSG_SPLICE_PAGES)) {
+ skb_prev = skb;
+ fraggap = skb_prev->len - maxfraglen;
+
+ alloclen = fragheaderlen + hh_len + fraggap + 15;
+ skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
+ if (unlikely(!skb)) {
+ err = -ENOBUFS;
+ goto error;
+ }
+
+ /*
+ * Fill in the control structures
+ */
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->csum = 0;
+ skb_reserve(skb, hh_len);
+
+ /*
+ * Find where to start putting bytes.
+ */
+ skb_put(skb, fragheaderlen + fraggap);
+ skb_reset_network_header(skb);
+ skb->transport_header = (skb->network_header +
+ fragheaderlen);
+ if (fraggap) {
+ skb->csum = skb_copy_and_csum_bits(
+ skb_prev, maxfraglen,
+ skb_transport_header(skb),
+ fraggap);
+ skb_prev->csum = csum_sub(skb_prev->csum,
+ skb->csum);
+ pskb_trim_unique(skb_prev, maxfraglen);
+ }
+
+ /*
+ * Put the packet on the pending queue.
+ */
+ __skb_queue_tail(&sk->sk_write_queue, skb);
+ continue;
+ }
+ xlength = length;
+
alloc_new_skb:
skb_prev = skb;
if (skb_prev)
@@ -1085,7 +1140,7 @@ static int __ip_append_data(struct sock *sk,
* If remaining data exceeds the mtu,
* we know we need more fragment(s).
*/
- datalen = length + fraggap;
+ datalen = xlength + fraggap;
if (datalen > mtu - fragheaderlen)
datalen = maxfraglen - fragheaderlen;
fraglen = datalen + fragheaderlen;
@@ -1099,7 +1154,7 @@ static int __ip_append_data(struct sock *sk,
* because we have no idea what fragment will be
* the last.
*/
- if (datalen == length + fraggap)
+ if (datalen == xlength + fraggap)
alloc_extra += rt->dst.trailer_len;
if ((flags & MSG_MORE) &&
@@ -1206,6 +1261,34 @@ static int __ip_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
+ } else if (flags & MSG_SPLICE_PAGES) {
+ struct msghdr *msg = from;
+ struct iov_iter *iter = &msg->msg_iter;
+ const struct bio_vec *bv = iter->bvec;
+
+ if (iov_iter_count(iter) <= 0) {
+ err = -EIO;
+ goto error;
+ }
+
+ copy = iov_iter_single_seg_count(&msg->msg_iter);
+
+ err = skb_append_pagefrags(skb, bv->bv_page,
+ bv->bv_offset + iter->iov_offset,
+ copy);
+ if (err < 0)
+ goto error;
+
+ if (skb->ip_summed == CHECKSUM_NONE) {
+ __wsum csum;
+ csum = csum_page(bv->bv_page,
+ bv->bv_offset + iter->iov_offset, copy);
+ skb->csum = csum_block_add(skb->csum, csum, skb->len);
+ }
+
+ iov_iter_advance(iter, copy);
+ skb_len_add(skb, copy);
+ refcount_add(copy, &sk->sk_wmem_alloc);
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;
Convert udp_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/udp.c | 50 +++++++++-----------------------------------------
1 file changed, 9 insertions(+), 41 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..097feb92e215 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1332,52 +1332,20 @@ EXPORT_SYMBOL(udp_sendmsg);
int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct inet_sock *inet = inet_sk(sk);
- struct udp_sock *up = udp_sk(sk);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE
+ };
int ret;
- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- if (!up->pending) {
- struct msghdr msg = { .msg_flags = flags|MSG_MORE };
-
- /* Call udp_sendmsg to specify destination address which
- * sendpage interface can't pass.
- * This will succeed only when the socket is connected.
- */
- ret = udp_sendmsg(sk, &msg, 0);
- if (ret < 0)
- return ret;
- }
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
lock_sock(sk);
-
- if (unlikely(!up->pending)) {
- release_sock(sk);
-
- net_dbg_ratelimited("cork failed\n");
- return -EINVAL;
- }
-
- ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
- page, offset, size, flags);
- if (ret == -EOPNOTSUPP) {
- release_sock(sk);
- return sock_no_sendpage(sk->sk_socket, page, offset,
- size, flags);
- }
- if (ret < 0) {
- udp_flush_pending_frames(sk);
- goto out;
- }
-
- up->len += size;
- if (!(READ_ONCE(up->corkflag) || (flags&MSG_MORE)))
- ret = udp_push_pending_frames(sk);
- if (!ret)
- ret = size;
-out:
+ ret = udp_sendmsg(sk, &msg, size);
release_sock(sk);
return ret;
}
Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/tcp.h | 2 --
net/ipv4/tcp.c | 21 +++++++--------------
2 files changed, 7 insertions(+), 16 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..844bc8e6a714 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -333,8 +333,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
int size_goal);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7c3acc5673e9..f1454e4497df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,14 +971,19 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
+ size_t size, int flags)
{
struct bio_vec bvec;
struct msghdr msg = {
.msg_flags = flags | MSG_SPLICE_PAGES,
};
+ if (!(sk->sk_route_caps & NETIF_F_SG))
+ return sock_no_sendpage_locked(sk, page, offset, size, flags);
+
+ tcp_rate_check_app_limited(sk); /* is sending application-limited? */
+
bvec_set_page(&bvec, page, size, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
@@ -987,18 +992,6 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
return tcp_sendmsg_locked(sk, &msg, size);
}
-EXPORT_SYMBOL_GPL(do_tcp_sendpages);
-
-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- if (!(sk->sk_route_caps & NETIF_F_SG))
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
- tcp_rate_check_app_limited(sk); /* is sending application-limited? */
-
- return do_tcp_sendpages(sk, page, offset, size, flags);
-}
EXPORT_SYMBOL_GPL(tcp_sendpage_locked);
int tcp_sendpage(struct sock *sk, struct page *page, int offset,
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.
Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 05052b49107f..8fc179321e2b 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
}
/*
- * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
*
* Using sendpage to push page by page appears to be less efficient
* than using sendmsg, even if data are copied.
@@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
size_t size)
{
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
+ MSG_SENDPAGE_NOTLAST),
+ };
struct sock *sk = s->sk;
- int i = 0, rv = 0, sent = 0,
- flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
+ int i = 0, rv = 0, sent = 0;
while (size) {
size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
if (size + offset <= PAGE_SIZE)
- flags = MSG_MORE | MSG_DONTWAIT;
+ msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
tcp_rate_check_app_limited(sk);
+ bvec_set_page(&bvec, page[i], bytes, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, bytes);
+
try_page_again:
lock_sock(sk);
- rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+ rv = tcp_sendmsg_locked(sk, &msg, bytes);
release_sock(sk);
if (rv > 0) {
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.
Signed-off-by: David Howells <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/tls.h | 2 +-
net/tls/tls_main.c | 24 +++++++++++++++---------
2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/include/net/tls.h b/include/net/tls.h
index 154949c7b0c8..d31521c36a84 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -256,7 +256,7 @@ struct tls_context {
struct scatterlist *partially_sent_record;
u16 partially_sent_offset;
- bool in_tcp_sendpages;
+ bool splicing_pages;
bool pending_open_record_frags;
struct mutex tx_lock; /* protects partially_sent_* fields and
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 3735cb00905d..8802b4f8b652 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -124,7 +124,10 @@ int tls_push_sg(struct sock *sk,
u16 first_offset,
int flags)
{
- int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST,
+ };
int ret = 0;
struct page *p;
size_t size;
@@ -133,16 +136,19 @@ int tls_push_sg(struct sock *sk,
size = sg->length - offset;
offset += sg->offset;
- ctx->in_tcp_sendpages = true;
+ ctx->splicing_pages = true;
while (1) {
if (sg_is_last(sg))
- sendpage_flags = flags;
+ msg.msg_flags = flags | MSG_SPLICE_PAGES;
/* is sending application-limited? */
tcp_rate_check_app_limited(sk);
p = sg_page(sg);
retry:
- ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+ bvec_set_page(&bvec, p, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+ ret = tcp_sendmsg_locked(sk, &msg, size);
if (ret != size) {
if (ret > 0) {
@@ -154,7 +160,7 @@ int tls_push_sg(struct sock *sk,
offset -= sg->offset;
ctx->partially_sent_offset = offset;
ctx->partially_sent_record = (void *)sg;
- ctx->in_tcp_sendpages = false;
+ ctx->splicing_pages = false;
return ret;
}
@@ -168,7 +174,7 @@ int tls_push_sg(struct sock *sk,
size = sg->length;
}
- ctx->in_tcp_sendpages = false;
+ ctx->splicing_pages = false;
return 0;
}
@@ -246,11 +252,11 @@ static void tls_write_space(struct sock *sk)
{
struct tls_context *ctx = tls_get_ctx(sk);
- /* If in_tcp_sendpages call lower protocol write space handler
+ /* If splicing_pages call lower protocol write space handler
* to ensure we wake up any waiting operations there. For example
- * if do_tcp_sendpages where to call sk_wait_event.
+ * if splicing pages were to call sk_wait_event.
*/
- if (ctx->in_tcp_sendpages) {
+ if (ctx->splicing_pages) {
ctx->sk_write_space(sk);
return;
}
Make AF_ALG sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
[!] Note that this makes use of netfs_extract_iter_to_sg() from netfslib.
This probably needs moving to core code somewhere.
Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/Kconfig | 1 +
crypto/af_alg.c | 29 +++++++++++++++++++++++++++--
crypto/algif_aead.c | 22 +++++++++++-----------
crypto/algif_skcipher.c | 8 ++++----
4 files changed, 43 insertions(+), 17 deletions(-)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 9c86f7045157..8c04ecbb4395 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1297,6 +1297,7 @@ menu "Userspace interface"
config CRYPTO_USER_API
tristate
+ select NETFS_SUPPORT # for netfs_extract_iter_to_sg()
config CRYPTO_USER_API_HASH
tristate "Hash algorithms"
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index feb989b32606..80ab4f6e018c 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -22,6 +22,7 @@
#include <linux/sched/signal.h>
#include <linux/security.h>
#include <linux/string.h>
+#include <linux/netfs.h>
#include <keys/user-type.h>
#include <keys/trusted-type.h>
#include <keys/encrypted-type.h>
@@ -970,6 +971,10 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
bool init = false;
int err = 0;
+ if ((msg->msg_flags & MSG_SPLICE_PAGES) &&
+ !iov_iter_is_bvec(&msg->msg_iter))
+ return -EINVAL;
+
if (msg->msg_controllen) {
err = af_alg_cmsg_send(msg, &con);
if (err)
@@ -1015,7 +1020,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
while (size) {
struct scatterlist *sg;
size_t len = size;
- size_t plen;
+ ssize_t plen;
/* use the existing memory in an allocated page */
if (ctx->merge) {
@@ -1060,7 +1065,27 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
if (sgl->cur)
sg_unmark_end(sg + sgl->cur - 1);
- if (1 /* TODO check MSG_SPLICE_PAGES */) {
+ if (msg->msg_flags & MSG_SPLICE_PAGES) {
+ struct sg_table sgtable = {
+ .sgl = sg,
+ .nents = sgl->cur,
+ .orig_nents = sgl->cur,
+ };
+
+ plen = netfs_extract_iter_to_sg(&msg->msg_iter, len,
+ &sgtable, MAX_SGL_ENTS, 0);
+ if (plen < 0) {
+ err = plen;
+ goto unlock;
+ }
+
+ for (; sgl->cur < sgtable.nents; sgl->cur++)
+ get_page(sg_page(&sg[sgl->cur]));
+ len -= plen;
+ ctx->used += plen;
+ copied += plen;
+ size -= plen;
+ } else {
do {
struct page *pg;
unsigned int i = sgl->cur;
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 42493b4d8ce4..279eb17a1dfc 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -9,8 +9,8 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
+ * filled by user space with the data submitted via sendmsg. Filling up
+ * the TX SGL does not cause a crypto operation -- the data will only be
* tracked by the kernel. Upon receipt of one recvmsg call, the caller must
* provide a buffer which is tracked with the RX SGL.
*
@@ -113,19 +113,19 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
}
/*
- * Data length provided by caller via sendmsg/sendpage that has not
- * yet been processed.
+ * Data length provided by caller via sendmsg that has not yet been
+ * processed.
*/
used = ctx->used;
/*
- * Make sure sufficient data is present -- note, the same check is
- * also present in sendmsg/sendpage. The checks in sendpage/sendmsg
- * shall provide an information to the data sender that something is
- * wrong, but they are irrelevant to maintain the kernel integrity.
- * We need this check here too in case user space decides to not honor
- * the error message in sendmsg/sendpage and still call recvmsg. This
- * check here protects the kernel integrity.
+ * Make sure sufficient data is present -- note, the same check is also
+ * present in sendmsg. The checks in sendmsg shall provide an
+ * information to the data sender that something is wrong, but they are
+ * irrelevant to maintain the kernel integrity. We need this check
+ * here too in case user space decides to not honor the error message
+ * in sendmsg and still call recvmsg. This check here protects the
+ * kernel integrity.
*/
if (!aead_sufficient_data(sk))
return -EINVAL;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index ee8890ee8f33..021f9ce7e87c 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -9,10 +9,10 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
- * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
- * provide a buffer which is tracked with the RX SGL.
+ * filled by user space with the data submitted via sendmsg. Filling up the TX
+ * SGL does not cause a crypto operation -- the data will only be tracked by
+ * the kernel. Upon receipt of one recvmsg call, the caller must provide a
+ * buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed
Convert af_alg_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
[!] Note that this makes use of netfs_extract_iter_to_sg() from netfslib.
This probably needs moving to core code somewhere.
Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/af_alg.c | 53 +++++++++----------------------------------------
1 file changed, 9 insertions(+), 44 deletions(-)
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 80ab4f6e018c..0e77fce60876 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1148,53 +1148,18 @@ EXPORT_SYMBOL_GPL(af_alg_sendmsg);
ssize_t af_alg_sendpage(struct socket *sock, struct page *page,
int offset, size_t size, int flags)
{
- struct sock *sk = sock->sk;
- struct alg_sock *ask = alg_sk(sk);
- struct af_alg_ctx *ctx = ask->private;
- struct af_alg_tsgl *sgl;
- int err = -EINVAL;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- lock_sock(sk);
- if (!ctx->more && ctx->used)
- goto unlock;
-
- if (!size)
- goto done;
-
- if (!af_alg_writable(sk)) {
- err = af_alg_wait_for_wmem(sk, flags);
- if (err)
- goto unlock;
- }
-
- err = af_alg_alloc_tsgl(sk);
- if (err)
- goto unlock;
-
- ctx->merge = 0;
- sgl = list_entry(ctx->tsgl_list.prev, struct af_alg_tsgl, list);
-
- if (sgl->cur)
- sg_unmark_end(sgl->sg + sgl->cur - 1);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES,
+ };
- sg_mark_end(sgl->sg + sgl->cur);
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- get_page(page);
- sg_set_page(sgl->sg + sgl->cur, page, size, offset);
- sgl->cur++;
- ctx->used += size;
-
-done:
- ctx->more = flags & MSG_MORE;
-
-unlock:
- af_alg_data_wakeup(sk);
- release_sock(sk);
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
- return err ?: size;
+ return sock_sendmsg(sock, &msg);
}
EXPORT_SYMBOL_GPL(af_alg_sendpage);
Put the loop in af_alg_sendmsg() inside an if-statement and indent it to
make the next patch easier to review, as that patch will add another branch
to the if-statement to handle MSG_SPLICE_PAGES.
Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/af_alg.c | 50 +++++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 24 deletions(-)
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 5f7252a5b7b4..feb989b32606 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1060,35 +1060,37 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
if (sgl->cur)
sg_unmark_end(sg + sgl->cur - 1);
- do {
- struct page *pg;
- unsigned int i = sgl->cur;
+ if (1 /* TODO check MSG_SPLICE_PAGES */) {
+ do {
+ struct page *pg;
+ unsigned int i = sgl->cur;
- plen = min_t(size_t, len, PAGE_SIZE);
+ plen = min_t(size_t, len, PAGE_SIZE);
- pg = alloc_page(GFP_KERNEL);
- if (!pg) {
- err = -ENOMEM;
- goto unlock;
- }
+ pg = alloc_page(GFP_KERNEL);
+ if (!pg) {
+ err = -ENOMEM;
+ goto unlock;
+ }
- sg_assign_page(sg + i, pg);
+ sg_assign_page(sg + i, pg);
- err = memcpy_from_msg(page_address(sg_page(sg + i)),
- msg, plen);
- if (err) {
- __free_page(sg_page(sg + i));
- sg_assign_page(sg + i, NULL);
- goto unlock;
- }
+ err = memcpy_from_msg(page_address(sg_page(sg + i)),
+ msg, plen);
+ if (err) {
+ __free_page(sg_page(sg + i));
+ sg_assign_page(sg + i, NULL);
+ goto unlock;
+ }
- sg[i].length = plen;
- len -= plen;
- ctx->used += plen;
- copied += plen;
- size -= plen;
- sgl->cur++;
- } while (len && sgl->cur < MAX_SGL_ENTS);
+ sg[i].length = plen;
+ len -= plen;
+ ctx->used += plen;
+ copied += plen;
+ size -= plen;
+ sgl->cur++;
+ } while (len && sgl->cur < MAX_SGL_ENTS);
+ }
if (!size)
sg_mark_end(sg + sgl->cur - 1);
When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.
To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator. The header and trailer (if present) are copied into
memory acquired from zcopy_alloc() which just breaks a page up into small
pieces that can be freed with put_page().
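In outline, the assembly described above amounts to something like this
sketch (illustrative only; zcopy_memdup() is the helper introduced earlier
in the series and the variable names are made up):

	struct bio_vec bvec[3];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT };
	int rv;

	rv = zcopy_memdup(hdr_len, hdr, &bvec[0], GFP_NOFS);	/* copy header */
	if (rv < 0)
		return rv;
	bvec_set_page(&bvec[1], data_page, data_len, data_off);	/* data by ref */
	rv = zcopy_memdup(trl_len, trl, &bvec[2], GFP_NOFS);	/* copy trailer */
	if (rv < 0)
		goto put_hdr;

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, 3,
		      hdr_len + data_len + trl_len);
	rv = sock_sendmsg(s, &msg);

The header and trailer copies are then released with put_page() (or
folio_put()) once the send has been attempted, as siw_tx_hdt() does below.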
Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
drivers/infiniband/sw/siw/siw_qp_tx.c | 231 +++++---------------------
1 file changed, 46 insertions(+), 185 deletions(-)
diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 8fc179321e2b..ec4f0ac324ce 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -8,6 +8,7 @@
#include <linux/net.h>
#include <linux/scatterlist.h>
#include <linux/highmem.h>
+#include <linux/zcopy_alloc.h>
#include <net/tcp.h>
#include <rdma/iw_cm.h>
@@ -312,114 +313,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
return rv;
}
-/*
- * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
- *
- * Using sendpage to push page by page appears to be less efficient
- * than using sendmsg, even if data are copied.
- *
- * A general performance limitation might be the extra four bytes
- * trailer checksum segment to be pushed after user data.
- */
-static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
- size_t size)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
- MSG_SENDPAGE_NOTLAST),
- };
- struct sock *sk = s->sk;
- int i = 0, rv = 0, sent = 0;
-
- while (size) {
- size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
-
- if (size + offset <= PAGE_SIZE)
- msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
-
- tcp_rate_check_app_limited(sk);
- bvec_set_page(&bvec, page[i], bytes, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
-try_page_again:
- lock_sock(sk);
- rv = tcp_sendmsg_locked(sk, &msg, size);
- release_sock(sk);
-
- if (rv > 0) {
- size -= rv;
- sent += rv;
- if (rv != bytes) {
- offset += rv;
- bytes -= rv;
- goto try_page_again;
- }
- offset = 0;
- } else {
- if (rv == -EAGAIN || rv == 0)
- break;
- return rv;
- }
- i++;
- }
- return sent;
-}
-
-/*
- * siw_0copy_tx()
- *
- * Pushes list of pages to TCP socket. If pages from multiple
- * SGE's, all referenced pages of each SGE are pushed in one
- * shot.
- */
-static int siw_0copy_tx(struct socket *s, struct page **page,
- struct siw_sge *sge, unsigned int offset,
- unsigned int size)
-{
- int i = 0, sent = 0, rv;
- int sge_bytes = min(sge->length - offset, size);
-
- offset = (sge->laddr + offset) & ~PAGE_MASK;
-
- while (sent != size) {
- rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
- if (rv >= 0) {
- sent += rv;
- if (size == sent || sge_bytes > rv)
- break;
-
- i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
- sge++;
- sge_bytes = min(sge->length, size - sent);
- offset = sge->laddr & ~PAGE_MASK;
- } else {
- sent = rv;
- break;
- }
- }
- return sent;
-}
-
#define MAX_TRAILER (MPA_CRC_SIZE + 4)
-static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
-{
- int i;
-
- /*
- * Work backwards through the array to honor the kmap_local_page()
- * ordering requirements.
- */
- for (i = (len-1); i >= 0; i--) {
- if (kmap_mask & BIT(i)) {
- unsigned long addr = (unsigned long)iov[i].iov_base;
-
- kunmap_local((void *)(addr & PAGE_MASK));
- }
- }
-}
-
/*
* siw_tx_hdt() tries to push a complete packet to TCP where all
* packet fragments are referenced by the elements of one iovec.
@@ -439,15 +334,13 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
{
struct siw_wqe *wqe = &c_tx->wqe_active;
struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
- struct kvec iov[MAX_ARRAY];
- struct page *page_array[MAX_ARRAY];
+ struct bio_vec bvec[MAX_ARRAY];
struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
pbl_idx = c_tx->pbl_idx;
- unsigned long kmap_mask = 0L;
if (c_tx->state == SIW_SEND_HDR) {
if (c_tx->use_sendpage) {
@@ -457,10 +350,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
c_tx->state = SIW_SEND_DATA;
} else {
- iov[0].iov_base =
- (char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
- iov[0].iov_len = hdr_len =
- c_tx->ctrl_len - c_tx->ctrl_sent;
+ const void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
+
+ hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+ rv = zcopy_memdup(hdr_len, hdr, &bvec[0], GFP_NOFS);
+ if (rv < 0)
+ goto done;
seg = 1;
}
}
@@ -478,28 +373,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
} else {
is_kva = 1;
}
- if (is_kva && !c_tx->use_sendpage) {
- /*
- * tx from kernel virtual address: either inline data
- * or memory region with assigned kernel buffer
- */
- iov[seg].iov_base =
- (void *)(uintptr_t)(sge->laddr + sge_off);
- iov[seg].iov_len = sge_len;
-
- if (do_crc)
- crypto_shash_update(c_tx->mpa_crc_hd,
- iov[seg].iov_base,
- sge_len);
- sge_off += sge_len;
- data_len -= sge_len;
- seg++;
- goto sge_done;
- }
while (sge_len) {
size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
- void *kaddr;
if (!is_kva) {
struct page *p;
@@ -512,33 +388,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
p = siw_get_upage(mem->umem,
sge->laddr + sge_off);
if (unlikely(!p)) {
- siw_unmap_pages(iov, kmap_mask, seg);
wqe->processed -= c_tx->bytes_unsent;
rv = -EFAULT;
goto done_crc;
}
- page_array[seg] = p;
-
- if (!c_tx->use_sendpage) {
- void *kaddr = kmap_local_page(p);
-
- /* Remember for later kunmap() */
- kmap_mask |= BIT(seg);
- iov[seg].iov_base = kaddr + fp_off;
- iov[seg].iov_len = plen;
-
- if (do_crc)
- crypto_shash_update(
- c_tx->mpa_crc_hd,
- iov[seg].iov_base,
- plen);
- } else if (do_crc) {
- kaddr = kmap_local_page(p);
- crypto_shash_update(c_tx->mpa_crc_hd,
- kaddr + fp_off,
- plen);
- kunmap_local(kaddr);
- }
+
+ bvec_set_page(&bvec[seg], p, plen, fp_off);
} else {
/*
* Cast to an uintptr_t to preserve all 64 bits
@@ -552,12 +407,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
* bits on a 64 bit platform and 32 bits on a
* 32 bit platform.
*/
- page_array[seg] = virt_to_page((void *)(va & PAGE_MASK));
- if (do_crc)
- crypto_shash_update(
- c_tx->mpa_crc_hd,
- (void *)va,
- plen);
+ bvec_set_virt(&bvec[seg], (void *)va, plen);
+ }
+
+ if (do_crc) {
+ void *kaddr = kmap_local_page(bvec[seg].bv_page);
+ crypto_shash_update(c_tx->mpa_crc_hd,
+ kaddr + bvec[seg].bv_offset,
+ bvec[seg].bv_len);
+ kunmap_local(kaddr);
}
sge_len -= plen;
@@ -567,13 +425,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
if (++seg > (int)MAX_ARRAY) {
siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
- siw_unmap_pages(iov, kmap_mask, seg-1);
wqe->processed -= c_tx->bytes_unsent;
rv = -EMSGSIZE;
goto done_crc;
}
}
-sge_done:
+
/* Update SGE variables at end of SGE */
if (sge_off == sge->length &&
(data_len != 0 || wqe->processed < wqe->bytes)) {
@@ -582,15 +439,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
sge_off = 0;
}
}
- /* trailer */
- if (likely(c_tx->state != SIW_SEND_TRAILER)) {
- iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
- iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
- } else {
- iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
- iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
- }
+ /* Set the CRC in the trailer */
if (c_tx->pad) {
*(u32 *)c_tx->trailer.pad = 0;
if (do_crc)
@@ -603,23 +453,31 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
else if (do_crc)
crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);
- data_len = c_tx->bytes_unsent;
+ /* Copy the trailer and add it to the output list */
+ if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+ void *trl = &c_tx->trailer.pad[4 - c_tx->pad];
- if (c_tx->use_sendpage) {
- rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
- c_tx->sge_off, data_len);
- if (rv == data_len) {
- rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
- if (rv > 0)
- rv += data_len;
- else
- rv = data_len;
- }
+ trl_len = MAX_TRAILER - (4 - c_tx->pad);
+ rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+ if (rv < 0)
+ goto done_crc;
} else {
- rv = kernel_sendmsg(s, &msg, iov, seg + 1,
- hdr_len + data_len + trl_len);
- siw_unmap_pages(iov, kmap_mask, seg);
+ void *trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
+
+ trl_len = MAX_TRAILER - c_tx->ctrl_sent;
+ rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+ if (rv < 0)
+ goto done_crc;
}
+
+ data_len = c_tx->bytes_unsent;
+
+ if (c_tx->use_sendpage)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
+ hdr_len + data_len + trl_len);
+ rv = sock_sendmsg(s, &msg);
+
if (rv < (int)hdr_len) {
/* Not even complete hdr pushed or negative rv */
wqe->processed -= data_len;
@@ -680,6 +538,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
}
done_crc:
c_tx->do_crc = 0;
+ if (c_tx->state == SIW_SEND_HDR)
+ folio_put(page_folio(bvec[0].bv_page));
+ folio_put(page_folio(bvec[seg].bv_page));
done:
return rv;
}
Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() to splice data from
a pipe to a socket. This paves the way for passing in multiple pages at
once from a pipe and the handling of multipage folios.
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
fs/splice.c | 42 +++++++++++++++++++++++-------------------
include/linux/fs.h | 2 --
include/linux/splice.h | 2 ++
net/socket.c | 26 ++------------------------
4 files changed, 27 insertions(+), 45 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index f46dd1fb367b..23ead122d631 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,7 @@
#include <linux/uio.h>
#include <linux/security.h>
#include <linux/gfp.h>
+#include <linux/net.h>
#include <linux/socket.h>
#include <linux/sched/signal.h>
@@ -410,29 +411,32 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
};
EXPORT_SYMBOL(nosteal_pipe_buf_ops);
+#ifdef CONFIG_NET
/*
* Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
* using sendpage(). Return the number of bytes sent.
*/
-static int pipe_to_sendpage(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf, struct splice_desc *sd)
+static int pipe_to_sendmsg(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf, struct splice_desc *sd)
{
- struct file *file = sd->u.file;
- loff_t pos = sd->pos;
- int more;
-
- if (!likely(file->f_op->sendpage))
- return -EINVAL;
+ struct socket *sock = sock_from_file(sd->u.file);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES,
+ };
- more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
+ if (sd->flags & SPLICE_F_MORE)
+ msg.msg_flags |= MSG_MORE;
if (sd->len < sd->total_len &&
pipe_occupancy(pipe->head, pipe->tail) > 1)
- more |= MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;
- return file->f_op->sendpage(file, buf->page, buf->offset,
- sd->len, &pos, more);
+ bvec_set_page(&bvec, buf->page, sd->len, buf->offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, sd->len);
+ return sock_sendmsg(sock, &msg);
}
+#endif
static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
@@ -614,7 +618,7 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
* Description:
* This function does little more than loop over the pipe and call
* @actor to do the actual moving of a single struct pipe_buffer to
- * the desired destination. See pipe_to_file, pipe_to_sendpage, or
+ * the desired destination. See pipe_to_file, pipe_to_sendmsg, or
* pipe_to_user.
*
*/
@@ -795,8 +799,9 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
EXPORT_SYMBOL(iter_file_splice_write);
+#ifdef CONFIG_NET
/**
- * generic_splice_sendpage - splice data from a pipe to a socket
+ * splice_to_socket - splice data from a pipe to a socket
* @pipe: pipe to splice from
* @out: socket to write to
* @ppos: position in @out
@@ -808,13 +813,12 @@ EXPORT_SYMBOL(iter_file_splice_write);
* is involved.
*
*/
-ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
- loff_t *ppos, size_t len, unsigned int flags)
+ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags)
{
- return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendpage);
+ return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendmsg);
}
-
-EXPORT_SYMBOL(generic_splice_sendpage);
+#endif
static int warn_unsupported(struct file *file, const char *op)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..f3ccc243851e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2740,8 +2740,6 @@ extern ssize_t generic_file_splice_read(struct file *, loff_t *,
struct pipe_inode_info *, size_t, unsigned int);
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
struct file *, loff_t *, size_t, unsigned int);
-extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
- struct file *out, loff_t *, size_t len, unsigned int flags);
extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
loff_t *opos, size_t len, unsigned int flags);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 8f052c3dae95..e6153feda86c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -87,6 +87,8 @@ extern long do_splice(struct file *in, loff_t *off_in,
extern long do_tee(struct file *in, struct file *out, size_t len,
unsigned int flags);
+extern ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags);
/*
* for dynamic pipe sizing
diff --git a/net/socket.c b/net/socket.c
index 6bae8ce7059e..1b48a976b8cc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -57,6 +57,7 @@
#include <linux/mm.h>
#include <linux/socket.h>
#include <linux/file.h>
+#include <linux/splice.h>
#include <linux/net.h>
#include <linux/interrupt.h>
#include <linux/thread_info.h>
@@ -126,8 +127,6 @@ static long compat_sock_ioctl(struct file *file,
unsigned int cmd, unsigned long arg);
#endif
static int sock_fasync(int fd, struct file *filp, int on);
-static ssize_t sock_sendpage(struct file *file, struct page *page,
- int offset, size_t size, loff_t *ppos, int more);
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
@@ -162,8 +161,7 @@ static const struct file_operations socket_file_ops = {
.mmap = sock_mmap,
.release = sock_close,
.fasync = sock_fasync,
- .sendpage = sock_sendpage,
- .splice_write = generic_splice_sendpage,
+ .splice_write = splice_to_socket,
.splice_read = sock_splice_read,
.show_fdinfo = sock_show_fdinfo,
};
@@ -1062,26 +1060,6 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(kernel_recvmsg);
-static ssize_t sock_sendpage(struct file *file, struct page *page,
- int offset, size_t size, loff_t *ppos, int more)
-{
- struct socket *sock;
- int flags;
- int ret;
-
- sock = file->private_data;
-
- flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
- /* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
- flags |= more;
-
- ret = kernel_sendpage(sock, page, offset, size, flags);
-
- if (trace_sock_send_length_enabled())
- call_trace_sock_send_length(sock->sk, ret, 0);
- return ret;
-}
-
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
Remove file->f_op->sendpage as splicing to a socket now calls sendmsg
rather than sendpage.
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/fs.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3ccc243851e..a9f1b2543d2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1773,7 +1773,6 @@ struct file_operations {
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock(). This causes pages to be spliced from the source iterator
if possible (the iterator must be ITER_BVEC and the pages must be
spliceable).
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front as well; a rough sketch of that idea follows.
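Hypothetically (this is not what the patch implements, and it assumes
every frag page can be spliced), such a batched send might look like:

	struct bio_vec bv[MAX_SKB_FRAGS];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT };
	unsigned int i, nr = skb_shinfo(skb)->nr_frags;
	size_t size = 0;

	/* Gather all the frags into one bio_vec array... */
	for (i = 0; i < nr; i++) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		bvec_set_page(&bv[i], skb_frag_page(frag),
			      skb_frag_size(frag), skb_frag_off(frag));
		size += skb_frag_size(frag);
	}

	/* ...and push them with a single sendmsg call. */
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, nr, size);
	ret = sock_sendmsg(sock, &msg);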
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/core/skbuff.c | 49 ++++++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 22 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb7d33b41e71..9fa333e26b7d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2927,32 +2927,32 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
}
EXPORT_SYMBOL_GPL(skb_splice_bits);
-static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size)
+static int sendmsg_locked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;
+ size_t size = msg_data_left(msg);
if (!sock)
return -EINVAL;
- return kernel_sendmsg(sock, msg, vec, num, size);
+
+ if (!sock->ops->sendmsg_locked)
+ return sock_no_sendmsg_locked(sk, msg, size);
+
+ return sock->ops->sendmsg_locked(sk, msg, size);
}
-static int sendpage_unlocked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;
if (!sock)
return -EINVAL;
- return kernel_sendpage(sock, page, offset, size, flags);
+ return sock_sendmsg(sock, msg);
}
-typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size);
-typedef int (*sendpage_func)(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
+typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg);
static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
- int len, sendmsg_func sendmsg, sendpage_func sendpage)
+ int len, sendmsg_func sendmsg)
{
unsigned int orig_len = len;
struct sk_buff *head = skb;
@@ -2972,8 +2972,9 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
memset(&msg, 0, sizeof(msg));
msg.msg_flags = MSG_DONTWAIT;
- ret = INDIRECT_CALL_2(sendmsg, kernel_sendmsg_locked,
- sendmsg_unlocked, sk, &msg, &kv, 1, slen);
+ iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &kv, 1, slen);
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;
@@ -3004,11 +3005,17 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
slen = min_t(size_t, len, skb_frag_size(frag) - offset);
while (slen) {
- ret = INDIRECT_CALL_2(sendpage, kernel_sendpage_locked,
- sendpage_unlocked, sk,
- skb_frag_page(frag),
- skb_frag_off(frag) + offset,
- slen, MSG_DONTWAIT);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
+ };
+
+ bvec_set_page(&bvec, skb_frag_page(frag), slen,
+ skb_frag_off(frag) + offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, slen);
+
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;
@@ -3045,16 +3052,14 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
int len)
{
- return __skb_send_sock(sk, skb, offset, len, kernel_sendmsg_locked,
- kernel_sendpage_locked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_locked);
}
EXPORT_SYMBOL_GPL(skb_send_sock_locked);
/* Send skb data on a socket. Socket must be unlocked. */
int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
{
- return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked,
- sendpage_unlocked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked);
}
/**
Translate tcp_bpf_sendpage() calls to tcp_bpf_sendmsg(MSG_SPLICE_PAGES).
Signed-off-by: David Howells <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ipv4/tcp_bpf.c | 49 +++++++++-------------------------------------
1 file changed, 9 insertions(+), 40 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 7f17134637eb..de37a4372437 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -485,49 +485,18 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct sk_msg tmp, *msg = NULL;
- int err = 0, copied = 0;
- struct sk_psock *psock;
- bool enospc = false;
-
- psock = sk_psock_get(sk);
- if (unlikely(!psock))
- return tcp_sendpage(sk, page, offset, size, flags);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES,
+ };
- lock_sock(sk);
- if (psock->cork) {
- msg = psock->cork;
- } else {
- msg = &tmp;
- sk_msg_init(msg);
- }
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- /* Catch case where ring is full and sendpage is stalled. */
- if (unlikely(sk_msg_full(msg)))
- goto out_err;
-
- sk_msg_page_add(msg, page, size, offset);
- sk_mem_charge(sk, size);
- copied = size;
- if (sk_msg_full(msg))
- enospc = true;
- if (psock->cork_bytes) {
- if (size > psock->cork_bytes)
- psock->cork_bytes = 0;
- else
- psock->cork_bytes -= size;
- if (psock->cork_bytes && !enospc)
- goto out_err;
- /* All cork bytes are accounted, rerun the prog. */
- psock->eval = __SK_NONE;
- psock->cork_bytes = 0;
- }
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
- err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
-out_err:
- release_sock(sk);
- sk_psock_put(sk, psock);
- return copied ? copied : err;
+ return tcp_bpf_sendmsg(sk, &msg, size);
}
enum {
Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.
Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v1.c | 58 ++++++++++++++---------------------------
1 file changed, 19 insertions(+), 39 deletions(-)
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index d664cb1593a7..b2d801a49122 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -74,37 +74,6 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
return r;
}
-/*
- * @more: either or both of MSG_MORE and MSG_SENDPAGE_NOTLAST
- */
-static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int more)
-{
- ssize_t (*sendpage)(struct socket *sock, struct page *page,
- int offset, size_t size, int flags);
- int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
- int ret;
-
- /*
- * sendpage cannot properly handle pages with page_count == 0,
- * we need to fall back to sendmsg if that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag which
- * triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(page))
- sendpage = sock->ops->sendpage;
- else
- sendpage = sock_no_sendpage;
-
- ret = sendpage(sock, page, offset, size, flags);
- if (ret == -EAGAIN)
- ret = 0;
-
- return ret;
-}
-
static void con_out_kvec_reset(struct ceph_connection *con)
{
BUG_ON(con->v1.out_skip);
@@ -464,7 +433,6 @@ static int write_partial_message_data(struct ceph_connection *con)
struct ceph_msg *msg = con->out_msg;
struct ceph_msg_data_cursor *cursor = &msg->cursor;
bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
u32 crc;
dout("%s %p msg %p\n", __func__, con, msg);
@@ -482,6 +450,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
while (cursor->total_resid) {
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST,
+ };
struct page *page;
size_t page_offset;
size_t length;
@@ -494,9 +466,12 @@ static int write_partial_message_data(struct ceph_connection *con)
page = ceph_msg_data_next(cursor, &page_offset, &length);
if (length == cursor->total_resid)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
- more);
+ msghdr.msg_flags |= MSG_MORE;
+
+ bvec_set_page(&bvec, page, length, page_offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, length);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0) {
if (do_datacrc)
msg->footer.data_crc = cpu_to_le32(crc);
@@ -526,7 +501,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
static int write_partial_skip(struct ceph_connection *con)
{
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST | MSG_MORE,
+ };
int ret;
dout("%s %p %d left\n", __func__, con, con->v1.out_skip);
@@ -534,9 +512,11 @@ static int write_partial_skip(struct ceph_connection *con)
size_t size = min(con->v1.out_skip, (int)PAGE_SIZE);
if (size == con->v1.out_skip)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, ceph_zero_page, 0, size,
- more);
+ msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
+ bvec_set_page(&bvec, ZERO_PAGE(0), size, 0);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0)
goto out;
con->v1.out_skip -= ret;
Remove hash_sendpage*() and use hash_sendmsg() as the latter seems to just
use the source pages directly anyway.
Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/algif_hash.c | 66 ---------------------------------------------
1 file changed, 66 deletions(-)
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 1d017ec5c63c..52f5828a054a 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -129,58 +129,6 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
return err ?: copied;
}
-static ssize_t hash_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- struct alg_sock *ask = alg_sk(sk);
- struct hash_ctx *ctx = ask->private;
- int err;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- lock_sock(sk);
- sg_init_table(ctx->sgl.sg, 1);
- sg_set_page(ctx->sgl.sg, page, size, offset);
-
- if (!(flags & MSG_MORE)) {
- err = hash_alloc_result(sk, ctx);
- if (err)
- goto unlock;
- } else if (!ctx->more)
- hash_free_result(sk, ctx);
-
- ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, ctx->result, size);
-
- if (!(flags & MSG_MORE)) {
- if (ctx->more)
- err = crypto_ahash_finup(&ctx->req);
- else
- err = crypto_ahash_digest(&ctx->req);
- } else {
- if (!ctx->more) {
- err = crypto_ahash_init(&ctx->req);
- err = crypto_wait_req(err, &ctx->wait);
- if (err)
- goto unlock;
- }
-
- err = crypto_ahash_update(&ctx->req);
- }
-
- err = crypto_wait_req(err, &ctx->wait);
- if (err)
- goto unlock;
-
- ctx->more = flags & MSG_MORE;
-
-unlock:
- release_sock(sk);
-
- return err ?: size;
-}
-
static int hash_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int flags)
{
@@ -285,7 +233,6 @@ static struct proto_ops algif_hash_ops = {
.release = af_alg_release,
.sendmsg = hash_sendmsg,
- .sendpage = hash_sendpage,
.recvmsg = hash_recvmsg,
.accept = hash_accept,
};
@@ -337,18 +284,6 @@ static int hash_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return hash_sendmsg(sock, msg, size);
}
-static ssize_t hash_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = hash_check_key(sock);
- if (err)
- return err;
-
- return hash_sendpage(sock, page, offset, size, flags);
-}
-
static int hash_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -387,7 +322,6 @@ static struct proto_ops algif_hash_ops_nokey = {
.release = af_alg_release,
.sendmsg = hash_sendmsg_nokey,
- .sendpage = hash_sendpage_nokey,
.recvmsg = hash_recvmsg_nokey,
.accept = hash_accept_nokey,
};
When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header and data
pages.
To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator. The header is copied into memory acquired from
zcopy_alloc() which just breaks a page up into small pieces that can be
freed with put_page().
Signed-off-by: David Howells <[email protected]>
cc: Santosh Shilimkar <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
net/rds/tcp_send.c | 80 ++++++++++++++++++++--------------------------
1 file changed, 35 insertions(+), 45 deletions(-)
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 8c4d1d6e9249..0d6eb85a930d 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -32,6 +32,7 @@
*/
#include <linux/kernel.h>
#include <linux/in.h>
+#include <linux/zcopy_alloc.h>
#include <net/tcp.h>
#include "rds_single_path.h"
@@ -52,29 +53,24 @@ void rds_tcp_xmit_path_complete(struct rds_conn_path *cp)
tcp_sock_set_cork(tc->t_sock->sk, false);
}
-/* the core send_sem serializes this with other xmit and shutdown */
-static int rds_tcp_sendmsg(struct socket *sock, void *data, unsigned int len)
-{
- struct kvec vec = {
- .iov_base = data,
- .iov_len = len,
- };
- struct msghdr msg = {
- .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
- };
-
- return kernel_sendmsg(sock, &msg, &vec, 1, vec.iov_len);
-}
-
/* the core send_sem serializes this with other xmit and shutdown */
int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off)
{
struct rds_conn_path *cp = rm->m_inc.i_conn_path;
struct rds_tcp_connection *tc = cp->cp_transport_data;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+ };
+ struct bio_vec *bvec;
+ unsigned int i, size = 0, ix = 0;
+ bool free_hdr = false;
int done = 0;
- int ret = 0;
- int more;
+ int ret = -ENOMEM;
+
+ bvec = kmalloc_array(1 + sg, sizeof(struct bio_vec), GFP_KERNEL);
+ if (!bvec)
+ goto out;
if (hdr_off == 0) {
/*
@@ -101,41 +97,30 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
/* see rds_tcp_write_space() */
set_bit(SOCK_NOSPACE, &tc->t_sock->sk->sk_socket->flags);
- ret = rds_tcp_sendmsg(tc->t_sock,
- (void *)&rm->m_inc.i_hdr + hdr_off,
- sizeof(rm->m_inc.i_hdr) - hdr_off);
+ ret = zcopy_memdup(sizeof(rm->m_inc.i_hdr) - hdr_off,
+ (void *)&rm->m_inc.i_hdr + hdr_off,
+ &bvec[ix], GFP_KERNEL);
if (ret < 0)
goto out;
- done += ret;
- if (hdr_off + done != sizeof(struct rds_header))
- goto out;
+ free_hdr = true;
+ size += bvec[ix].bv_len;
+ ix++;
}
- more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0;
- while (sg < rm->data.op_nents) {
- int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
-
- ret = tc->t_sock->ops->sendpage(tc->t_sock,
- sg_page(&rm->data.op_sg[sg]),
- rm->data.op_sg[sg].offset + off,
- rm->data.op_sg[sg].length - off,
- flags);
- rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
- rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
- ret);
- if (ret <= 0)
- break;
-
- off += ret;
- done += ret;
- if (off == rm->data.op_sg[sg].length) {
- off = 0;
- sg++;
- }
- if (sg == rm->data.op_nents - 1)
- more = 0;
+ for (i = sg; i < rm->data.op_nents; i++) {
+ bvec_set_page(&bvec[ix],
+ sg_page(&rm->data.op_sg[i]),
+ rm->data.op_sg[i].length - off,
+ rm->data.op_sg[i].offset + off);
+ off = 0;
+ size += bvec[ix].bv_len;
+ ix++;
}
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, ix, size);
+ ret = sock_sendmsg(tc->t_sock, &msg);
+ rdsdebug("tcp sendmsg-splice %u,%u ret %d\n", ix, size, ret);
+
out:
if (ret <= 0) {
/* write_space will hit after EAGAIN, all else fatal */
@@ -158,6 +143,11 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
}
if (done == 0)
done = ret;
+ if (bvec) {
+ if (free_hdr)
+ put_page(bvec[0].bv_page);
+ kfree(bvec);
+ }
return done;
}
Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage. This allows
multiple pages and multipage folios to be passed through.
TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array for the
entire set of pages it's going to transfer, plus two more slots for the
header and trailer, use zcopy_alloc() to allocate the header and trailer,
and then call sendmsg() once for the entire message.
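Very roughly, that TODO might look like the following sketch (hypothetical;
error and partial-send handling are elided, and nr_sg_pages, hdr, trl and
their lengths are illustrative names, not from the driver):

	struct bio_vec *bv;
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
	struct scatterlist *sg;
	unsigned int ix = 1;
	int j, ret;

	bv = kmalloc_array(nr_sg_pages + 2, sizeof(*bv), GFP_KERNEL);

	zcopy_memdup(hdr_len, hdr, &bv[0], GFP_KERNEL);		/* header */
	for_each_sg(cmd->first_data_sg, sg, nr_sg_pages, j)
		bvec_set_page(&bv[ix++], sg_page(sg), sg->length, sg->offset);
	zcopy_memdup(trl_len, trl, &bv[ix++], GFP_KERNEL);	/* trailer */

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, ix,
		      hdr_len + data_len + trl_len);
	ret = sock_sendmsg(conn->sock, &msg);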
Signed-off-by: David Howells <[email protected]>
cc: "Martin K. Petersen" <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/target/iscsi/iscsi_target_util.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 26dc8ed3045b..c7d58e41ac3b 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1078,6 +1078,8 @@ int iscsit_fe_sendpage_sg(
struct iscsit_conn *conn)
{
struct scatterlist *sg = cmd->first_data_sg;
+ struct bio_vec bvec;
+ struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
struct kvec iov;
u32 tx_hdr_size, data_len;
u32 offset = cmd->first_data_sg_off;
@@ -1121,17 +1123,17 @@ int iscsit_fe_sendpage_sg(
u32 space = (sg->length - offset);
u32 sub_len = min_t(u32, data_len, space);
send_pg:
- tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ bvec_set_page(&bvec, sg_page(sg), sub_len, sg->offset + offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, sub_len);
+
+ tx_sent = conn->sock->ops->sendmsg(conn->sock, &msghdr, sub_len);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
- pr_err("tcp_sendpage() returned"
- " -EAGAIN\n");
+ pr_err("sendmsg/splice returned -EAGAIN\n");
goto send_pg;
}
- pr_err("tcp_sendpage() failure: %d\n",
- tx_sent);
+ pr_err("sendmsg/splice failure: %d\n", tx_sent);
return -1;
}
When transmitting data, call down to the next layer using a single sendmsg
with MSG_SPLICE_PAGES to indicate that content should be spliced rather
than using
sendpage. This allows ->sendpage() to be replaced by something that can
handle multiple multipage folios in a single transaction.
Signed-off-by: David Howells <[email protected]>
cc: Christine Caulfield <[email protected]>
cc: David Teigland <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
fs/dlm/lowcomms.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index a9b14f81d655..9c0c691b6106 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1394,8 +1394,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
/* Send a message */
static int send_to_sock(struct connection *con)
{
- const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
struct writequeue_entry *e;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+ };
int len, offset, ret;
spin_lock_bh(&con->writequeue_lock);
@@ -1411,8 +1414,9 @@ static int send_to_sock(struct connection *con)
WARN_ON_ONCE(len == 0 && e->users == 0);
spin_unlock_bh(&con->writequeue_lock);
- ret = kernel_sendpage(con->sock, e->page, offset, len,
- msg_flags);
+ bvec_set_page(&bvec, e->page, len, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(con->sock, &msg);
trace_dlm_send(con->nodeid, ret);
if (ret == -EAGAIN || ret == 0) {
lock_sock(con->sock->sk);
Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.
Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v2.c | 89 +++++++++--------------------------------
1 file changed, 18 insertions(+), 71 deletions(-)
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index 301a991dc6a6..1637a0c21126 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -117,91 +117,38 @@ static int ceph_tcp_recv(struct ceph_connection *con)
return ret;
}
-static int do_sendmsg(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- int ret;
-
- msg.msg_iter = *it;
- while (iov_iter_count(it)) {
- ret = sock_sendmsg(sock, &msg);
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- WARN_ON(msg_data_left(&msg));
- return 1;
-}
-
-static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- struct bio_vec bv;
- int ret;
-
- if (WARN_ON(!iov_iter_is_bvec(it)))
- return -EINVAL;
-
- while (iov_iter_count(it)) {
- /* iov_iter_iovec() for ITER_BVEC */
- bvec_set_page(&bv, it->bvec->bv_page,
- min(iov_iter_count(it),
- it->bvec->bv_len - it->iov_offset),
- it->bvec->bv_offset + it->iov_offset);
-
- /*
- * sendpage cannot properly handle pages with
- * page_count == 0, we need to fall back to sendmsg if
- * that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag
- * which triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(bv.bv_page)) {
- ret = sock->ops->sendpage(sock, bv.bv_page,
- bv.bv_offset, bv.bv_len,
- CEPH_MSG_FLAGS);
- } else {
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, bv.bv_len);
- ret = sock_sendmsg(sock, &msg);
- }
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- return 1;
-}
-
/*
* Write as much as possible. The socket is expected to be corked,
* so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
*
* Return:
- * 1 - done, nothing (else) to write
+ * >0 - done, nothing (else) to write
* 0 - socket is full, need to wait
* <0 - error
*/
static int ceph_tcp_send(struct ceph_connection *con)
{
+ struct msghdr msg = {
+ .msg_iter = con->v2.out_iter,
+ .msg_flags = CEPH_MSG_FLAGS,
+ };
int ret;
+ if (WARN_ON(!iov_iter_is_bvec(&con->v2.out_iter)))
+ return -EINVAL;
+
+ if (con->v2.out_iter_sendpage)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+
dout("%s con %p have %zu try_sendpage %d\n", __func__, con,
iov_iter_count(&con->v2.out_iter), con->v2.out_iter_sendpage);
- if (con->v2.out_iter_sendpage)
- ret = do_try_sendpage(con->sock, &con->v2.out_iter);
- else
- ret = do_sendmsg(con->sock, &con->v2.out_iter);
+
+ ret = sock_sendmsg(con->sock, &msg);
+ if (ret > 0)
+ iov_iter_advance(&con->v2.out_iter, ret);
+ else if (ret == -EAGAIN)
+ ret = 0;
+
dout("%s con %p ret %d left %zu\n", __func__, con, ret,
iov_iter_count(&con->v2.out_iter));
return ret;
When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.
To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator. The bio_vec array has two extra slots before the
first for headers and one after the last for a trailer. The headers and
trailer are copied into memory acquired from zcopy_alloc() which just
breaks a page up into small pieces that can be freed with put_page().
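For clarity, the iterator layout this produces (xdr->bvec points two slots
into the allocation made by xdr_alloc_bvec() below):

	xdr->bvec[-2]      record marker  (copied via zcopy_memdup())
	xdr->bvec[-1]      head kvec      (copied via zcopy_memdup())
	xdr->bvec[0..n-1]  body pages     (spliced by reference)
	xdr->bvec[n]       tail kvec      (copied via zcopy_memdup())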
Signed-off-by: David Howells <[email protected]>
cc: Trond Myklebust <[email protected]>
cc: Anna Schumaker <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/sunrpc/svcsock.c | 70 ++++++++++++--------------------------------
net/sunrpc/xdr.c | 24 ++++++++++++---
2 files changed, 38 insertions(+), 56 deletions(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 03a4f5615086..1fa41ddbc40e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -36,6 +36,7 @@
#include <linux/skbuff.h>
#include <linux/file.h>
#include <linux/freezer.h>
+#include <linux/zcopy_alloc.h>
#include <net/sock.h>
#include <net/checksum.h>
#include <net/ip.h>
@@ -1060,16 +1061,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
return 0; /* record not complete */
}
-static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
- int flags)
-{
- return kernel_sendpage(sock, virt_to_page(vec->iov_base),
- offset_in_page(vec->iov_base),
- vec->iov_len, flags);
-}
-
/*
- * kernel_sendpage() is used exclusively to reduce the number of
+ * MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
* that the pages backing @xdr are unchanging.
*
@@ -1081,65 +1074,38 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
{
const struct kvec *head = xdr->head;
const struct kvec *tail = xdr->tail;
- struct kvec rm = {
- .iov_base = &marker,
- .iov_len = sizeof(marker),
- };
struct msghdr msg = {
- .msg_flags = 0,
+ .msg_flags = MSG_SPLICE_PAGES,
};
- int ret;
+ int ret, n = xdr_buf_pagecount(xdr), size;
*sentp = 0;
ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
if (ret < 0)
return ret;
- ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
+ ret = zcopy_memdup(sizeof(marker), &marker, &xdr->bvec[-2], GFP_KERNEL);
if (ret < 0)
return ret;
- *sentp += ret;
- if (ret != rm.iov_len)
- return -EAGAIN;
- ret = svc_tcp_send_kvec(sock, head, 0);
+ ret = zcopy_memdup(head->iov_len, head->iov_base, &xdr->bvec[-1], GFP_KERNEL);
if (ret < 0)
return ret;
- *sentp += ret;
- if (ret != head->iov_len)
- goto out;
- if (xdr->page_len) {
- unsigned int offset, len, remaining;
- struct bio_vec *bvec;
-
- bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
- offset = offset_in_page(xdr->page_base);
- remaining = xdr->page_len;
- while (remaining > 0) {
- len = min(remaining, bvec->bv_len - offset);
- ret = kernel_sendpage(sock, bvec->bv_page,
- bvec->bv_offset + offset,
- len, 0);
- if (ret < 0)
- return ret;
- *sentp += ret;
- if (ret != len)
- goto out;
- remaining -= len;
- offset = 0;
- bvec++;
- }
- }
+ ret = zcopy_memdup(tail->iov_len, tail->iov_base, &xdr->bvec[n], GFP_KERNEL);
+ if (ret < 0)
+ return ret;
- if (tail->iov_len) {
- ret = svc_tcp_send_kvec(sock, tail, 0);
- if (ret < 0)
- return ret;
- *sentp += ret;
- }
+ size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 2, n + 3, size);
-out:
+ ret = sock_sendmsg(sock, &msg);
+ if (ret < 0)
+ return ret;
+ if (ret > 0)
+ *sentp = ret;
+ if (ret != size)
+ return -EAGAIN;
return 0;
}
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 36835b2f5446..6dff0b4f17b8 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -145,14 +145,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
{
size_t i, n = xdr_buf_pagecount(buf);
- if (n != 0 && buf->bvec == NULL) {
- buf->bvec = kmalloc_array(n, sizeof(buf->bvec[0]), gfp);
+ if (buf->bvec == NULL) {
+ /* Allow for two headers and a trailer to be attached */
+ buf->bvec = kmalloc_array(n + 3, sizeof(buf->bvec[0]), gfp);
if (!buf->bvec)
return -ENOMEM;
+ buf->bvec += 2;
+ buf->bvec[-2].bv_page = NULL;
+ buf->bvec[-1].bv_page = NULL;
for (i = 0; i < n; i++) {
bvec_set_page(&buf->bvec[i], buf->pages[i], PAGE_SIZE,
0);
}
+ buf->bvec[n].bv_page = NULL;
}
return 0;
}
@@ -160,8 +165,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
void
xdr_free_bvec(struct xdr_buf *buf)
{
- kfree(buf->bvec);
- buf->bvec = NULL;
+ if (buf->bvec) {
+ size_t n = xdr_buf_pagecount(buf);
+
+ if (buf->bvec[-2].bv_page)
+ put_page(buf->bvec[-2].bv_page);
+ if (buf->bvec[-1].bv_page)
+ put_page(buf->bvec[-1].bv_page);
+ if (buf->bvec[n].bv_page)
+ put_page(buf->bvec[n].bv_page);
+ buf->bvec -= 2;
+ kfree(buf->bvec);
+ buf->bvec = NULL;
+ }
}
/**
[!] Note: This is a work in progress. At the moment, some things won't
build if this patch is applied: nvme, kcm, smc and tls.
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with
MSG_SPLICE_PAGES should be used instead. This allows multiple pages and
multipage folios to be passed through.
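From the caller's side, the replacement for a run of sendpage calls is a
single sendmsg over a bio_vec array, along these lines (a sketch, assuming
the pages can be retained with get_page()):

	struct bio_vec bv[2];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
	int ret;

	bvec_set_page(&bv[0], page0, len0, offset0);
	bvec_set_page(&bv[1], page1, len1, offset1);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 2, len0 + len1);
	ret = sock_sendmsg(sock, &msg);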
Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
Documentation/networking/scaling.rst | 4 +-
crypto/af_alg.c | 29 ------
crypto/algif_aead.c | 22 +----
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 14 ---
include/linux/net.h | 8 --
include/net/inet_common.h | 2 -
include/net/sock.h | 6 --
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/core/sock.c | 35 +------
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 ----
net/ipv4/tcp.c | 36 -------
net/ipv4/tcp_bpf.c | 21 +---
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 22 -----
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/socket.c | 48 ---------
net/tipc/socket.c | 3 -
net/unix/af_unix.c | 139 ---------------------------
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
52 files changed, 9 insertions(+), 449 deletions(-)
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..92c9fb46d6a2 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes.
rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 0e77fce60876..225c90657f58 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -483,7 +483,6 @@ static const struct proto_ops alg_proto_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
@@ -1135,34 +1134,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
}
EXPORT_SYMBOL_GPL(af_alg_sendmsg);
-/**
- * af_alg_sendpage - sendpage system call handler
- * @sock: socket of connection to user space to write to
- * @page: data to send
- * @offset: offset into page to begin sending
- * @size: length of data
- * @flags: message send/receive flags
- *
- * This is a generic implementation of sendpage to fill ctx->tsgl_list.
- */
-ssize_t af_alg_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES,
- };
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return sock_sendmsg(sock, &msg);
-}
-EXPORT_SYMBOL_GPL(af_alg_sendpage);
-
/**
* af_alg_free_resources - release resources required for crypto request
* @areq: Request holding the TX and RX SGL
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 279eb17a1dfc..b65baefe6123 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -9,10 +9,10 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage. Filling up
- * the TX SGL does not cause a crypto operation -- the data will only be
- * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
- * provide a buffer which is tracked with the RX SGL.
+ * filled by user space with the data submitted via sendmsg (maybe with
+ * MSG_SPLICE_PAGES). Filling up the TX SGL does not cause a crypto operation
+ * -- the data will only be tracked by the kernel. Upon receipt of one recvmsg
+ * call, the caller must provide a buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed
@@ -368,7 +368,6 @@ static struct proto_ops algif_aead_ops = {
.release = af_alg_release,
.sendmsg = aead_sendmsg,
- .sendpage = af_alg_sendpage,
.recvmsg = aead_recvmsg,
.poll = af_alg_poll,
};
@@ -420,18 +419,6 @@ static int aead_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return aead_sendmsg(sock, msg, size);
}
-static ssize_t aead_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = aead_check_key(sock);
- if (err)
- return err;
-
- return af_alg_sendpage(sock, page, offset, size, flags);
-}
-
static int aead_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -459,7 +446,6 @@ static struct proto_ops algif_aead_ops_nokey = {
.release = af_alg_release,
.sendmsg = aead_sendmsg_nokey,
- .sendpage = aead_sendpage_nokey,
.recvmsg = aead_recvmsg_nokey,
.poll = af_alg_poll,
};
diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
index 407408c43730..10c41adac3b1 100644
--- a/crypto/algif_rng.c
+++ b/crypto/algif_rng.c
@@ -174,7 +174,6 @@ static struct proto_ops algif_rng_ops = {
.bind = sock_no_bind,
.accept = sock_no_accept,
.sendmsg = sock_no_sendmsg,
- .sendpage = sock_no_sendpage,
.release = af_alg_release,
.recvmsg = rng_recvmsg,
@@ -192,7 +191,6 @@ static struct proto_ops __maybe_unused algif_rng_test_ops = {
.mmap = sock_no_mmap,
.bind = sock_no_bind,
.accept = sock_no_accept,
- .sendpage = sock_no_sendpage,
.release = af_alg_release,
.recvmsg = rng_test_recvmsg,
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index 021f9ce7e87c..b34e20400e80 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -194,7 +194,6 @@ static struct proto_ops algif_skcipher_ops = {
.release = af_alg_release,
.sendmsg = skcipher_sendmsg,
- .sendpage = af_alg_sendpage,
.recvmsg = skcipher_recvmsg,
.poll = af_alg_poll,
};
@@ -246,18 +245,6 @@ static int skcipher_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return skcipher_sendmsg(sock, msg, size);
}
-static ssize_t skcipher_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = skcipher_check_key(sock);
- if (err)
- return err;
-
- return af_alg_sendpage(sock, page, offset, size, flags);
-}
-
static int skcipher_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -285,7 +272,6 @@ static struct proto_ops algif_skcipher_ops_nokey = {
.release = af_alg_release,
.sendmsg = skcipher_sendmsg_nokey,
- .sendpage = skcipher_sendpage_nokey,
.recvmsg = skcipher_recvmsg_nokey,
.poll = af_alg_poll,
};
diff --git a/include/linux/net.h b/include/linux/net.h
index b73ad8e3c212..e5794968ac9f 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -206,8 +206,6 @@ struct proto_ops {
size_t total_len, int flags);
int (*mmap) (struct file *file, struct socket *sock,
struct vm_area_struct * vma);
- ssize_t (*sendpage) (struct socket *sock, struct page *page,
- int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
int (*set_peek_off)(struct sock *sk, int val);
@@ -220,8 +218,6 @@ struct proto_ops {
sk_read_actor_t recv_actor);
/* This is different from read_sock(), it reads an entire skb at a time. */
int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);
- int (*sendpage_locked)(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
size_t size);
int (*set_rcvlowat)(struct sock *sk, int val);
@@ -339,10 +335,6 @@ int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
int flags);
int kernel_getsockname(struct socket *sock, struct sockaddr *addr);
int kernel_getpeername(struct socket *sock, struct sockaddr *addr);
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
-int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
/* Routine returns the IP overhead imposed by a (caller-protected) socket. */
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index cec453c18f1d..054c3388fa51 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -33,8 +33,6 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags,
bool kern);
int inet_send_prepare(struct sock *sk);
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
int flags);
int inet_shutdown(struct socket *sock, int how);
diff --git a/include/net/sock.h b/include/net/sock.h
index 573f2bf7e0de..4618cd21e16b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1265,8 +1265,6 @@ struct proto {
size_t len);
int (*recvmsg)(struct sock *sk, struct msghdr *msg,
size_t len, int flags, int *addr_len);
- int (*sendpage)(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
int (*bind)(struct sock *sk,
struct sockaddr *addr, int addr_len);
int (*bind_add)(struct sock *sk,
@@ -1906,10 +1904,6 @@ int sock_no_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int);
int sock_no_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma);
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
-ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
/*
* Functions to fill in entries in struct proto_ops when a protocol
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index a06f4d4a6f47..8978fb6212ff 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1929,7 +1929,6 @@ static const struct proto_ops atalk_dgram_ops = {
.sendmsg = atalk_sendmsg,
.recvmsg = atalk_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct notifier_block ddp_notifier = {
diff --git a/net/atm/pvc.c b/net/atm/pvc.c
index 53e7d3f39e26..66d9a9bd5896 100644
--- a/net/atm/pvc.c
+++ b/net/atm/pvc.c
@@ -126,7 +126,6 @@ static const struct proto_ops pvc_proto_ops = {
.sendmsg = vcc_sendmsg,
.recvmsg = vcc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 4a02bcaad279..289240fe234e 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -649,7 +649,6 @@ static const struct proto_ops svc_proto_ops = {
.sendmsg = vcc_sendmsg,
.recvmsg = vcc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index d8da400cb4de..5db805d5f74d 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -2022,7 +2022,6 @@ static const struct proto_ops ax25_proto_ops = {
.sendmsg = ax25_sendmsg,
.recvmsg = ax25_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
/*
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 4eebcc66c19a..9c82698da4f5 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -976,7 +976,6 @@ static const struct proto_ops caif_seqpacket_ops = {
.sendmsg = caif_seqpkt_sendmsg,
.recvmsg = caif_seqpkt_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static const struct proto_ops caif_stream_ops = {
@@ -996,7 +995,6 @@ static const struct proto_ops caif_stream_ops = {
.sendmsg = caif_stream_sendmsg,
.recvmsg = caif_stream_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
/* This function is called when a socket is finally destroyed. */
diff --git a/net/can/bcm.c b/net/can/bcm.c
index 27706f6ace34..65a946a36d92 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1699,7 +1699,6 @@ static const struct proto_ops bcm_ops = {
.sendmsg = bcm_sendmsg,
.recvmsg = bcm_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto bcm_proto __read_mostly = {
diff --git a/net/can/isotp.c b/net/can/isotp.c
index 9bc344851704..0c3d11c29a2b 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -1633,7 +1633,6 @@ static const struct proto_ops isotp_ops = {
.sendmsg = isotp_sendmsg,
.recvmsg = isotp_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto isotp_proto __read_mostly = {
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 7e90f9e61d9b..2bfe4f79bb67 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1301,7 +1301,6 @@ static const struct proto_ops j1939_ops = {
.sendmsg = j1939_sk_sendmsg,
.recvmsg = j1939_sk_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto j1939_proto __read_mostly = {
diff --git a/net/can/raw.c b/net/can/raw.c
index f64469b98260..15c79b079184 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -962,7 +962,6 @@ static const struct proto_ops raw_ops = {
.sendmsg = raw_sendmsg,
.recvmsg = raw_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto raw_proto __read_mostly = {
diff --git a/net/core/sock.c b/net/core/sock.c
index 341c565dbc26..c2ae77bb2075 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3223,36 +3223,6 @@ void __receive_sock(struct file *file)
}
}
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
-{
- ssize_t res;
- struct msghdr msg = {.msg_flags = flags};
- struct kvec iov;
- char *kaddr = kmap(page);
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- res = kernel_sendmsg(sock, &msg, &iov, 1, size);
- kunmap(page);
- return res;
-}
-EXPORT_SYMBOL(sock_no_sendpage);
-
-ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
- int offset, size_t size, int flags)
-{
- ssize_t res;
- struct msghdr msg = {.msg_flags = flags};
- struct kvec iov;
- char *kaddr = kmap(page);
-
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- res = kernel_sendmsg_locked(sk, &msg, &iov, 1, size);
- kunmap(page);
- return res;
-}
-EXPORT_SYMBOL(sock_no_sendpage_locked);
-
/*
* Default Socket Callbacks
*/
@@ -4008,7 +3978,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
seq_printf(seq, "%-9s %4u %6d %6ld %-3s %6u %-3s %-10s "
- "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
+ "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
sock_prot_inuse_get(seq_file_net(seq), proto),
@@ -4029,7 +3999,6 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
proto_method_implemented(proto->getsockopt),
proto_method_implemented(proto->sendmsg),
proto_method_implemented(proto->recvmsg),
- proto_method_implemented(proto->sendpage),
proto_method_implemented(proto->bind),
proto_method_implemented(proto->backlog_rcv),
proto_method_implemented(proto->hash),
@@ -4050,7 +4019,7 @@ static int proto_seq_show(struct seq_file *seq, void *v)
"maxhdr",
"slab",
"module",
- "cl co di ac io in de sh ss gs se re sp bi br ha uh gp em\n");
+ "cl co di ac io in de sh ss gs se re bi br ha uh gp em\n");
else
proto_seq_printf(seq, list_entry(v, struct proto, node));
return 0;
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index b780827f5e0a..ea808de374ea 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -1008,7 +1008,6 @@ static const struct proto_ops inet_dccp_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct inet_protosw dccp_v4_protosw = {
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index b9d7c3dd1cb3..23eb8159e3cd 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1085,7 +1085,6 @@ static const struct proto_ops inet6_dccp_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index 1fa2fe041ec0..1238f036117f 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -426,7 +426,6 @@ static const struct proto_ops ieee802154_raw_ops = {
.sendmsg = ieee802154_sock_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
/* DGRAM Sockets (802.15.4 dataframes) */
@@ -990,7 +989,6 @@ static const struct proto_ops ieee802154_dgram_ops = {
.sendmsg = ieee802154_sock_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static void ieee802154_sock_destruct(struct sock *sk)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8db6747f892f..869b49933f15 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -827,23 +827,6 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(inet_sendmsg);
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- const struct proto *prot;
-
- if (unlikely(inet_send_prepare(sk)))
- return -EAGAIN;
-
- /* IPV6_ADDRFORM can change sk->sk_prot under us. */
- prot = READ_ONCE(sk->sk_prot);
- if (prot->sendpage)
- return prot->sendpage(sk, page, offset, size, flags);
- return sock_no_sendpage(sock, page, offset, size, flags);
-}
-EXPORT_SYMBOL(inet_sendpage);
-
INDIRECT_CALLABLE_DECLARE(int udp_recvmsg(struct sock *, struct msghdr *,
size_t, int, int *));
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
@@ -1046,12 +1029,10 @@ const struct proto_ops inet_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
- .sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
.read_skb = tcp_read_skb,
.sendmsg_locked = tcp_sendmsg_locked,
- .sendpage_locked = tcp_sendpage_locked,
.peek_len = tcp_peek_len,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
@@ -1080,7 +1061,6 @@ const struct proto_ops inet_dgram_ops = {
.read_skb = udp_read_skb,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
@@ -1111,7 +1091,6 @@ static const struct proto_ops inet_sockraw_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
#endif
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f1454e4497df..26fa387f1084 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,42 +971,6 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}
-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES,
- };
-
- if (!(sk->sk_route_caps & NETIF_F_SG))
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
- tcp_rate_check_app_limited(sk); /* is sending application-limited? */
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return tcp_sendmsg_locked(sk, &msg, size);
-}
-EXPORT_SYMBOL_GPL(tcp_sendpage_locked);
-
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- int ret;
-
- lock_sock(sk);
- ret = tcp_sendpage_locked(sk, page, offset, size, flags);
- release_sock(sk);
-
- return ret;
-}
-EXPORT_SYMBOL(tcp_sendpage);
-
void tcp_free_fastopen_req(struct tcp_sock *tp)
{
if (tp->fastopen_req) {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index de37a4372437..ab83cfb9de22 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -482,23 +482,6 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
return copied ? copied : err;
}
-static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES,
- };
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return tcp_bpf_sendmsg(sk, &msg, size);
-}
-
enum {
TCP_BPF_IPV4,
TCP_BPF_IPV6,
@@ -528,7 +511,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
- prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage;
prot[TCP_BPF_RX] = prot[TCP_BPF_BASE];
prot[TCP_BPF_RX].recvmsg = tcp_bpf_recvmsg_parser;
@@ -563,8 +545,7 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops)
* indeed valid assumptions.
*/
return ops->recvmsg == tcp_recvmsg &&
- ops->sendmsg == tcp_sendmsg &&
- ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
+ ops->sendmsg == tcp_sendmsg ? 0 : -ENOTSUPP;
}
int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..5c2e1c1ca329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3112,7 +3112,6 @@ struct proto tcp_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
- .sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
.hash = inet_hash,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 097feb92e215..85bd5960f7ef 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1329,27 +1329,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udp_sendmsg);
-int udp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE
- };
- int ret;
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- lock_sock(sk);
- ret = udp_sendmsg(sk, &msg, size);
- release_sock(sk);
- return ret;
-}
-
#define UDP_SKB_IS_STATELESS 0x80000000
/* all head states (dst, sk, nf conntrack) except skb extensions are
@@ -2926,7 +2905,6 @@ struct proto udp_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
- .sendpage = udp_sendpage,
.release_cb = ip4_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 4ba7a88a1b1d..e1ff3a375996 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -19,8 +19,6 @@ int udp_getsockopt(struct sock *sk, int level, int optname,
int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
int *addr_len);
-int udp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
- int flags);
void udp_destroy_sock(struct sock *sk);
#ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index e0c9cc39b81e..69870f0afc6c 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -54,7 +54,6 @@ struct proto udplite_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
- .sendpage = udp_sendpage,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
.rehash = udp_v4_rehash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 38689bedfce7..769c76d59053 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -695,9 +695,7 @@ const struct proto_ops inet6_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
- .sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
- .sendpage_locked = tcp_sendpage_locked,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
.read_skb = tcp_read_skb,
@@ -728,7 +726,6 @@ const struct proto_ops inet6_dgram_ops = {
.recvmsg = inet6_recvmsg, /* retpoline's sake */
.read_skb = udp_read_skb,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index bac9ba747bde..c6c062678c0e 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1298,7 +1298,6 @@ const struct proto_ops inet6_sockraw_ops = {
.sendmsg = inet_sendmsg, /* ok */
.recvmsg = sock_common_recvmsg, /* ok */
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1bf93b61aa06..03ba1e389901 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2151,7 +2151,6 @@ struct proto tcpv6_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
- .sendpage = tcp_sendpage,
.backlog_rcv = tcp_v6_do_rcv,
.release_cb = tcp_release_cb,
.hash = inet6_hash,
diff --git a/net/key/af_key.c b/net/key/af_key.c
index a815f5ab4c49..bf59d42dc697 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3757,7 +3757,6 @@ static const struct proto_ops pfkey_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
/* Now the operations that really occur. */
.release = pfkey_release,
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4db5a554bdbd..d0dcbe3a4cd7 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -625,7 +625,6 @@ static const struct proto_ops l2tp_ip_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct inet_protosw l2tp_ip_protosw = {
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 2478aa60145f..49296ce14a90 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -751,7 +751,6 @@ static const struct proto_ops l2tp_ip6_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index da7fe94bea2e..addd94da2a81 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -1230,7 +1230,6 @@ static const struct proto_ops llc_ui_ops = {
.sendmsg = llc_ui_sendmsg,
.recvmsg = llc_ui_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static const char llc_proc_err_msg[] __initconst =
diff --git a/net/mctp/af_mctp.c b/net/mctp/af_mctp.c
index 3150f3f0c872..c6fe2e6b85dd 100644
--- a/net/mctp/af_mctp.c
+++ b/net/mctp/af_mctp.c
@@ -485,7 +485,6 @@ static const struct proto_ops mctp_dgram_ops = {
.sendmsg = mctp_sendmsg,
.recvmsg = mctp_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = mctp_compat_ioctl,
#endif
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3ad9c46202fc..ade89b8d0082 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3816,7 +3816,6 @@ static const struct proto_ops mptcp_stream_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
};
static struct inet_protosw mptcp_protosw = {
@@ -3911,7 +3910,6 @@ static const struct proto_ops mptcp_v6_stream_ops = {
.sendmsg = inet6_sendmsg,
.recvmsg = inet6_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c64277659753..f70073a3bb49 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2841,7 +2841,6 @@ static const struct proto_ops netlink_ops = {
.sendmsg = netlink_sendmsg,
.recvmsg = netlink_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static const struct net_proto_family netlink_family_ops = {
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 5a4cb796150f..eb8ccbd58df7 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1364,7 +1364,6 @@ static const struct proto_ops nr_proto_ops = {
.sendmsg = nr_sendmsg,
.recvmsg = nr_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct notifier_block nr_dev_notifier = {
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d4e76e2ae153..385bd4982b80 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4604,7 +4604,6 @@ static const struct proto_ops packet_ops_spkt = {
.sendmsg = packet_sendmsg_spkt,
.recvmsg = packet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static const struct proto_ops packet_ops = {
@@ -4626,7 +4625,6 @@ static const struct proto_ops packet_ops = {
.sendmsg = packet_sendmsg,
.recvmsg = packet_recvmsg,
.mmap = packet_mmap,
- .sendpage = sock_no_sendpage,
};
static const struct net_proto_family packet_family_ops = {
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 71e2caf6ab85..a246f7d0a817 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -441,7 +441,6 @@ const struct proto_ops phonet_dgram_ops = {
.sendmsg = pn_socket_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
const struct proto_ops phonet_stream_ops = {
@@ -462,7 +461,6 @@ const struct proto_ops phonet_stream_ops = {
.sendmsg = pn_socket_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
EXPORT_SYMBOL(phonet_stream_ops);
diff --git a/net/qrtr/af_qrtr.c b/net/qrtr/af_qrtr.c
index 5c2fb992803b..5bb7d680bd5f 100644
--- a/net/qrtr/af_qrtr.c
+++ b/net/qrtr/af_qrtr.c
@@ -1240,7 +1240,6 @@ static const struct proto_ops qrtr_proto_ops = {
.shutdown = sock_no_shutdown,
.release = qrtr_release,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto qrtr_proto = {
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 3ff6995244e5..01c4cdfef45d 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -653,7 +653,6 @@ static const struct proto_ops rds_proto_ops = {
.sendmsg = rds_sendmsg,
.recvmsg = rds_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static void rds_sock_destruct(struct sock *sk)
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index ca2b17f32670..49dafe9ac72f 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1496,7 +1496,6 @@ static const struct proto_ops rose_proto_ops = {
.sendmsg = rose_sendmsg,
.recvmsg = rose_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct notifier_block rose_dev_notifier = {
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 102f5cbff91a..182495804f8f 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -938,7 +938,6 @@ static const struct proto_ops rxrpc_rpc_ops = {
.sendmsg = rxrpc_sendmsg,
.recvmsg = rxrpc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct proto rxrpc_proto = {
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index c365df24ad33..acb2d2a69268 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1135,7 +1135,6 @@ static const struct proto_ops inet_seqpacket_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
/* Registration with AF_INET family. */
diff --git a/net/socket.c b/net/socket.c
index 1b48a976b8cc..130d6ce7f82d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3541,54 +3541,6 @@ int kernel_getpeername(struct socket *sock, struct sockaddr *addr)
}
EXPORT_SYMBOL(kernel_getpeername);
-/**
- * kernel_sendpage - send a &page through a socket (kernel space)
- * @sock: socket
- * @page: page
- * @offset: page offset
- * @size: total size in bytes
- * @flags: flags (MSG_DONTWAIT, ...)
- *
- * Returns the total amount sent in bytes or an error.
- */
-
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags)
-{
- if (sock->ops->sendpage) {
- /* Warn in case the improper page to zero-copy send */
- WARN_ONCE(!sendpage_ok(page), "improper page for zero-copy send");
- return sock->ops->sendpage(sock, page, offset, size, flags);
- }
- return sock_no_sendpage(sock, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage);
-
-/**
- * kernel_sendpage_locked - send a &page through the locked sock (kernel space)
- * @sk: sock
- * @page: page
- * @offset: page offset
- * @size: total size in bytes
- * @flags: flags (MSG_DONTWAIT, ...)
- *
- * Returns the total amount sent in bytes or an error.
- * Caller must hold @sk.
- */
-
-int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct socket *sock = sk->sk_socket;
-
- if (sock->ops->sendpage_locked)
- return sock->ops->sendpage_locked(sk, page, offset, size,
- flags);
-
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage_locked);
-
/**
* kernel_sock_shutdown - shut down part of a full-duplex connection (kernel space)
* @sock: socket
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 37edfe10f8c6..d2072fbf3272 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3375,7 +3375,6 @@ static const struct proto_ops msg_ops = {
.sendmsg = tipc_sendmsg,
.recvmsg = tipc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};
static const struct proto_ops packet_ops = {
@@ -3396,7 +3395,6 @@ static const struct proto_ops packet_ops = {
.sendmsg = tipc_send_packet,
.recvmsg = tipc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};
static const struct proto_ops stream_ops = {
@@ -3417,7 +3415,6 @@ static const struct proto_ops stream_ops = {
.sendmsg = tipc_sendstream,
.recvmsg = tipc_recvstream,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};
static const struct net_proto_family tipc_family_ops = {
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 6f3454db9c53..407f449df564 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -758,8 +758,6 @@ static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
static int unix_shutdown(struct socket *, int);
static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
-static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
- size_t size, int flags);
static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos,
struct pipe_inode_info *, size_t size,
unsigned int flags);
@@ -852,7 +850,6 @@ static const struct proto_ops unix_stream_ops = {
.recvmsg = unix_stream_recvmsg,
.read_skb = unix_stream_read_skb,
.mmap = sock_no_mmap,
- .sendpage = unix_stream_sendpage,
.splice_read = unix_stream_splice_read,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
@@ -878,7 +875,6 @@ static const struct proto_ops unix_dgram_ops = {
.read_skb = unix_read_skb,
.recvmsg = unix_dgram_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
};
@@ -902,7 +898,6 @@ static const struct proto_ops unix_seqpacket_ops = {
.sendmsg = unix_seqpacket_sendmsg,
.recvmsg = unix_seqpacket_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
};
@@ -1839,24 +1834,6 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
}
}
-static int maybe_init_creds(struct scm_cookie *scm,
- struct socket *socket,
- const struct sock *other)
-{
- int err;
- struct msghdr msg = { .msg_controllen = 0 };
-
- err = scm_send(socket, &msg, scm, false);
- if (err)
- return err;
-
- if (unix_passcred_enabled(socket, other)) {
- scm->pid = get_pid(task_tgid(current));
- current_uid_gid(&scm->creds.uid, &scm->creds.gid);
- }
- return err;
-}
-
static bool unix_skb_scm_eq(struct sk_buff *skb,
struct scm_cookie *scm)
{
@@ -2318,122 +2295,6 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
return sent ? : err;
}
-static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
- bool send_sigpipe = false;
- bool init_scm = true;
- struct scm_cookie scm;
- struct sock *other, *sk = socket->sk;
- struct sk_buff *skb, *newskb = NULL, *tail = NULL;
-
- if (flags & MSG_OOB)
- return -EOPNOTSUPP;
-
- other = unix_peer(sk);
- if (!other || sk->sk_state != TCP_ESTABLISHED)
- return -ENOTCONN;
-
- if (false) {
-alloc_skb:
- unix_state_unlock(other);
- mutex_unlock(&unix_sk(other)->iolock);
- newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
- &err, 0);
- if (!newskb)
- goto err;
- }
-
- /* we must acquire iolock as we modify already present
- * skbs in the sk_receive_queue and mess with skb->len
- */
- err = mutex_lock_interruptible(&unix_sk(other)->iolock);
- if (err) {
- err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
- goto err;
- }
-
- if (sk->sk_shutdown & SEND_SHUTDOWN) {
- err = -EPIPE;
- send_sigpipe = true;
- goto err_unlock;
- }
-
- unix_state_lock(other);
-
- if (sock_flag(other, SOCK_DEAD) ||
- other->sk_shutdown & RCV_SHUTDOWN) {
- err = -EPIPE;
- send_sigpipe = true;
- goto err_state_unlock;
- }
-
- if (init_scm) {
- err = maybe_init_creds(&scm, socket, other);
- if (err)
- goto err_state_unlock;
- init_scm = false;
- }
-
- skb = skb_peek_tail(&other->sk_receive_queue);
- if (tail && tail == skb) {
- skb = newskb;
- } else if (!skb || !unix_skb_scm_eq(skb, &scm)) {
- if (newskb) {
- skb = newskb;
- } else {
- tail = skb;
- goto alloc_skb;
- }
- } else if (newskb) {
- /* this is fast path, we don't necessarily need to
- * call to kfree_skb even though with newskb == NULL
- * this - does no harm
- */
- consume_skb(newskb);
- newskb = NULL;
- }
-
- if (skb_append_pagefrags(skb, page, offset, size)) {
- tail = skb;
- goto alloc_skb;
- }
-
- skb->len += size;
- skb->data_len += size;
- skb->truesize += size;
- refcount_add(size, &sk->sk_wmem_alloc);
-
- if (newskb) {
- err = unix_scm_to_skb(&scm, skb, false);
- if (err)
- goto err_state_unlock;
- spin_lock(&other->sk_receive_queue.lock);
- __skb_queue_tail(&other->sk_receive_queue, newskb);
- spin_unlock(&other->sk_receive_queue.lock);
- }
-
- unix_state_unlock(other);
- mutex_unlock(&unix_sk(other)->iolock);
-
- other->sk_data_ready(other);
- scm_destroy(&scm);
- return size;
-
-err_state_unlock:
- unix_state_unlock(other);
-err_unlock:
- mutex_unlock(&unix_sk(other)->iolock);
-err:
- kfree_skb(newskb);
- if (send_sigpipe && !(flags & MSG_NOSIGNAL))
- send_sig(SIGPIPE, current, 0);
- if (!init_scm)
- scm_destroy(&scm);
- return err;
-}
-
static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 19aea7cba26e..d0e476755cdc 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1271,7 +1271,6 @@ static const struct proto_ops vsock_dgram_ops = {
.sendmsg = vsock_dgram_sendmsg,
.recvmsg = vsock_dgram_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
@@ -2186,7 +2185,6 @@ static const struct proto_ops vsock_stream_ops = {
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_rcvlowat = vsock_set_rcvlowat,
};
@@ -2208,7 +2206,6 @@ static const struct proto_ops vsock_seqpacket_ops = {
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static int vsock_create(struct net *net, struct socket *sock,
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5c7ad301d742..0fb5143bec7a 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1757,7 +1757,6 @@ static const struct proto_ops x25_proto_ops = {
.sendmsg = x25_sendmsg,
.recvmsg = x25_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
static struct packet_type x25_packet_type __read_mostly = {
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2ac58b282b5e..eff1f0aaa4b5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1386,7 +1386,6 @@ static const struct proto_ops xsk_proto_ops = {
.sendmsg = xsk_sendmsg,
.recvmsg = xsk_recvmsg,
.mmap = xsk_mmap,
- .sendpage = sock_no_sendpage,
};
static void xsk_destruct(struct sock *sk)
On 16.03.2023 15:26:18, David Howells wrote:
> [!] Note: This is a work in progress. At the moment, some things won't
> build if this patch is applied. nvme, kcm, smc, tls.
>
> Remove ->sendpage() and ->sendpage_locked(). sendmsg() with
> MSG_SPLICE_PAGES should be used instead. This allows multiple pages and
> multipage folios to be passed through.
>
> Signed-off-by: David Howells <[email protected]>
> cc: [email protected]
Acked-by: Marc Kleine-Budde <[email protected]> # for net/can
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
> On Mar 16, 2023, at 11:26, David Howells <[email protected]> wrote:
>
> When transmitting data, call down into TCP using a single sendmsg with
> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
> performing several sendmsg and sendpage calls to transmit header, data
> pages and trailer.
>
> To make this work, the data is assembled in a bio_vec array and attached to
> a BVEC-type iterator. The bio_vec array has two extra slots before the
> first for headers and one after the last for a trailer. The headers and
> trailer are copied into memory acquired from zcopy_alloc() which just
> breaks a page up into small pieces that can be freed with put_page().
>
> Signed-off-by: David Howells <[email protected]>
> cc: Trond Myklebust <[email protected]>
> cc: Anna Schumaker <[email protected]>
> cc: Chuck Lever <[email protected]>
> cc: Jeff Layton <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
> net/sunrpc/svcsock.c | 70 ++++++++++++--------------------------------
> net/sunrpc/xdr.c | 24 ++++++++++++---
> 2 files changed, 38 insertions(+), 56 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 03a4f5615086..1fa41ddbc40e 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -36,6 +36,7 @@
> #include <linux/skbuff.h>
> #include <linux/file.h>
> #include <linux/freezer.h>
> +#include <linux/zcopy_alloc.h>
> #include <net/sock.h>
> #include <net/checksum.h>
> #include <net/ip.h>
> @@ -1060,16 +1061,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> return 0; /* record not complete */
> }
>
> -static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
> - int flags)
> -{
> - return kernel_sendpage(sock, virt_to_page(vec->iov_base),
> - offset_in_page(vec->iov_base),
> - vec->iov_len, flags);
> -}
> -
> /*
> - * kernel_sendpage() is used exclusively to reduce the number of
> + * MSG_SPLICE_PAGES is used exclusively to reduce the number of
> * copy operations in this path. Therefore the caller must ensure
> * that the pages backing @xdr are unchanging.
> *
> @@ -1081,65 +1074,38 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
> {
> const struct kvec *head = xdr->head;
> const struct kvec *tail = xdr->tail;
> - struct kvec rm = {
> - .iov_base = &marker,
> - .iov_len = sizeof(marker),
> - };
> struct msghdr msg = {
> - .msg_flags = 0,
> + .msg_flags = MSG_SPLICE_PAGES,
> };
> - int ret;
> + int ret, n = xdr_buf_pagecount(xdr), size;
>
> *sentp = 0;
> ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
> if (ret < 0)
> return ret;
>
> - ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
> + ret = zcopy_memdup(sizeof(marker), &marker, &xdr->bvec[-2], GFP_KERNEL);
> if (ret < 0)
> return ret;
> - *sentp += ret;
> - if (ret != rm.iov_len)
> - return -EAGAIN;
>
> - ret = svc_tcp_send_kvec(sock, head, 0);
> + ret = zcopy_memdup(head->iov_len, head->iov_base, &xdr->bvec[-1], GFP_KERNEL);
> if (ret < 0)
> return ret;
> - *sentp += ret;
> - if (ret != head->iov_len)
> - goto out;
>
> - if (xdr->page_len) {
> - unsigned int offset, len, remaining;
> - struct bio_vec *bvec;
> -
> - bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
> - offset = offset_in_page(xdr->page_base);
> - remaining = xdr->page_len;
> - while (remaining > 0) {
> - len = min(remaining, bvec->bv_len - offset);
> - ret = kernel_sendpage(sock, bvec->bv_page,
> - bvec->bv_offset + offset,
> - len, 0);
> - if (ret < 0)
> - return ret;
> - *sentp += ret;
> - if (ret != len)
> - goto out;
> - remaining -= len;
> - offset = 0;
> - bvec++;
> - }
> - }
> + ret = zcopy_memdup(tail->iov_len, tail->iov_base, &xdr->bvec[n], GFP_KERNEL);
> + if (ret < 0)
> + return ret;
>
> - if (tail->iov_len) {
> - ret = svc_tcp_send_kvec(sock, tail, 0);
> - if (ret < 0)
> - return ret;
> - *sentp += ret;
> - }
> + size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 2, n + 3, size);
>
> -out:
> + ret = sock_sendmsg(sock, &msg);
> + if (ret < 0)
> + return ret;
> + if (ret > 0)
> + *sentp = ret;
> + if (ret != size)
> + return -EAGAIN;
> return 0;
> }
>
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 36835b2f5446..6dff0b4f17b8 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -145,14 +145,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
> {
> size_t i, n = xdr_buf_pagecount(buf);
>
> - if (n != 0 && buf->bvec == NULL) {
> - buf->bvec = kmalloc_array(n, sizeof(buf->bvec[0]), gfp);
> + if (buf->bvec == NULL) {
> + /* Allow for two headers and a trailer to be attached */
> + buf->bvec = kmalloc_array(n + 3, sizeof(buf->bvec[0]), gfp);
> if (!buf->bvec)
> return -ENOMEM;
> + buf->bvec += 2;
> + buf->bvec[-2].bv_page = NULL;
> + buf->bvec[-1].bv_page = NULL;
NACK.
> for (i = 0; i < n; i++) {
> bvec_set_page(&buf->bvec[i], buf->pages[i], PAGE_SIZE,
> 0);
> }
> + buf->bvec[n].bv_page = NULL;
> }
> return 0;
> }
> @@ -160,8 +165,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
> void
> xdr_free_bvec(struct xdr_buf *buf)
> {
> - kfree(buf->bvec);
> - buf->bvec = NULL;
> + if (buf->bvec) {
> + size_t n = xdr_buf_pagecount(buf);
> +
> + if (buf->bvec[-2].bv_page)
> + put_page(buf->bvec[-2].bv_page);
> + if (buf->bvec[-1].bv_page)
> + put_page(buf->bvec[-1].bv_page);
> + if (buf->bvec[n].bv_page)
> + put_page(buf->bvec[n].bv_page);
> + buf->bvec -= 2;
> + kfree(buf->bvec);
> + buf->bvec = NULL;
> + }
> }
>
> /**
>
Trond Myklebust <[email protected]> wrote:
> > + buf->bvec += 2;
> > + buf->bvec[-2].bv_page = NULL;
> > + buf->bvec[-1].bv_page = NULL;
>
> NACK.
Can you elaborate?
Is it that you dislike allocating extra slots for protocol bits? Or just that
the bvec[] is offset by 2? Or some other reason?
David
Note: this is the first I've seen of this series -- not sure why
I never received any of these patches.
That means I haven't seen the cover letter and do not have any
context for this proposed change.
> On Mar 16, 2023, at 12:17 PM, Trond Myklebust <[email protected]> wrote:
>
>> On Mar 16, 2023, at 11:26, David Howells <[email protected]> wrote:
>>
>> When transmitting data, call down into TCP using a single sendmsg with
>> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
>> performing several sendmsg and sendpage calls to transmit header, data
>> pages and trailer.
We've tried combining the sendpages calls in here before. It
results in a significant and measurable performance regression.
See:
da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")
and its subsequent revert:
4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again")
Therefore, this kind of change needs to be accompanied by both
benchmark results and some field testing to convince me it won't
cause harm.
Also, I'd rather see struct xdr_buf changed to /replace/ the
head/pagevec/tail arrangement with bvecs before we do this
kind of overhaul.
And, we have to make certain that this doesn't break operation
with kTLS sockets... do they support MSG_SPLICE_PAGES ?
--
Chuck Lever
On Thu, Mar 16, 2023 at 03:25:52PM +0000, David Howells wrote:
> If a network protocol sendmsg() sees MSG_SPLICE_PAGES, it expects that the
> iterator is of ITER_BVEC type and that all the pages can have refs taken on
> them with get_page() and discarded with put_page(). Bits of network
> filesystem protocol data, however, are typically contained in slab memory
> for which the cleanup method is kfree(), not put_page(), so this doesn't
> work.
>
> Provide a simple allocator, zcopy_alloc(), that allocates a page at a time
> per-cpu and sequentially breaks off pieces and hands them out with a ref as
> it's asked for them. The caller disposes of the memory it was given by
> calling put_page(). When a page is all parcelled out, it is abandoned by
> the allocator and another page is obtained. The page will get cleaned up
> when the last skbuff fragment is destroyed.
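For reference, the pattern being described amounts to roughly the
following (a sketch only, assuming the zcopy_memdup() signature used in
the sunrpc patch; "marker" stands in for any small protocol blob):

	struct bio_vec bv;
	int ret;

	/* Copy the blob into allocator memory; on success, bv holds
	 * page/offset/len with a ref taken on the page. */
	ret = zcopy_memdup(sizeof(marker), &marker, &bv, GFP_KERNEL);
	if (ret < 0)
		return ret;

	/* bv can now go into the same bvec[] as the data pages; the
	 * skbuff cleanup eventually drops the ref with put_page(). */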
This feels a _lot_ like the page_frag allocator. Can the two be
unified?
Chuck Lever III <[email protected]> wrote:
> That means I haven't seen the cover letter and do not have any
> context for this proposed change.
https://lore.kernel.org/linux-fsdevel/[email protected]/
> We've tried combining the sendpages calls in here before. It
> results in a significant and measurable performance regression.
> See:
>
> da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")
The commit replaced the use of sendpage with sendmsg, but that took away the
zerocopy aspect of sendpage. The idea behind MSG_SPLICE_PAGES is that it
allows you to keep that. I'll have to try reapplying this commit and
adding the MSG_SPLICE_PAGES flag.
> Therefore, this kind of change needs to be accompanied by both
> benchmark results and some field testing to convince me it won't
> cause harm.
Yep.
> And, we have to make certain that this doesn't break operation
> with kTLS sockets... do they support MSG_SPLICE_PAGES ?
I haven't yet tackled AF_TLS, AF_KCM or AF_SMC as they seem significantly more
complex than TCP and UDP. I thought I'd get some feedback on what I have
before I tried my hand at those.
David
> On Mar 16, 2023, at 1:28 PM, David Howells <[email protected]> wrote:
>
> Chuck Lever III <[email protected]> wrote:
>
>> That means I haven't seen the cover letter and do not have any
>> context for this proposed change.
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
>> We've tried combining the sendpages calls in here before. It
>> results in a significant and measurable performance regression.
>> See:
>>
>> da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")
>
> The commit replaced the use of sendpage with sendmsg, but that took away the
> zerocopy aspect of sendpage. The idea behind MSG_SPLICE_PAGES is that it
> allows you to keep that. I'll have to try reapplying this commit and
> adding the MSG_SPLICE_PAGES flag.
Note that, as Trond pointed out, NFSD can handle an NFS READ
request with either a splice actor or by copying through a
vector, depending on what the underlying filesystem can
support and whether we are using a security flavor that
requires stable pages. Grep for RQ_SPLICE_OK.
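In nfsd_read() that choice boils down to roughly this (abridged sketch;
RQ_SPLICE_OK and the two helpers are real, the surrounding variables
come from that function):

	if (file->f_op->splice_read &&
	    test_bit(RQ_SPLICE_OK, &rqstp->rq_flags))
		err = nfsd_splice_read(rqstp, fhp, file, offset, count, eof);
	else
		err = nfsd_readv(rqstp, fhp, file, offset, rqstp->rq_vec,
				 v, count, eof);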
Eventually we want to make use of iomaps to ensure that
reading areas of a file that are not allocated on disk
does not trigger an extent allocation. Anna is working on
that, but I have no idea what it will look like. We can
talk more at LSF, if you'll both be around.
Also... I find I have to put back the use of MSG_MORE and
friends in here, otherwise kTLS will split each of these
kernel_sendsomething() calls into its own TLS record. This
code is likely going to look different after support for
RPC-with-TLS goes in.
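Roughly, the pattern I mean is to keep MSG_MORE set on every send but
the last so the ULP coalesces them into one record (a sketch, not the
actual svcsock code; "last" is a stand-in):

	msg.msg_flags = MSG_SPLICE_PAGES;
	if (!last)
		msg.msg_flags |= MSG_MORE;	/* let kTLS coalesce */
	ret = sock_sendmsg(sock, &msg);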
>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
>
> Yep.
>
>> And, we have to make certain that this doesn't break operation
>> with kTLS sockets... do they support MSG_SPLICE_PAGES ?
>
> I haven't yet tackled AF_TLS, AF_KCM or AF_SMC as they seem significantly more
> complex than TCP and UDP. I thought I'd get some feedback on what I have
> before I tried my hand at those.
OK, I didn't mean AF_TLS, I meant the stuff under net/tls,
which is AF_INET[6] and TCP, but with a ULP in place. It's
got its own sendpage and sendmsg methods that choke when
an unrecognized MSG_ flag is present.
But OK, you're just asking for feedback, so I'll put my red
pencil down.
--
Chuck Lever
Matthew Wilcox <[email protected]> wrote:
> This feels a _lot_ like the page_frag allocator. Can the two be
> unified?
Looks kind of similar. I might well be able to use that instead.
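Something like this, perhaps (sketch only -- the helper name is
invented, but page_frag_alloc() is the existing API Willy means):

	static int blob_to_bvec(struct page_frag_cache *nc, const void *blob,
				size_t len, struct bio_vec *bv)
	{
		void *p = page_frag_alloc(nc, len, GFP_KERNEL);

		if (!p)
			return -ENOMEM;
		memcpy(p, blob, len);
		bvec_set_page(bv, virt_to_page(p), len, offset_in_page(p));
		return 0;
	}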
David
Trond Myklebust <[email protected]> wrote:
> 1) This is code that is common to the client and the server. Why are we
> adding 3 unused bvec slots to every client RPC call?
Fair point, but I'm trying to avoid making four+ sendmsg calls in nfsd rather
than one.
> 2) It obfuscates the existence of these bvec slots.
True, it'd be nice to find a better way to do it. Question is, can the client
make use of MSG_SPLICE_PAGES also?
> 3) knfsd may use splice_direct_to_actor() in order to avoid copying the page
> cache data into private buffers (it just takes a reference to the
> pages). Using MSG_SPLICE_PAGES will presumably require it to protect those
> pages against further writes while the socket is referencing them.
Upstream sunrpc is using sendpage with TCP. It already has that issue.
MSG_SPLICE_PAGES is a way of doing sendpage through sendmsg.
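The translation is mechanical -- it's the same shape as the wrappers
removed earlier in the series:

	struct bio_vec bvec;
	struct msghdr msg = {
		.msg_flags = flags | MSG_SPLICE_PAGES,
	};

	bvec_set_page(&bvec, page, size, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
	return sock_sendmsg(sock, &msg);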
David
David Howells wrote:
> Make TCP's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
> spliced from the source iterator if possible (the iterator must be
> ITER_BVEC and the pages must be spliceable).
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> ---
> net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 288693981b00..77c0c69208a5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> int flags, err, copied = 0;
> int mss_now = 0, size_goal, copied_syn = 0;
> int process_backlog = 0;
> - bool zc = false;
> + int zc = 0;
> long timeo;
>
> flags = msg->msg_flags;
> @@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> if (msg->msg_ubuf) {
> uarg = msg->msg_ubuf;
> net_zcopy_get(uarg);
> - zc = sk->sk_route_caps & NETIF_F_SG;
> + if (sk->sk_route_caps & NETIF_F_SG)
> + zc = 1;
> } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> if (!uarg) {
> err = -ENOBUFS;
> goto out_err;
> }
> - zc = sk->sk_route_caps & NETIF_F_SG;
> - if (!zc)
> + if (sk->sk_route_caps & NETIF_F_SG)
> + zc = 1;
> + else
> uarg_to_msgzc(uarg)->zerocopy = 0;
> }
> + } else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
> + if (!iov_iter_is_bvec(&msg->msg_iter))
> + return -EINVAL;
> + if (sk->sk_route_caps & NETIF_F_SG)
> + zc = 2;
> }
The commit message mentions MSG_SPLICE_PAGES as an internal flag.
It can be passed from userspace. The code anticipates that and checks
preconditions.
A side effect is that legacy applications that may already be setting
this bit in their flags would now start failing. Most socket types are
historically permissive and simply ignore undefined flags.
With MSG_ZEROCOPY we chose to be extra cautious and added
SOCK_ZEROCOPY, only testing the MSG_ZEROCOPY bit if this socket option
is explicitly enabled. Perhaps more cautious than necessary, but FYI.
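For illustration, the SOCK_ZEROCOPY-style approach applied here would
look something like this (SOCK_SPLICE_PAGES is invented for the sketch;
it does not exist):

	static bool may_splice_pages(const struct sock *sk,
				     const struct msghdr *msg)
	{
		if (!(msg->msg_flags & MSG_SPLICE_PAGES))
			return true;
		return sock_flag(sk, SOCK_SPLICE_PAGES);  /* hypothetical */
	}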
Willem de Bruijn <[email protected]> wrote:
> The commit message mentions MSG_SPLICE_PAGES as an internal flag.
>
> It can be passed from userspace. The code anticipates that and checks
> preconditions.
Should I add a separate field in the in-kernel msghdr struct for such internal
flags? That would also avoid putting an internal flag in the same space as
the uapi flags.
David
David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> >
> > It can be passed from userspace. The code anticipates that and checks
> > preconditions.
>
> Should I add a separate field in the in-kernel msghdr struct for such internal
> flags? That would also avoid putting an internal flag in the same space as
> the uapi flags.
That would work, if there is no cost to common paths that don't need it.
A not very pretty alternative would be to add an extra arg to each
sendmsg handler that is used only when called from sendpage.
There are a few other internal MSG_.. flags, such as
MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
in sendmsg, I think. Which would explain why it was clearly safe to
add them.
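For reference, the current crop in include/linux/socket.h is, if I'm
reading it right:

        #define MSG_SENDPAGE_NOPOLICY 0x10000   /* sendpage() internal : do not apply policy */
        #define MSG_SENDPAGE_NOTLAST 0x20000    /* sendpage() internal : not the last page */
        #define MSG_SENDPAGE_DECRYPTED 0x100000 /* sendpage() internal : page may carry plain text */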
> On Mar 16, 2023, at 14:06, David Howells <[email protected]> wrote:
>
> Trond Myklebust <[email protected]> wrote:
>
>> 1) This is code that is common to the client and the server. Why are we
>> adding 3 unused bvec slots to every client RPC call?
>
> Fair point, but I'm trying to avoid making four+ sendmsg calls in nfsd rather
> than one.
Add an enum iter_type for ITER_ITER ? :-)
Otherwise, please just split these functions into one for knfsd and a separate one for the client.
>
>> 2) It obfuscates the existence of these bvec slots.
>
> True, it'd be nice to find a better way to do it. Question is, can the client
> make use of MSG_SPLICE_PAGES also?
The requirement for O_DIRECT support means we get the stable write issues with added extra spicy sauce.
>
>> 3) knfsd may use splice_direct_to_actor() in order to avoid copying the page
>> cache data into private buffers (it just takes a reference to the
>> pages). Using MSG_SPLICE_PAGES will presumably require it to protect those
>> pages against further writes while the socket is referencing them.
>
> Upstream sunrpc is using sendpage with TCP. It already has that issue.
> MSG_SPLICE_PAGES is a way of doing sendpage through sendmsg.
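> Roughly speaking, what kernel_sendpage(sock, page, off, len, flags) does
> today becomes a sendmsg() with a one-element ITER_BVEC iterator plus the
> new flag (a sketch only, not verbatim from any of the patches):
>
>	struct bio_vec bv = { .bv_page = page, .bv_offset = off, .bv_len = len };
>	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };
>
>	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, len);
>	ret = sock_sendmsg(sock, &msg);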
Fair enough. I do seem to remember a schism with the knfsd developers over that issue.
_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
Chuck Lever III <[email protected]> wrote:
> Therefore, this kind of change needs to be accompanied by both
> benchmark results and some field testing to convince me it won't
> cause harm.
Btw, what do you use to benchmark NFS performance?
David
David Howells <[email protected]> wrote:
> Remove hash_sendpage*() and use hash_sendmsg() as the latter seems to just
> use the source pages directly anyway.
...
> - if (!(flags & MSG_MORE)) {
> - if (ctx->more)
> - err = crypto_ahash_finup(&ctx->req);
> - else
> - err = crypto_ahash_digest(&ctx->req);
You've just removed the optimised path from user-space to
finup/digest. You need to add them back to sendmsg if you
want to eliminate sendpage.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> On Mar 16, 2023, at 5:21 PM, David Howells <[email protected]> wrote:
>
> Chuck Lever III <[email protected]> wrote:
>
>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
>
> Btw, what do you use to benchmark NFS performance?
It depends on what I'm trying to observe. I have only a small
handful of systems in my lab, which is why I was not able to
immediately detect the effects of the zero-copy change there.
Daire has a large client cohort on a fast network, so is
able to see the impact of that kind of change quite readily.
A perhaps more interesting question is what kind of tooling
I would use to measure the performance of the proposed change.
The bottom line is whether or not applications on clients can
see a change. NFS client implementations can hide server and
network latency improvements from applications, and RPC-on-TCP
adds palpable latency of its own that reduces the efficacy of
server performance optimizations.
For that I might use a multi-threaded fio workload with fixed
record sizes (2KB, 8KB, 128KB, 1MB) and then look at the
throughput numbers and latency distribution for each size.
In a single-thread qd=1 test, iozone can show changes in
READ latency pretty clearly, though most folks believe qd=1
tests are junk.
I generally run such tests on 100GbE with a tmpfs or NVMe
export to take filesystem latencies out of the equation,
although that might matter more for WRITE latency if you
can keep your READ workload completely in server memory.
To measure server-side behavior without the effects of the
network or client, NFSD has a built-in trace point,
nfsd:svc_stats_latency, that records the latency in
microseconds of each RPC. Run the above workloads and
record this tracepoint (perhaps with a capture filter to
record only the latency of READ operations).
Then you can post-process the raw latencies to get an average
latency and deviation, or even look at latency distribution
to see if the shape of the outlier curve has changed. I use
awk for this.
[ Sidebar: you can use this tracepoint to track latency
outliers too, but that's another topic. ]
Second, I might try a flame graph study to measure changes in
instruction path length, and also capture an average
cycles-per-byte-read value. Looking at CPU cache misses can often be
a rathole, but flame graphs can surface changes there too.
And lastly, you might want to visit lock_stats to see if
there is any significant change in lock contention. An
unexpected increase in lock contention can often wipe out
gains made in other areas.
My guess is that for the RQ_SPLICE_OK case, the difference
would amount to the elimination of the kernel_sendpage
calls, which are indirect, but not terribly expensive.
Those calls amount to a significant cost only on large I/O.
It might not amount to much relative to the other costs
in the READ path.
So the real purpose here would have to be refactoring to
use bvecs instead of the bespoke xdr_buf structure, and I
would like to see support for bvecs in all of our transports
(looking at you, RDMA) to make this truly worthwhile. I had
started this a while back, but lack of a bvec-based RDMA API
made it less interesting to me. It isn't clear to me yet
whether bvecs or folios should be the replacement for
xdr_buf's head/pages/tail, but I'm a paid-in-full member of
the uneducated rabble.
This might sound like a lot of pushback, but actually I am
open to discussing clean-ups in this area, including the
one you proposed. Just getting a little more careful about
this kind of change as time goes on. And it sounds like you
were already aware of the most recent previous attempt at
this kind of improvement.
--
Chuck Lever
Willem de Bruijn <[email protected]> wrote:
> David Howells wrote:
> > Willem de Bruijn <[email protected]> wrote:
> >
> > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > >
> > > It can be passed from userspace. The code anticipates that and checks
> > > preconditions.
> >
> > Should I add a separate field in the in-kernel msghdr struct for such internal
> > flags? That would also avoid putting an internal flag in the same space as
> > the uapi flags.
>
> That would work, if no cost to common paths that don't need it.
Actually, it might be tricky. __ip_append_data() doesn't take a msghdr struct
pointer per se. The "void *from" argument *might* point to one - but it
depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
know.
Possibly this changes if sendpage goes away.
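To illustrate, ip_generic_getfrag() currently just assumes that "from" is
a msghdr (see the patch below):

        int
        ip_generic_getfrag(void *from, char *to, int offset, int len, int odd,
                           struct sk_buff *skb)
        {
                struct msghdr *msg = from;      /* only valid for some callers */
                ...
        }

whereas for icmp_glue_bits() and ip_reply_glue_bits(), "from" points to
something else entirely.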
> A not very pretty alternative would be to add an extra arg to each
> sendmsg handler that is used only when called from sendpage.
>
> There are a few other internal MSG_.. flags, such as
> MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> in sendmsg, I think. Which would explain why it was clearly safe to
> add them.
Should those be moved across to the internal flags with MSG_SPLICE_PAGES?
David
David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > David Howells wrote:
> > > Willem de Bruijn <[email protected]> wrote:
> > >
> > > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > >
> > > > It can be passed from userspace. The code anticipates that and checks
> > > > preconditions.
> > >
> > > Should I add a separate field in the in-kernel msghdr struct for such internal
> > > flags? That would also avoid putting an internal flag in the same space as
> > > the uapi flags.
> >
> > That would work, if no cost to common paths that don't need it.
>
> Actually, it might be tricky. __ip_append_data() doesn't take a msghdr struct
> pointer per se. The "void *from" argument *might* point to one - but it
> depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
> know.
>
> Possibly this changes if sendpage goes away.
Is it sufficient to mask out this bit in tcp_sendmsg_locked and
udp_sendmsg if passed from userspace (where it should be ignored), and pass
it through flags to callees like ip_append_data?
>
> > A not very pretty alternative would be to add an extra arg to each
> > sendmsg handler that is used only when called from sendpage.
> >
> > There are a few other internal MSG_.. flags, such as
> > MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> > in sendmsg, I think. Which would explain why it was clearly safe to
> > add them.
>
> Should those be moved across to the internal flags with MSG_SPLICE_PAGES?
I would not include that in this patch series.
Hi Willem,
Here's another option for passing MSG_SPLICE_PAGES into sendmsg()[1] without
polluting the flags in msg->msg_flags. The idea here is to put the flag
into a new field in msghdr, msg_kflags, that holds internal kernel flags
that aren't available to userspace.
What I've done here is:
(1) Pass msg down to __ip_append_data() and __ip6_append_data() so that
they can access the extra flags.
(2) In order to avoid adding extra arguments to these functions and the
functions in their call chains (such as ip_make_skb()), remove the
size and flags arguments as these values are redundant if msg is
passed in.
(3) msg is then passed into getfrag(). I would like to get rid of the
"from" argument also in favour of using something in msghdr, but I'm
not sure how best to do that.
(4) The size parameter to ->sendmsg() seems to be redundant; indeed
sock_sendmsg() doesn't actually take it, but rather gets the count
from msg_iter - so remove this parameter.
kernel_sendmsg() will still take a size, but it sets it on the
iterator and then calls sock_sendmsg().
(5) Protocol sendmsg implementations then extract the length and the flags
from the iterator.
(6) Illustrate the addition of msg_kflags and MSG_SPLICE_PAGES. I think
that, at some point in the future, some of the other flags could be
moved from msg_flags to msg_kflags.
David
Link: https://lore.kernel.org/r/[email protected]/ [1]
David Howells (3):
net: Drop the size argument from ->sendmsg()
ip: Make __ip{,6}_append_data() and co. take a msghdr*
net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
crypto/af_alg.c | 12 +--
crypto/algif_aead.c | 9 +--
crypto/algif_hash.c | 8 +-
crypto/algif_rng.c | 3 +-
crypto/algif_skcipher.c | 10 +--
drivers/isdn/mISDN/socket.c | 3 +-
.../chelsio/inline_crypto/chtls/chtls.h | 2 +-
.../chelsio/inline_crypto/chtls/chtls_io.c | 15 ++--
drivers/net/ppp/pppoe.c | 4 +-
drivers/net/tap.c | 3 +-
drivers/net/tun.c | 3 +-
drivers/vhost/net.c | 6 +-
drivers/xen/pvcalls-back.c | 2 +-
drivers/xen/pvcalls-front.c | 4 +-
drivers/xen/pvcalls-front.h | 3 +-
fs/afs/rxrpc.c | 8 +-
include/crypto/if_alg.h | 3 +-
include/linux/lsm_hook_defs.h | 3 +-
include/linux/lsm_hooks.h | 1 -
include/linux/net.h | 6 +-
include/linux/security.h | 4 +-
include/linux/socket.h | 3 +
include/net/af_rxrpc.h | 3 +-
include/net/inet_common.h | 2 +-
include/net/ip.h | 24 +++---
include/net/ipv6.h | 22 +++---
include/net/ping.h | 7 +-
include/net/sock.h | 7 +-
include/net/tcp.h | 8 +-
include/net/udp.h | 2 +-
include/net/udplite.h | 4 +-
net/appletalk/ddp.c | 3 +-
net/atm/common.c | 3 +-
net/atm/common.h | 2 +-
net/ax25/af_ax25.c | 4 +-
net/bluetooth/hci_sock.c | 4 +-
net/bluetooth/iso.c | 4 +-
net/bluetooth/l2cap_sock.c | 5 +-
net/bluetooth/rfcomm/sock.c | 7 +-
net/bluetooth/sco.c | 4 +-
net/caif/caif_socket.c | 13 ++--
net/can/bcm.c | 3 +-
net/can/isotp.c | 3 +-
net/can/j1939/socket.c | 4 +-
net/can/raw.c | 3 +-
net/core/sock.c | 4 +-
net/dccp/dccp.h | 2 +-
net/dccp/proto.c | 3 +-
net/ieee802154/socket.c | 11 +--
net/ipv4/af_inet.c | 4 +-
net/ipv4/icmp.c | 14 ++--
net/ipv4/ip_output.c | 73 ++++++++++---------
net/ipv4/ping.c | 18 ++---
net/ipv4/raw.c | 23 +++---
net/ipv4/tcp.c | 17 +++--
net/ipv4/tcp_bpf.c | 5 +-
net/ipv4/tcp_input.c | 3 +-
net/ipv4/udp.c | 24 +++---
net/ipv6/af_inet6.c | 7 +-
net/ipv6/icmp.c | 21 ++++--
net/ipv6/ip6_output.c | 57 +++++++--------
net/ipv6/ping.c | 12 +--
net/ipv6/raw.c | 25 +++----
net/ipv6/udp.c | 26 ++++---
net/ipv6/udp_impl.h | 2 +-
net/iucv/af_iucv.c | 4 +-
net/kcm/kcmsock.c | 2 +-
net/key/af_key.c | 3 +-
net/l2tp/l2tp_ip.c | 3 +-
net/l2tp/l2tp_ip6.c | 3 +-
net/l2tp/l2tp_ppp.c | 4 +-
net/llc/af_llc.c | 5 +-
net/mctp/af_mctp.c | 3 +-
net/mptcp/protocol.c | 8 +-
net/netlink/af_netlink.c | 11 +--
net/netrom/af_netrom.c | 3 +-
net/nfc/llcp_sock.c | 7 +-
net/nfc/rawsock.c | 3 +-
net/packet/af_packet.c | 11 +--
net/phonet/datagram.c | 3 +-
net/phonet/pep.c | 3 +-
net/phonet/socket.c | 5 +-
net/qrtr/af_qrtr.c | 4 +-
net/rds/rds.h | 2 +-
net/rds/send.c | 3 +-
net/rose/af_rose.c | 3 +-
net/rxrpc/af_rxrpc.c | 6 +-
net/rxrpc/ar-internal.h | 2 +-
net/rxrpc/output.c | 22 +++---
net/rxrpc/rxperf.c | 4 +-
net/rxrpc/sendmsg.c | 15 ++--
net/sctp/socket.c | 3 +-
net/smc/af_smc.c | 5 +-
net/socket.c | 16 ++--
net/tipc/socket.c | 34 ++++-----
net/tls/tls.h | 4 +-
net/tls/tls_device.c | 5 +-
net/tls/tls_sw.c | 2 +-
net/unix/af_unix.c | 19 +++--
net/vmw_vsock/af_vsock.c | 16 ++--
net/x25/af_x25.c | 3 +-
net/xdp/xsk.c | 6 +-
net/xfrm/espintcp.c | 8 +-
security/apparmor/lsm.c | 6 +-
security/security.c | 4 +-
security/selinux/hooks.c | 3 +-
security/smack/smack_lsm.c | 4 +-
security/tomoyo/common.h | 3 +-
security/tomoyo/network.c | 4 +-
security/tomoyo/tomoyo.c | 6 +-
110 files changed, 444 insertions(+), 456 deletions(-)
Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can. This is set in msg->msg_kflags,
not msg->msg_flags, thereby isolating it from the UAPI.
This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/socket.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..229f54484d3c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -72,6 +72,7 @@ struct msghdr {
bool msg_control_is_user : 1;
bool msg_get_inq : 1;/* return INQ after receive */
unsigned int msg_flags; /* flags on received message */
+ unsigned int msg_kflags; /* Kernel internal flags */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
struct ubuf_info *msg_ubuf;
@@ -337,6 +338,8 @@ struct ucred {
#define MSG_CMSG_COMPAT 0 /* We never have 32 bit fixups */
#endif
+/* Flags for msghdr::msg_kflags (all internal to the kernel) */
+#define MSG_SPLICE_PAGES 0x00000001 /* Splice the pages from the iterator in sendmsg() */
/* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
#define SOL_IP 0
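As an illustration of the intended calling convention (not part of the
patch; the page/offset/len here are whatever the caller has to hand), an
in-kernel user would then do something like:

        struct bio_vec bv = {
                .bv_page        = page,
                .bv_offset      = offset,
                .bv_len         = len,
        };
        struct msghdr msg = {
                .msg_flags      = MSG_MORE,             /* UAPI flags stay here */
                .msg_kflags     = MSG_SPLICE_PAGES,     /* internal flags here */
        };
        int ret;

        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, len);
        ret = sock_sendmsg(sock, &msg);

The UAPI and internal flag spaces then can't collide, and the protocol
takes its own references on the pages rather than copying the data.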
In order to pass an extra internal flag to indicate that sendmsg() should
splice pages rather than copying them, pass a struct msghdr pointer into
various paths that lead to __ip_append_data() and __ip6_append_data() and
thence into getfrag(). The flag can then be stashed in the msghdr struct
in a new field to avoid polluting the msg_flags field with non-UAPI flags.
Passing msghdr around like this allows the length and flags arguments to
__ip*_append_data() to be eliminated (the values can be obtained from the
msghdr and its iterator). Unfortunately, the "from" parameter can't be so
easily eliminated as it's used in particular by the ICMP routines.
The getfrag function pointer is formalised as ip_getfrag_t by typedef.
This requires the following additional changes:
(1) __ip_append_data() and __ip6_append_data() add transhdrlen onto the
data length inside the functions rather than it being included in
msg_data_left().
(2) A few places, such as icmp_glue_bits(), have to create a msghdr they
didn't need before in order to pass in flags and length. They also
need to cheat a bit and stash the length in msg->msg_iter.count - even
though they don't actually use the iterator.
(3) udp_sendmsg() OR's MSG_MORE into msg->msg_flags if the corkflag is
set. Separate flags don't then need to be passed in to
ip_append_data(). Ditto udpv6_sendmsg().
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: David Ahern <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/ip.h | 24 ++++++--------
include/net/ipv6.h | 20 ++++++------
include/net/ping.h | 5 ++-
include/net/udplite.h | 4 +--
net/ipv4/icmp.c | 14 +++++----
net/ipv4/ip_output.c | 73 ++++++++++++++++++++++---------------------
net/ipv4/ping.c | 10 +++---
net/ipv4/raw.c | 20 ++++++------
net/ipv4/udp.c | 19 ++++++-----
net/ipv6/icmp.c | 21 ++++++++-----
net/ipv6/ip6_output.c | 57 +++++++++++++++------------------
net/ipv6/ping.c | 7 ++---
net/ipv6/raw.c | 22 ++++++-------
net/ipv6/udp.c | 19 ++++++-----
14 files changed, 155 insertions(+), 160 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..152553bd9ad4 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -211,15 +211,13 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb);
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
__u8 tos);
void ip_init(void);
-int ip_append_data(struct sock *sk, struct flowi4 *fl4,
- int getfrag(void *from, char *to, int offset, int len,
- int odd, struct sk_buff *skb),
- void *from, int len, int protolen,
- struct ipcm_cookie *ipc,
- struct rtable **rt,
- unsigned int flags);
-int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd,
- struct sk_buff *skb);
+typedef int (*ip_getfrag_t)(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb);
+int ip_append_data(struct sock *sk, struct flowi4 *fl4, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int protolen,
+ struct ipcm_cookie *ipc, struct rtable **rt);
+int ip_generic_getfrag(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb);
ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
int offset, size_t size, int flags);
struct sk_buff *__ip_make_skb(struct sock *sk, struct flowi4 *fl4,
@@ -228,12 +226,10 @@ struct sk_buff *__ip_make_skb(struct sock *sk, struct flowi4 *fl4,
int ip_send_skb(struct net *net, struct sk_buff *skb);
int ip_push_pending_frames(struct sock *sk, struct flowi4 *fl4);
void ip_flush_pending_frames(struct sock *sk);
-struct sk_buff *ip_make_skb(struct sock *sk, struct flowi4 *fl4,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, int length, int transhdrlen,
+struct sk_buff *ip_make_skb(struct sock *sk, struct flowi4 *fl4, struct msghdr *msg,
+ ip_getfrag_t getfrag, int transhdrlen,
struct ipcm_cookie *ipc, struct rtable **rtp,
- struct inet_cork *cork, unsigned int flags);
+ struct inet_cork *cork);
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f2132311e92b..bec2ecf31076 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1094,12 +1094,13 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
int ip6_find_1stfragopt(struct sk_buff *skb, u8 **nexthdr);
-int ip6_append_data(struct sock *sk,
- int getfrag(void *from, char *to, int offset, int len,
- int odd, struct sk_buff *skb),
- void *from, size_t length, int transhdrlen,
+typedef int (*ip_getfrag_t)(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb);
+
+int ip6_append_data(struct sock *sk, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int transhdrlen,
struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
- struct rt6_info *rt, unsigned int flags);
+ struct rt6_info *rt);
int ip6_push_pending_frames(struct sock *sk);
@@ -1110,12 +1111,9 @@ int ip6_send_skb(struct sk_buff *skb);
struct sk_buff *__ip6_make_skb(struct sock *sk, struct sk_buff_head *queue,
struct inet_cork_full *cork,
struct inet6_cork *v6_cork);
-struct sk_buff *ip6_make_skb(struct sock *sk,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, size_t length, int transhdrlen,
- struct ipcm6_cookie *ipc6,
- struct rt6_info *rt, unsigned int flags,
+struct sk_buff *ip6_make_skb(struct sock *sk, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int transhdrlen,
+ struct ipcm6_cookie *ipc6, struct rt6_info *rt,
struct inet_cork_full *cork);
static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
diff --git a/include/net/ping.h b/include/net/ping.h
index 04814edde8e3..cfa7cbeb5ebc 100644
--- a/include/net/ping.h
+++ b/include/net/ping.h
@@ -52,7 +52,6 @@ extern struct pingv6_ops pingv6_ops;
struct pingfakehdr {
struct icmphdr icmph;
- struct msghdr *msg;
sa_family_t family;
__wsum wcheck;
};
@@ -65,8 +64,8 @@ int ping_init_sock(struct sock *sk);
void ping_close(struct sock *sk, long timeout);
int ping_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len);
void ping_err(struct sk_buff *skb, int offset, u32 info);
-int ping_getfrag(void *from, char *to, int offset, int fraglen, int odd,
- struct sk_buff *);
+int ping_getfrag(struct msghdr *msg, void *from, char *to,
+ int offset, int fraglen, int odd, struct sk_buff *skb);
int ping_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int flags, int *addr_len);
diff --git a/include/net/udplite.h b/include/net/udplite.h
index 299c14ce2bb9..13ffb096154f 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -18,10 +18,10 @@ extern struct udp_table udplite_table;
/*
* Checksum computation is all in software, hence simpler getfrag.
*/
-static __inline__ int udplite_getfrag(void *from, char *to, int offset,
+static __inline__ int udplite_getfrag(struct msghdr *msg,
+ void *from, char *to, int offset,
int len, int odd, struct sk_buff *skb)
{
- struct msghdr *msg = from;
return copy_from_iter_full(to, len, &msg->msg_iter) ? 0 : -EFAULT;
}
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8cebb476b3ab..5496cd50285a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -344,8 +344,8 @@ void icmp_out_count(struct net *net, unsigned char type)
* Checksum each fragment, and on the first include the headers and final
* checksum.
*/
-static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd,
- struct sk_buff *skb)
+static int icmp_glue_bits(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb)
{
struct icmp_bxm *icmp_param = from;
__wsum csum;
@@ -366,11 +366,13 @@ static void icmp_push_reply(struct sock *sk,
struct ipcm_cookie *ipc, struct rtable **rt)
{
struct sk_buff *skb;
+ struct msghdr msg = {
+ .msg_flags = MSG_DONTWAIT,
+ .msg_iter.count = icmp_param->data_len,
+ };
- if (ip_append_data(sk, fl4, icmp_glue_bits, icmp_param,
- icmp_param->data_len+icmp_param->head_len,
- icmp_param->head_len,
- ipc, rt, MSG_DONTWAIT) < 0) {
+ if (ip_append_data(sk, fl4, &msg, icmp_glue_bits, icmp_param,
+ icmp_param->head_len, ipc, rt) < 0) {
__ICMP_INC_STATS(sock_net(sk), ICMP_MIB_OUTERRORS);
ip_flush_pending_frames(sk);
} else if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cb04dbad9ea4..46ab2ea25764 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -929,10 +929,9 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
EXPORT_SYMBOL(ip_do_fragment);
int
-ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb)
+ip_generic_getfrag(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb)
{
- struct msghdr *msg = from;
-
if (skb->ip_summed == CHECKSUM_PARTIAL) {
if (!copy_from_iter_full(to, len, &msg->msg_iter))
return -EFAULT;
@@ -959,13 +958,12 @@ csum_page(struct page *page, int offset, int copy)
static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
+ struct msghdr *msg,
struct sk_buff_head *queue,
struct inet_cork *cork,
struct page_frag *pfrag,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, int length, int transhdrlen,
- unsigned int flags)
+ ip_getfrag_t getfrag,
+ void *from, int transhdrlen)
{
struct inet_sock *inet = inet_sk(sk);
struct ubuf_info *uarg = NULL;
@@ -978,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
int err;
int offset = 0;
bool zc = false;
+ unsigned int length = msg_data_left(msg) + transhdrlen;
unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
@@ -1014,11 +1013,11 @@ static int __ip_append_data(struct sock *sk,
if (transhdrlen &&
length + fragheaderlen <= mtu &&
rt->dst.dev->features & (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM) &&
- (!(flags & MSG_MORE) || cork->gso_size) &&
+ (!(msg->msg_flags & MSG_MORE) || cork->gso_size) &&
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
- if ((flags & MSG_ZEROCOPY) && length) {
+ if ((msg->msg_flags & MSG_ZEROCOPY) && length) {
struct msghdr *msg = from;
if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1103,7 +1102,7 @@ static int __ip_append_data(struct sock *sk,
if (datalen == length + fraggap)
alloc_extra += rt->dst.trailer_len;
- if ((flags & MSG_MORE) &&
+ if ((msg->msg_flags & MSG_MORE) &&
!(rt->dst.dev->features&NETIF_F_SG))
alloclen = mtu;
else if (!paged &&
@@ -1119,7 +1118,7 @@ static int __ip_append_data(struct sock *sk,
if (transhdrlen) {
skb = sock_alloc_send_skb(sk, alloclen,
- (flags & MSG_DONTWAIT), &err);
+ (msg->msg_flags & MSG_DONTWAIT), &err);
} else {
skb = NULL;
if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
@@ -1159,7 +1158,8 @@ static int __ip_append_data(struct sock *sk,
}
copy = datalen - transhdrlen - fraggap - pagedlen;
- if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
+ if (copy > 0 && getfrag(msg, from, data + transhdrlen,
+ offset, copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
goto error;
@@ -1178,7 +1178,7 @@ static int __ip_append_data(struct sock *sk,
tskey = 0;
skb_zcopy_set(skb, uarg, &extra_uref);
- if ((flags & MSG_CONFIRM) && !skb_prev)
+ if ((msg->msg_flags & MSG_CONFIRM) && !skb_prev)
skb_set_dst_pending_confirm(skb, 1);
/*
@@ -1201,8 +1201,8 @@ static int __ip_append_data(struct sock *sk,
unsigned int off;
off = skb->len;
- if (getfrag(from, skb_put(skb, copy),
- offset, copy, off, skb) < 0) {
+ if (getfrag(msg, from, skb_put(skb, copy),
+ offset, copy, off, skb) < 0) {
__skb_trim(skb, off);
err = -EFAULT;
goto error;
@@ -1227,7 +1227,7 @@ static int __ip_append_data(struct sock *sk,
get_page(pfrag->page);
}
copy = min_t(int, copy, pfrag->size - pfrag->offset);
- if (getfrag(from,
+ if (getfrag(msg, from,
page_address(pfrag->page) + pfrag->offset,
offset, copy, skb->len, skb) < 0)
goto error_efault;
@@ -1320,17 +1320,14 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
*
* LATER: length must be adjusted by pad at tail, when it is required.
*/
-int ip_append_data(struct sock *sk, struct flowi4 *fl4,
- int getfrag(void *from, char *to, int offset, int len,
- int odd, struct sk_buff *skb),
- void *from, int length, int transhdrlen,
- struct ipcm_cookie *ipc, struct rtable **rtp,
- unsigned int flags)
+int ip_append_data(struct sock *sk, struct flowi4 *fl4, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int transhdrlen,
+ struct ipcm_cookie *ipc, struct rtable **rtp)
{
struct inet_sock *inet = inet_sk(sk);
int err;
- if (flags&MSG_PROBE)
+ if (msg->msg_flags & MSG_PROBE)
return 0;
if (skb_queue_empty(&sk->sk_write_queue)) {
@@ -1341,9 +1338,9 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
transhdrlen = 0;
}
- return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
- sk_page_frag(sk), getfrag,
- from, length, transhdrlen, flags);
+ return __ip_append_data(sk, fl4, msg, &sk->sk_write_queue,
+ &inet->cork.base, sk_page_frag(sk),
+ getfrag, from, transhdrlen);
}
ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
@@ -1629,16 +1626,16 @@ void ip_flush_pending_frames(struct sock *sk)
struct sk_buff *ip_make_skb(struct sock *sk,
struct flowi4 *fl4,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, int length, int transhdrlen,
+ struct msghdr *msg,
+ ip_getfrag_t getfrag,
+ int transhdrlen,
struct ipcm_cookie *ipc, struct rtable **rtp,
- struct inet_cork *cork, unsigned int flags)
+ struct inet_cork *cork)
{
struct sk_buff_head queue;
int err;
- if (flags & MSG_PROBE)
+ if (msg->msg_flags & MSG_PROBE)
return NULL;
__skb_queue_head_init(&queue);
@@ -1650,9 +1647,9 @@ struct sk_buff *ip_make_skb(struct sock *sk,
if (err)
return ERR_PTR(err);
- err = __ip_append_data(sk, fl4, &queue, cork,
+ err = __ip_append_data(sk, fl4, msg, &queue, cork,
&current->task_frag, getfrag,
- from, length, transhdrlen, flags);
+ msg, transhdrlen);
if (err) {
__ip_flush_pending_frames(sk, &queue, cork);
return ERR_PTR(err);
@@ -1664,7 +1661,7 @@ struct sk_buff *ip_make_skb(struct sock *sk,
/*
* Fetch data from kernel space and fill in checksum if needed.
*/
-static int ip_reply_glue_bits(void *dptr, char *to, int offset,
+static int ip_reply_glue_bits(struct msghdr *msg, void *dptr, char *to, int offset,
int len, int odd, struct sk_buff *skb)
{
__wsum csum;
@@ -1690,6 +1687,10 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
struct rtable *rt = skb_rtable(skb);
struct net *net = sock_net(sk);
struct sk_buff *nskb;
+ struct msghdr msg = {
+ .msg_flags = MSG_DONTWAIT,
+ .msg_iter.count = len,
+ };
int err;
int oif;
@@ -1730,8 +1731,8 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
sk->sk_bound_dev_if = arg->bound_dev_if;
sk->sk_sndbuf = READ_ONCE(sysctl_wmem_default);
ipc.sockc.mark = fl4.flowi4_mark;
- err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base,
- len, 0, &ipc, &rt, MSG_DONTWAIT);
+ err = ip_append_data(sk, &fl4, &msg, ip_reply_glue_bits, arg->iov->iov_base,
+ 0, &ipc, &rt);
if (unlikely(err)) {
ip_flush_pending_frames(sk);
goto out;
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index f689f9f530c9..e93e0a8849cb 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -617,13 +617,13 @@ EXPORT_SYMBOL_GPL(ping_err);
* starting from the payload.
*/
-int ping_getfrag(void *from, char *to,
+int ping_getfrag(struct msghdr *msg, void *from, char *to,
int offset, int fraglen, int odd, struct sk_buff *skb)
{
struct pingfakehdr *pfh = from;
if (!csum_and_copy_from_iter_full(to, fraglen, &pfh->wcheck,
- &pfh->msg->msg_iter))
+ &msg->msg_iter))
return -EFAULT;
#if IS_ENABLED(CONFIG_IPV6)
@@ -832,13 +832,11 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg)
pfh.icmph.checksum = 0;
pfh.icmph.un.echo.id = inet->inet_sport;
pfh.icmph.un.echo.sequence = user_icmph.un.echo.sequence;
- pfh.msg = msg;
pfh.wcheck = 0;
pfh.family = AF_INET;
- err = ip_append_data(sk, &fl4, ping_getfrag, &pfh, len,
- sizeof(struct icmphdr), &ipc, &rt,
- msg->msg_flags);
+ err = ip_append_data(sk, &fl4, msg, ping_getfrag, &pfh,
+ sizeof(struct icmphdr), &ipc, &rt);
if (err)
ip_flush_pending_frames(sk);
else
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f2859c117796..504045163f86 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -77,7 +77,6 @@
#include <linux/uio.h>
struct raw_frag_vec {
- struct msghdr *msg;
union {
struct icmphdr icmph;
char c[1];
@@ -420,7 +419,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
return err;
}
-static int raw_probe_proto_opt(struct raw_frag_vec *rfv, struct flowi4 *fl4)
+static int raw_probe_proto_opt(struct msghdr *msg, struct raw_frag_vec *rfv,
+ struct flowi4 *fl4)
{
int err;
@@ -430,7 +430,7 @@ static int raw_probe_proto_opt(struct raw_frag_vec *rfv, struct flowi4 *fl4)
/* We only need the first two bytes. */
rfv->hlen = 2;
- err = memcpy_from_msg(rfv->hdr.c, rfv->msg, rfv->hlen);
+ err = memcpy_from_msg(rfv->hdr.c, msg, rfv->hlen);
if (err)
return err;
@@ -440,8 +440,8 @@ static int raw_probe_proto_opt(struct raw_frag_vec *rfv, struct flowi4 *fl4)
return 0;
}
-static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
- struct sk_buff *skb)
+static int raw_getfrag(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb)
{
struct raw_frag_vec *rfv = from;
@@ -468,7 +468,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
offset -= rfv->hlen;
- return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb);
+ return ip_generic_getfrag(msg, NULL, to, offset, len, odd, skb);
}
static int raw_sendmsg(struct sock *sk, struct msghdr *msg)
@@ -608,10 +608,9 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg)
daddr, saddr, 0, 0, sk->sk_uid);
if (!hdrincl) {
- rfv.msg = msg;
rfv.hlen = 0;
- err = raw_probe_proto_opt(&rfv, &fl4);
+ err = raw_probe_proto_opt(msg, &rfv, &fl4);
if (err)
goto done;
}
@@ -640,9 +639,8 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg)
if (!ipc.addr)
ipc.addr = fl4.daddr;
lock_sock(sk);
- err = ip_append_data(sk, &fl4, raw_getfrag,
- &rfv, len, 0,
- &ipc, &rt, msg->msg_flags);
+ err = ip_append_data(sk, &fl4, msg, raw_getfrag,
+ &rfv, 0, &ipc, &rt);
if (err)
ip_flush_pending_frames(sk);
else if (!(msg->msg_flags & MSG_MORE)) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b2ed9d37a362..bb2e2e98c94c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1066,11 +1066,16 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg)
__be16 dport;
u8 tos;
int err, is_udplite = IS_UDPLITE(sk);
- int corkreq = READ_ONCE(up->corkflag) || msg->msg_flags&MSG_MORE;
- int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+ bool corkreq = READ_ONCE(up->corkflag);
+ ip_getfrag_t getfrag;
struct sk_buff *skb;
struct ip_options_data opt_copy;
+ if (corkreq)
+ msg->msg_flags |= MSG_MORE;
+ else
+ corkreq = msg->msg_flags & MSG_MORE;
+
if (len > 0xFFFF)
return -EMSGSIZE;
@@ -1258,9 +1263,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg)
if (!corkreq) {
struct inet_cork cork;
- skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
- sizeof(struct udphdr), &ipc, &rt,
- &cork, msg->msg_flags);
+ skb = ip_make_skb(sk, fl4, msg, getfrag,
+ sizeof(struct udphdr), &ipc, &rt, &cork);
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_send_skb(skb, fl4, &cork);
@@ -1289,9 +1293,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg)
do_append_data:
up->len += ulen;
- err = ip_append_data(sk, fl4, getfrag, msg, ulen,
- sizeof(struct udphdr), &ipc, &rt,
- corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
+ err = ip_append_data(sk, fl4, msg, getfrag, NULL,
+ sizeof(struct udphdr), &ipc, &rt);
if (err)
udp_flush_pending_frames(sk);
else if (!corkreq)
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 1f53f2a74480..92d94943bbee 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -313,7 +313,8 @@ struct icmpv6_msg {
uint8_t type;
};
-static int icmpv6_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb)
+static int icmpv6_getfrag(struct msghdr *_msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb)
{
struct icmpv6_msg *msg = (struct icmpv6_msg *) from;
struct sk_buff *org_skb = msg->skb;
@@ -453,6 +454,7 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
struct flowi6 fl6;
struct icmpv6_msg msg;
struct ipcm6_cookie ipc6;
+ struct msghdr msghdr;
int iif = 0;
int addr_type = 0;
int len;
@@ -606,14 +608,15 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
goto out_dst_release;
}
+ msghdr.msg_iter.count = len;
+ msghdr.msg_flags = MSG_DONTWAIT;
+
rcu_read_lock();
idev = __in6_dev_get(skb->dev);
- if (ip6_append_data(sk, icmpv6_getfrag, &msg,
- len + sizeof(struct icmp6hdr),
+ if (ip6_append_data(sk, &msghdr, icmpv6_getfrag, &msg,
sizeof(struct icmp6hdr),
- &ipc6, &fl6, (struct rt6_info *)dst,
- MSG_DONTWAIT)) {
+ &ipc6, &fl6, (struct rt6_info *)dst)) {
ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
ip6_flush_pending_frames(sk);
} else {
@@ -718,6 +721,7 @@ static enum skb_drop_reason icmpv6_echo_reply(struct sk_buff *skb)
struct icmpv6_msg msg;
struct dst_entry *dst;
struct ipcm6_cookie ipc6;
+ struct msghdr msghdr;
u32 mark = IP6_REPLY_MARK(net, skb->mark);
SKB_DR(reason);
bool acast;
@@ -796,10 +800,11 @@ static enum skb_drop_reason icmpv6_echo_reply(struct sk_buff *skb)
if (!icmp_build_probe(skb, (struct icmphdr *)&tmp_hdr))
goto out_dst_release;
- if (ip6_append_data(sk, icmpv6_getfrag, &msg,
- skb->len + sizeof(struct icmp6hdr),
+ msghdr.msg_iter.count = skb->len;
+ msghdr.msg_flags = MSG_DONTWAIT;
+ if (ip6_append_data(sk, &msghdr, icmpv6_getfrag, &msg,
sizeof(struct icmp6hdr), &ipc6, &fl6,
- (struct rt6_info *)dst, MSG_DONTWAIT)) {
+ (struct rt6_info *)dst)) {
__ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
ip6_flush_pending_frames(sk);
} else {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index e5ed39a3c65f..171a026d1dca 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1462,13 +1462,13 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
static int __ip6_append_data(struct sock *sk,
struct sk_buff_head *queue,
+ struct msghdr *msg,
struct inet_cork_full *cork_full,
struct inet6_cork *v6_cork,
struct page_frag *pfrag,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, size_t length, int transhdrlen,
- unsigned int flags, struct ipcm6_cookie *ipc6)
+ ip_getfrag_t getfrag,
+ void *from, int transhdrlen,
+ struct ipcm6_cookie *ipc6)
{
struct sk_buff *skb, *skb_prev = NULL;
struct inet_cork *cork = &cork_full->base;
@@ -1488,6 +1488,7 @@ static int __ip6_append_data(struct sock *sk,
int csummode = CHECKSUM_NONE;
unsigned int maxnonfragsize, headersize;
unsigned int wmem_alloc_delta = 0;
+ size_t length = msg_data_left(msg) + transhdrlen;
bool paged, extra_uref = false;
skb = skb_peek_tail(queue);
@@ -1555,11 +1556,11 @@ static int __ip6_append_data(struct sock *sk,
if (transhdrlen && sk->sk_protocol == IPPROTO_UDP &&
headersize == sizeof(struct ipv6hdr) &&
length <= mtu - headersize &&
- (!(flags & MSG_MORE) || cork->gso_size) &&
+ (!(msg->msg_flags & MSG_MORE) || cork->gso_size) &&
rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
csummode = CHECKSUM_PARTIAL;
- if ((flags & MSG_ZEROCOPY) && length) {
+ if ((msg->msg_flags & MSG_ZEROCOPY) && length) {
struct msghdr *msg = from;
if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1659,7 +1660,7 @@ static int __ip6_append_data(struct sock *sk,
*/
alloc_extra += sizeof(struct frag_hdr);
- if ((flags & MSG_MORE) &&
+ if ((msg->msg_flags & MSG_MORE) &&
!(rt->dst.dev->features&NETIF_F_SG))
alloclen = mtu;
else if (!paged &&
@@ -1689,7 +1690,7 @@ static int __ip6_append_data(struct sock *sk,
}
if (transhdrlen) {
skb = sock_alloc_send_skb(sk, alloclen,
- (flags & MSG_DONTWAIT), &err);
+ (msg->msg_flags & MSG_DONTWAIT), &err);
} else {
skb = NULL;
if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
@@ -1729,7 +1730,7 @@ static int __ip6_append_data(struct sock *sk,
pskb_trim_unique(skb_prev, maxfraglen);
}
if (copy > 0 &&
- getfrag(from, data + transhdrlen, offset,
+ getfrag(msg, from, data + transhdrlen, offset,
copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
@@ -1749,7 +1750,7 @@ static int __ip6_append_data(struct sock *sk,
tskey = 0;
skb_zcopy_set(skb, uarg, &extra_uref);
- if ((flags & MSG_CONFIRM) && !skb_prev)
+ if ((msg->msg_flags & MSG_CONFIRM) && !skb_prev)
skb_set_dst_pending_confirm(skb, 1);
/*
@@ -1772,8 +1773,8 @@ static int __ip6_append_data(struct sock *sk,
unsigned int off;
off = skb->len;
- if (getfrag(from, skb_put(skb, copy),
- offset, copy, off, skb) < 0) {
+ if (getfrag(msg, from, skb_put(skb, copy),
+ offset, copy, off, skb) < 0) {
__skb_trim(skb, off);
err = -EFAULT;
goto error;
@@ -1798,7 +1799,7 @@ static int __ip6_append_data(struct sock *sk,
get_page(pfrag->page);
}
copy = min_t(int, copy, pfrag->size - pfrag->offset);
- if (getfrag(from,
+ if (getfrag(msg, from,
page_address(pfrag->page) + pfrag->offset,
offset, copy, skb->len, skb) < 0)
goto error_efault;
@@ -1832,19 +1833,17 @@ static int __ip6_append_data(struct sock *sk,
return err;
}
-int ip6_append_data(struct sock *sk,
- int getfrag(void *from, char *to, int offset, int len,
- int odd, struct sk_buff *skb),
- void *from, size_t length, int transhdrlen,
+int ip6_append_data(struct sock *sk, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int transhdrlen,
struct ipcm6_cookie *ipc6, struct flowi6 *fl6,
- struct rt6_info *rt, unsigned int flags)
+ struct rt6_info *rt)
{
struct inet_sock *inet = inet_sk(sk);
struct ipv6_pinfo *np = inet6_sk(sk);
int exthdrlen;
int err;
- if (flags&MSG_PROBE)
+ if (msg->msg_flags & MSG_PROBE)
return 0;
if (skb_queue_empty(&sk->sk_write_queue)) {
/*
@@ -1858,15 +1857,14 @@ int ip6_append_data(struct sock *sk,
inet->cork.fl.u.ip6 = *fl6;
exthdrlen = (ipc6->opt ? ipc6->opt->opt_flen : 0);
- length += exthdrlen;
transhdrlen += exthdrlen;
} else {
transhdrlen = 0;
}
- return __ip6_append_data(sk, &sk->sk_write_queue, &inet->cork,
+ return __ip6_append_data(sk, &sk->sk_write_queue, msg, &inet->cork,
&np->cork, sk_page_frag(sk), getfrag,
- from, length, transhdrlen, flags, ipc6);
+ from, transhdrlen, ipc6);
}
EXPORT_SYMBOL_GPL(ip6_append_data);
@@ -2029,19 +2027,17 @@ void ip6_flush_pending_frames(struct sock *sk)
}
EXPORT_SYMBOL_GPL(ip6_flush_pending_frames);
-struct sk_buff *ip6_make_skb(struct sock *sk,
- int getfrag(void *from, char *to, int offset,
- int len, int odd, struct sk_buff *skb),
- void *from, size_t length, int transhdrlen,
+struct sk_buff *ip6_make_skb(struct sock *sk, struct msghdr *msg,
+ ip_getfrag_t getfrag, void *from, int transhdrlen,
struct ipcm6_cookie *ipc6, struct rt6_info *rt,
- unsigned int flags, struct inet_cork_full *cork)
+ struct inet_cork_full *cork)
{
struct inet6_cork v6_cork;
struct sk_buff_head queue;
int exthdrlen = (ipc6->opt ? ipc6->opt->opt_flen : 0);
int err;
- if (flags & MSG_PROBE) {
+ if (msg->msg_flags & MSG_PROBE) {
dst_release(&rt->dst);
return NULL;
}
@@ -2060,10 +2056,9 @@ struct sk_buff *ip6_make_skb(struct sock *sk,
if (ipc6->dontfrag < 0)
ipc6->dontfrag = inet6_sk(sk)->dontfrag;
- err = __ip6_append_data(sk, &queue, cork, &v6_cork,
+ err = __ip6_append_data(sk, &queue, msg, cork, &v6_cork,
&current->task_frag, getfrag, from,
- length + exthdrlen, transhdrlen + exthdrlen,
- flags, ipc6);
+ transhdrlen + exthdrlen, ipc6);
if (err) {
__ip6_flush_pending_frames(sk, &queue, cork, &v6_cork);
return ERR_PTR(err);
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 54c94b28744f..0380d3230814 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -166,17 +166,16 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg)
pfh.icmph.checksum = 0;
pfh.icmph.un.echo.id = inet->inet_sport;
pfh.icmph.un.echo.sequence = user_icmph.icmp6_sequence;
- pfh.msg = msg;
pfh.wcheck = 0;
pfh.family = AF_INET6;
if (ipc6.hlimit < 0)
ipc6.hlimit = ip6_sk_dst_hoplimit(np, &fl6, dst);
+ msg->msg_flags = MSG_DONTWAIT;
lock_sock(sk);
- err = ip6_append_data(sk, ping_getfrag, &pfh, len,
- sizeof(struct icmp6hdr), &ipc6, &fl6, rt,
- MSG_DONTWAIT);
+ err = ip6_append_data(sk, msg, ping_getfrag, &pfh,
+ sizeof(struct icmp6hdr), &ipc6, &fl6, rt);
if (err) {
ICMP6_INC_STATS(sock_net(sk), rt->rt6i_idev,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a3437deeeb74..2affd7589939 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -678,18 +678,18 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
}
struct raw6_frag_vec {
- struct msghdr *msg;
int hlen;
char c[4];
};
-static int rawv6_probe_proto_opt(struct raw6_frag_vec *rfv, struct flowi6 *fl6)
+static int rawv6_probe_proto_opt(struct raw6_frag_vec *rfv, struct flowi6 *fl6,
+ struct msghdr *msg)
{
int err = 0;
switch (fl6->flowi6_proto) {
case IPPROTO_ICMPV6:
rfv->hlen = 2;
- err = memcpy_from_msg(rfv->c, rfv->msg, rfv->hlen);
+ err = memcpy_from_msg(rfv->c, msg, rfv->hlen);
if (!err) {
fl6->fl6_icmp_type = rfv->c[0];
fl6->fl6_icmp_code = rfv->c[1];
@@ -697,15 +697,15 @@ static int rawv6_probe_proto_opt(struct raw6_frag_vec *rfv, struct flowi6 *fl6)
break;
case IPPROTO_MH:
rfv->hlen = 4;
- err = memcpy_from_msg(rfv->c, rfv->msg, rfv->hlen);
+ err = memcpy_from_msg(rfv->c, msg, rfv->hlen);
if (!err)
fl6->fl6_mh_type = rfv->c[2];
}
return err;
}
-static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
- struct sk_buff *skb)
+static int raw6_getfrag(struct msghdr *msg, void *from, char *to,
+ int offset, int len, int odd, struct sk_buff *skb)
{
struct raw6_frag_vec *rfv = from;
@@ -732,7 +732,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
offset -= rfv->hlen;
- return ip_generic_getfrag(rfv->msg, to, offset, len, odd, skb);
+ return ip_generic_getfrag(msg, NULL, to, offset, len, odd, skb);
}
static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg)
@@ -868,9 +868,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg)
fl6.flowi6_mark = ipc6.sockc.mark;
if (!hdrincl) {
- rfv.msg = msg;
rfv.hlen = 0;
- err = rawv6_probe_proto_opt(&rfv, &fl6);
+ err = rawv6_probe_proto_opt(&rfv, &fl6, msg);
if (err)
goto out;
}
@@ -919,9 +918,8 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg)
else {
ipc6.opt = opt;
lock_sock(sk);
- err = ip6_append_data(sk, raw6_getfrag, &rfv,
- len, 0, &ipc6, &fl6, (struct rt6_info *)dst,
- msg->msg_flags);
+ err = ip6_append_data(sk, msg, raw6_getfrag, &rfv,
+ 0, &ipc6, &fl6, (struct rt6_info *)dst);
if (err)
ip6_flush_pending_frames(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 80f2eb58ba1a..5bb67739bc0d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1345,10 +1345,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg)
bool connected = false;
size_t len = msg_data_left(msg);
int ulen = len;
- int corkreq = READ_ONCE(up->corkflag) || msg->msg_flags&MSG_MORE;
+ int corkreq = READ_ONCE(up->corkflag);
int err;
int is_udplite = IS_UDPLITE(sk);
- int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
+ ip_getfrag_t getfrag;
+
+ if (corkreq)
+ msg->msg_flags |= MSG_MORE;
+ else
+ corkreq = msg->msg_flags & MSG_MORE;
ipcm6_init(&ipc6);
ipc6.gso_size = READ_ONCE(up->gso_size);
@@ -1578,10 +1583,9 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg)
if (!corkreq) {
struct sk_buff *skb;
- skb = ip6_make_skb(sk, getfrag, msg, ulen,
+ skb = ip6_make_skb(sk, msg, getfrag, NULL,
sizeof(struct udphdr), &ipc6,
- (struct rt6_info *)dst,
- msg->msg_flags, &cork);
+ (struct rt6_info *)dst, &cork);
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, fl6, &cork.base);
@@ -1606,9 +1610,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg)
if (ipc6.dontfrag < 0)
ipc6.dontfrag = np->dontfrag;
up->len += ulen;
- err = ip6_append_data(sk, getfrag, msg, ulen, sizeof(struct udphdr),
- &ipc6, fl6, (struct rt6_info *)dst,
- corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
+ err = ip6_append_data(sk, msg, getfrag, NULL, sizeof(struct udphdr),
+ &ipc6, fl6, (struct rt6_info *)dst);
if (err)
udp_v6_flush_pending_frames(sk);
else if (!corkreq)
David Howells wrote:
> Hi Willem,
>
> Here's another option for passing MSG_SPLICE_PAGES into sendmsg()[1] without
> polluting the flags in msg->msg_flags. The idea here is to put the flag
> into a new field in msghdr, msg_kflags, that holds internal kernel flags
> that aren't available to userspace.
>
> What I've done here is:
>
> (1) Pass msg down to __ip_append_data() and __ip6_append_data() so that
> they can access the extra flags.
>
> (2) In order to avoid adding extra arguments to these functions and the
> functions in their call chains (such as ip_make_skb()), remove the
> size and flags arguments as these values are redundant if msg is
> passed in.
>
> (3) msg is then passed into getfrag(). I would like to get rid of the
> "from" argument also in favour of using something in msghdr, but I'm
> not sure how best to do that.
>
> (4) The size parameter to ->sendmsg() seems to be redundant; indeed
> sock_sendmsg() doesn't actually take it, but rather gets the count
> from msg_iter - so remove this parameter.
>
> kernel_sendmsg() will still take a size, but it sets it on the
> iterator and then calls sock_sendmsg().
>
> (5) Protocol sendmsg implementations then extract the length and the flags
> from the iterator.
>
> (6) Illustrate the addition of msg_kflags and MSG_SPLICE_PAGES. I think
> that, at some point in the future, some of the other flags could be
> moved from msg_flags to msg_kflags.
>
> David
>
> Link: https://lore.kernel.org/r/[email protected]/ [1]
>
> David Howells (3):
> net: Drop the size argument from ->sendmsg()
> ip: Make __ip{,6}_append_data() and co. take a msghdr*
> net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
>
> [per-file diffstat snipped -- same as in the cover letter above]
> 110 files changed, 444 insertions(+), 456 deletions(-)
That's a significant code change if only for this purpose.
If this bit is undefined and ignored by all socket families today,
masking it out in sock_sendmsg should be enough to start using it
safely as an internal flag.
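Something like this at the point where the user-supplied flags land in
the in-kernel msghdr, I suppose (a sketch; msg_sys being the msghdr that
____sys_sendmsg() assembles):

        /* Don't let userspace smuggle in kernel-internal flags. */
        msg_sys->msg_flags &= ~MSG_SPLICE_PAGES;

before the msghdr is handed to the protocol's ->sendmsg().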
Herbert Xu <[email protected]> wrote:
> David Howells <[email protected]> wrote:
> > Remove hash_sendpage*() and use hash_sendmsg() as the latter seems to just
> > use the source pages directly anyway.
>
> ...
>
> > - if (!(flags & MSG_MORE)) {
> > - if (ctx->more)
> > - err = crypto_ahash_finup(&ctx->req);
> > - else
> > - err = crypto_ahash_digest(&ctx->req);
>
> You've just removed the optimised path from user-space to
> finup/digest. You need to add them back to sendmsg if you
> want to eliminate sendpage.
I must be missing something, I think. What's particularly optimal about the
code in hash_sendpage() but not hash_sendmsg()? Is it that the former uses
finup/digest, but the latter only does update+final?
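If so, the difference would be roughly this, I take it (a sketch, using
the existing crypto_ahash_*() calls):

        /* hash_sendpage() with !MSG_MORE: one combined operation */
        err = crypto_ahash_finup(&ctx->req);    /* update + final in one go */

        /* versus hash_sendmsg(): separate passes */
        err = crypto_ahash_update(&ctx->req);
        ...
        err = crypto_ahash_final(&ctx->req);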
Also, looking at:
        if (!ctx->more) {
                if ((msg->msg_flags & MSG_MORE))
                        hash_free_result(sk, ctx);
how is ctx->more meant to be interpreted? I'm guessing it means that we're
continuing on from the previous op. But why do we need to free any old result
if MSG_MORE is set, but not if it isn't?
David