2023-03-31 16:12:24

by David Howells

Subject: [PATCH v3 00/55] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)

Hi Willy, Dave, et al.,

I've been looking at how to make pipes handle the splicing in of multipage
folios and also looking to see if I could implement a suggestion from Willy
that pipe_buffers could perhaps hold a list of pages (which could make
splicing simpler - an entire splice segment would go in a single
pipe_buffer).

There are a couple of issues here:

(1) Gifting/stealing a multipage folio is really tricky. I think that if
a multipage folio is gifted, the gift flag should be quietly dropped.
Userspace has no control over what splice() and vmsplice() will see in
the pagecache.

(2) The sendpage op expects to be given a single page and various network
protocols just attach that to a socket buffer.

This patchset aims to deal with the second by removing the ->sendpage()
operation and replacing it with sendmsg() and a new internal flag
MSG_SPLICE_PAGES. As sendmsg() takes an I/O iterator, this also affords
the opportunity to pass a slew of pages in one go, rather than one at a
time.

If MSG_SPLICE_PAGES is set, the protocol sendmsg() instance will attempt to
splice the pages out of the buffer, copying into individual fragments those
that it can't (e.g. because they belong to the slab).
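
As a rough userspace model of the splice-or-copy behaviour described above (this is not kernel code; struct toy_page, struct toy_frag, toy_send_pages() and the is_slab field are all invented for illustration), each page is either referenced in place or, if it can't be spliced, copied into a private fragment:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical stand-in for a page; is_slab marks memory that must be copied. */
struct toy_page {
	bool is_slab;
	char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for one fragment attached to a socket buffer. */
struct toy_frag {
	const char *data;	/* the source page itself, or a private copy */
	bool copied;		/* true if data was allocated by toy_send_pages() */
};

/* Free any private copies held by the first n fragments. */
static void toy_release_frags(struct toy_frag *frags, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (frags[i].copied)
			free((void *)frags[i].data);
		frags[i].data = NULL;
		frags[i].copied = false;
	}
}

/*
 * Attach n pages to frags[]: spliceable pages are referenced in place,
 * unspliceable ones are copied.  Returns the number of pages copied, or
 * -1 if a copy could not be allocated.
 */
static int toy_send_pages(const struct toy_page *pages, size_t n,
			  struct toy_frag *frags)
{
	int copies = 0;

	assert(pages && frags);
	for (size_t i = 0; i < n; i++) {
		char *copy;

		if (!pages[i].is_slab) {
			frags[i].data = pages[i].data;
			frags[i].copied = false;
			continue;
		}

		copy = malloc(TOY_PAGE_SIZE);
		if (!copy) {
			toy_release_frags(frags, i);
			return -1;
		}
		memcpy(copy, pages[i].data, TOY_PAGE_SIZE);
		frags[i].data = copy;
		frags[i].copied = true;
		copies++;
	}
	return copies;
}
```

The decision is made per page, so one unspliceable page in a batch doesn't force the rest of the batch to be copied.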

The patchset consists of the following parts:

(1) A couple of fixes.

(2) Define the MSG_SPLICE_PAGES flag.

(3) The page_frag_alloc_align() allocator is overhauled:

(a) Split it out from mm/page_alloc.c into its own file,
mm/page_frag_alloc.c.

(b) Make it use multipage folios rather than compound pages.

(c) Give it per-cpu buckets to allocate from so no locking is
required.

(d) The netdev_alloc_cache and the napi fragment cache are then cast
in terms of this and some private allocators are removed.

I'm not sure that the existing allocator is 100% multithread safe.

(4) Implement MSG_SPLICE_PAGES support in TCP.

(5) Make MSG_SPLICE_PAGES copy unspliceable pages (e.g. slab pages).

(6) Make do_tcp_sendpages() just wrap sendmsg() and then fold it into its
various callers.

(7) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
a wrapper around sendmsg().

(8) Make IP/UDP copy unspliceable pages.

(9) Implement MSG_SPLICE_PAGES support in AF_UNIX.

(10) Make AF_UNIX copy unspliceable pages.

(11) Make AF_ALG use netfs_extract_iter_to_sg().

(12) Make AF_ALG implement MSG_SPLICE_PAGES and make af_alg_sendpage() just
a wrapper around sendmsg().

(13) Make AF_ALG/hash implement MSG_SPLICE_PAGES.

(14) Make TLS implement MSG_SPLICE_PAGES and make its sendpage
implementations just wrappers.

[!] Note that tls_sw_sendpage_locked() appears to have the wrong
locking upstream. I think the caller will only hold the socket
lock, but it should hold tls_ctx->tx_lock too.

(15) Make Chelsio's chtls implement MSG_SPLICE_PAGES.

(16) Make AF_KCM implement MSG_SPLICE_PAGES.

(17) Rename pipe_to_sendpage() to pipe_to_sendmsg() and make it a wrapper
around sendmsg().

(18) Replace splice_to_socket() with an implementation that doesn't use
splice_from_pipe() to push one page at a time, but rather something
that splices up to 16 pages at once. This absorbs pipe_to_sendmsg().

(19) Remove sendpage file operation.

(20) Convert siw, ceph, iscsi and tcp_bpf to use sendmsg() instead of
tcp_sendpage().

(21) Make skb_send_sock() use sendmsg().

(22) Convert ceph, rds, dlm, sunrpc, nvme, kcm, smc, ocfs2 and drbd to use
sendmsg().

(23) Make drbd pass an entire bio's bvec to sendmsg at a time and delegate
the copying of unspliceable pages (such as slab pages) to TCP.

(24) Remove the sendpage socket operation.
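
To illustrate the batching in item (18), here is a minimal userspace sketch of pushing buffers in groups of up to 16 rather than one at a time. The names toy_push_batched() and toy_send_fn are hypothetical, and the callback merely stands in for a single sendmsg() call; the real splice_to_socket() is not shown in this posting and does considerably more.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BATCH 16	/* the batch size named in item (18) */

/* Hypothetical callback standing in for one sendmsg() call over 'count' buffers. */
typedef void (*toy_send_fn)(size_t first, size_t count, void *ctx);

/*
 * Push nbufs buffers in batches of at most TOY_MAX_BATCH.  Returns the
 * number of calls made; pushing one buffer at a time would need nbufs calls.
 */
static size_t toy_push_batched(size_t nbufs, toy_send_fn send, void *ctx)
{
	size_t calls = 0;

	for (size_t first = 0; first < nbufs; first += TOY_MAX_BATCH) {
		size_t count = nbufs - first;

		if (count > TOY_MAX_BATCH)
			count = TOY_MAX_BATCH;
		assert(count > 0);
		if (send)
			send(first, count, ctx);
		calls++;
	}
	return calls;
}
```

For 40 buffers this makes three calls (16 + 16 + 8) where a page-at-a-time loop would make 40.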

I've killed off all uses of kernel_sendpage() and all uses of sendpage_ok()
outside of the protocols.

I have tested AF_UNIX splicing - which, surprisingly, seems nearly twice as
fast - TCP splicing, the siw driver (softIWarp RDMA with nfs and cifs),
sunrpc (with nfsd), UDP (using a patched rxrpc) and TLS/sw.

I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-sendpage

David

Changes
=======
ver #3)
- Dropped the iterator-of-iterators patch.
- Only expunge MSG_SPLICE_PAGES in sys_send[m]msg, not sys_recv[m]msg.
- Split MSG_SPLICE_PAGES code in __ip_append_data() out into helper
functions.
- Implement MSG_SPLICE_PAGES support in __ip6_append_data() using the
above helper functions.
- Rename 'xlength' to 'initial_length'.
- Minimise the changes to sunrpc for the moment.
- Don't give -EOPNOTSUPP if NETIF_F_SG not available, just copy instead.
- Implemented MSG_SPLICE_PAGES support in the TLS, Chelsio-TLS and AF_KCM
code.

ver #2)
- Overhauled the page_frag_alloc() allocator: large folios and per-cpu.
- Got rid of my own zerocopy allocator.
- Use iov_iter_extract_pages() rather than poking in iter->bvec.
- Made page splicing fall back to page copying on a page-by-page basis.
- Made splice_to_socket() pass 16 pipe buffers at a time.
- Made AF_ALG/hash use finup/digest where possible in sendmsg.
- Added an iterator-of-iterators, ITER_ITERLIST.
- Made sunrpc use the iterator-of-iterators.
- Converted more drivers.

Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2

David Howells (55):
netfs: Fix netfs_extract_iter_to_sg() for ITER_UBUF/IOVEC
iov_iter: Remove last_offset member
net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
mm: Move the page fragment allocator from page_alloc.c into its own
file
mm: Make the page_frag_cache allocator use multipage folios
mm: Make the page_frag_cache allocator use per-cpu
tcp: Support MSG_SPLICE_PAGES
tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
tcp_sendmsg
espintcp: Inline do_tcp_sendpages()
tls: Inline do_tcp_sendpages()
siw: Inline do_tcp_sendpages()
tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
ip, udp: Support MSG_SPLICE_PAGES
ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
ip6, udp6: Support MSG_SPLICE_PAGES
udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
af_unix: Support MSG_SPLICE_PAGES
af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
crypto: af_alg: Pin pages rather than ref'ing if appropriate
crypto: af_alg: Use netfs_extract_iter_to_sg() to create scatterlists
crypto: af_alg: Indent the loop in af_alg_sendmsg()
crypto: af_alg: Support MSG_SPLICE_PAGES
crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
crypto: af_alg/hash: Support MSG_SPLICE_PAGES
tls/device: Support MSG_SPLICE_PAGES
tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES
tls/sw: Support MSG_SPLICE_PAGES
tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
chelsio: Support MSG_SPLICE_PAGES
chelsio: Convert chtls_sendpage() to use MSG_SPLICE_PAGES
kcm: Support MSG_SPLICE_PAGES
kcm: Convert kcm_sendpage() to use MSG_SPLICE_PAGES
splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
splice, net: Reimplement splice_to_socket() to pass multiple bufs to
sendmsg()
Remove file->f_op->sendpage
siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()
tcp_bpf: Make tcp_bpf_sendpage() go through
tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
algif: Remove hash_sendpage*()
ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
nvme: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
kcm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES
ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
drbd: Send an entire bio in a single sendmsg
sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

Documentation/networking/scaling.rst | 4 +-
crypto/Kconfig | 1 +
crypto/af_alg.c | 194 +++++--------
crypto/algif_aead.c | 52 ++--
crypto/algif_hash.c | 171 +++++------
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 24 +-
drivers/block/drbd/drbd_main.c | 86 ++----
drivers/infiniband/sw/siw/siw_qp_tx.c | 227 +++------------
.../chelsio/inline_crypto/chtls/chtls.h | 2 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 169 ++++-------
.../chelsio/inline_crypto/chtls/chtls_main.c | 1 -
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 19 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.h | 2 -
drivers/nvme/host/tcp.c | 63 ++--
drivers/nvme/target/tcp.c | 69 +++--
drivers/scsi/iscsi_tcp.c | 31 +-
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/scsi/libiscsi_tcp.c | 13 +-
drivers/target/iscsi/iscsi_target_util.c | 14 +-
fs/dlm/lowcomms.c | 10 +-
fs/netfs/iterator.c | 2 +-
fs/ocfs2/cluster/tcp.c | 107 +++----
fs/splice.c | 158 ++++++++--
include/crypto/if_alg.h | 7 +-
include/linux/fs.h | 3 -
include/linux/gfp.h | 17 +-
include/linux/mm_types.h | 13 +-
include/linux/net.h | 8 -
include/linux/socket.h | 3 +
include/linux/splice.h | 2 +
include/linux/sunrpc/svc.h | 11 +-
include/linux/uio.h | 5 +-
include/net/inet_common.h | 2 -
include/net/ip.h | 4 +
include/net/sock.h | 6 -
include/net/tcp.h | 2 -
include/net/tls.h | 2 +-
mm/Makefile | 2 +-
mm/page_alloc.c | 126 --------
mm/page_frag_alloc.c | 201 +++++++++++++
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/ceph/messenger_v1.c | 58 ++--
net/ceph/messenger_v2.c | 89 ++----
net/core/skbuff.c | 81 +++---
net/core/sock.c | 35 +--
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 --
net/ipv4/ip_output.c | 122 +++++++-
net/ipv4/tcp.c | 274 ++++++------------
net/ipv4/tcp_bpf.c | 72 +----
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 54 ----
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/ip6_output.c | 28 +-
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/kcm/kcmsock.c | 249 ++++++----------
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rds/tcp_send.c | 86 +++---
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/smc/af_smc.c | 29 --
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 16 -
net/smc/smc_tx.h | 2 -
net/socket.c | 76 +----
net/sunrpc/svcsock.c | 38 +--
net/tipc/socket.c | 3 -
net/tls/tls_device.c | 91 +++---
net/tls/tls_main.c | 31 +-
net/tls/tls_sw.c | 215 ++++++--------
net/unix/af_unix.c | 254 +++++++---------
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
net/xfrm/espintcp.c | 10 +-
102 files changed, 1519 insertions(+), 2301 deletions(-)
create mode 100644 mm/page_frag_alloc.c


2023-03-31 16:12:25

by David Howells

Subject: [PATCH v3 02/55] iov_iter: Remove last_offset member

With the removal of ITER_PIPE, the last_offset member of struct iov_iter is
no longer used, so remove it and un-unionise the remaining member.

Signed-off-by: David Howells <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Alexander Viro <[email protected]>
cc: Jeff Layton <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/uio.h | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 74598426edb4..2d8a70cb9b26 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -43,10 +43,7 @@ struct iov_iter {
bool nofault;
bool data_source;
bool user_backed;
- union {
- size_t iov_offset;
- int last_offset;
- };
+ size_t iov_offset;
size_t count;
union {
const struct iovec *iov;

2023-03-31 16:12:29

by David Howells

Subject: [PATCH v3 03/55] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag

Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can. This flag is added to a list that
is cleared by the sendmsg and sendmmsg syscalls on entry.

This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.
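
A minimal sketch of the masking that keeps userspace from setting the internal flag, using the flag values from the socket.h hunk in this patch. The helper name sanitise_user_send_flags() is hypothetical; the patch itself open-codes the mask in the syscall paths.

```c
#include <assert.h>

/* Values copied from the include/linux/socket.h hunk in this patch. */
#define MSG_ZEROCOPY			0x4000000
#define MSG_SPLICE_PAGES		0x8000000
#define MSG_INTERNAL_SENDMSG_FLAGS	(MSG_SPLICE_PAGES)

/*
 * Hypothetical helper: strip kernel-internal flags from a flags word
 * supplied by userspace, leaving all other flags untouched.
 */
static unsigned int sanitise_user_send_flags(unsigned int flags)
{
	unsigned int out = flags & ~MSG_INTERNAL_SENDMSG_FLAGS;

	assert((out & MSG_SPLICE_PAGES) == 0);
	return out;
}
```

The mask only helps if it is applied before the flags are stored in the msghdr that is passed down to the protocol.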

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/socket.h | 3 +++
net/socket.c | 2 ++
2 files changed, 5 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..bd1cc3238851 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -327,6 +327,7 @@ struct ucred {
*/

#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
+#define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file
descriptor received through
@@ -337,6 +338,8 @@ struct ucred {
#define MSG_CMSG_COMPAT 0 /* We never have 32 bit fixups */
#endif

+/* Flags to be cleared on entry by sendmsg and sendmmsg syscalls */
+#define MSG_INTERNAL_SENDMSG_FLAGS (MSG_SPLICE_PAGES)

/* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
#define SOL_IP 0
diff --git a/net/socket.c b/net/socket.c
index 6bae8ce7059e..0c39ce57d603 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2139,6 +2139,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
msg.msg_name = (struct sockaddr *)&address;
msg.msg_namelen = addr_len;
}
+ flags &= ~MSG_INTERNAL_SENDMSG_FLAGS;
if (sock->file->f_flags & O_NONBLOCK)
flags |= MSG_DONTWAIT;
msg.msg_flags = flags;
@@ -2486,6 +2487,7 @@ static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys,
}
+ flags &= ~MSG_INTERNAL_SENDMSG_FLAGS;
msg_sys->msg_flags = flags;

if (sock->file->f_flags & O_NONBLOCK)
msg_sys->msg_flags |= MSG_DONTWAIT;
/*

2023-03-31 16:12:42

by David Howells

Subject: [PATCH v3 05/55] mm: Make the page_frag_cache allocator use multipage folios

Change the page_frag_cache allocator to use multipage folios rather than
groups of pages. This reduces page_frag_free to just a folio_put() or
put_page().

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/mm_types.h | 13 ++----
mm/page_frag_alloc.c | 88 +++++++++++++++++++---------------------
2 files changed, 45 insertions(+), 56 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0722859c3647..49a70b3f44a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
}

struct page_frag_cache {
- void * va;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- __u16 offset;
- __u16 size;
-#else
- __u32 offset;
-#endif
+ struct folio *folio;
+ unsigned int offset;
/* we maintain a pagecount bias, so that we dont dirty cache line
* containing page->_refcount every time we allocate a fragment.
*/
- unsigned int pagecnt_bias;
- bool pfmemalloc;
+ unsigned int pagecnt_bias;
+ bool pfmemalloc;
};

typedef unsigned long vm_flags_t;
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index bee95824ef8f..c3792b68ce32 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,33 +16,34 @@
#include <linux/init.h>
#include <linux/mm.h>

-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
- gfp_t gfp_mask)
+/*
+ * Allocate a new folio for the frag cache.
+ */
+static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
+ gfp_t gfp_mask)
{
- struct page *page = NULL;
+ struct folio *folio = NULL;
gfp_t gfp = gfp_mask;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
- __GFP_NOMEMALLOC;
- page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- PAGE_FRAG_CACHE_MAX_ORDER);
- nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+ gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+ folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
#endif
- if (unlikely(!page))
- page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
-
- nc->va = page ? page_address(page) : NULL;
+ if (unlikely(!folio))
+ folio = folio_alloc(gfp, 0);

- return page;
+ if (folio)
+ nc->folio = folio;
+ return folio;
}

void __page_frag_cache_drain(struct page *page, unsigned int count)
{
- VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+ struct folio *folio = page_folio(page);

- if (page_ref_sub_and_test(page, count - 1))
- __free_pages(page, compound_order(page));
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+
+ folio_put_refs(folio, count);
}
EXPORT_SYMBOL(__page_frag_cache_drain);

@@ -50,54 +51,47 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align_mask)
{
- unsigned int size = PAGE_SIZE;
- struct page *page;
- int offset;
+ struct folio *folio = nc->folio;
+ size_t offset;

- if (unlikely(!nc->va)) {
+ if (unlikely(!folio)) {
refill:
- page = __page_frag_cache_refill(nc, gfp_mask);
- if (!page)
+ folio = page_frag_cache_refill(nc, gfp_mask);
+ if (!folio)
return NULL;

-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
- page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+ folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);

/* reset page count bias and offset to start of new frag */
- nc->pfmemalloc = page_is_pfmemalloc(page);
+ nc->pfmemalloc = folio_is_pfmemalloc(folio);
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = size;
+ nc->offset = folio_size(folio);
}

- offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
- page = virt_to_page(nc->va);
-
- if (page_ref_count(page) != nc->pagecnt_bias)
+ offset = nc->offset;
+ if (unlikely(fragsz > offset)) {
+ /* Reuse the folio if everyone we gave it to has finished with it. */
+ if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
+ nc->folio = NULL;
goto refill;
+ }
+
if (unlikely(nc->pfmemalloc)) {
- page_ref_sub(page, nc->pagecnt_bias - 1);
- __free_pages(page, compound_order(page));
+ __folio_put(folio);
+ nc->folio = NULL;
goto refill;
}

-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* OK, page count is 0, we can safely set it */
- set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+ folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);

/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = size - fragsz;
- if (unlikely(offset < 0)) {
+ offset = folio_size(folio);
+ if (unlikely(fragsz > offset)) {
/*
* The caller is trying to allocate a fragment
* with fragsz > PAGE_SIZE but the cache isn't big
@@ -107,15 +101,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
* it could make memory pressure worse
* so we simply return NULL here.
*/
+ nc->offset = offset;
return NULL;
}
}

nc->pagecnt_bias--;
+ offset -= fragsz;
offset &= align_mask;
nc->offset = offset;

- return nc->va + offset;
+ return folio_address(folio) + offset;
}
EXPORT_SYMBOL(page_frag_alloc_align);

@@ -124,8 +120,6 @@ EXPORT_SYMBOL(page_frag_alloc_align);
*/
void page_frag_free(void *addr)
{
- struct page *page = virt_to_head_page(addr);
-
- __free_pages(page, compound_order(page));
+ folio_put(virt_to_folio(addr));
}
EXPORT_SYMBOL(page_frag_free);

2023-03-31 16:12:45

by David Howells

Subject: [PATCH v3 06/55] mm: Make the page_frag_cache allocator use per-cpu

Make the page_frag_cache allocator have a separate allocation bucket for
each cpu to avoid racing. This means that no lock is required, other than
preempt disablement, to allocate from it, though if a softirq wants to
access it, then softirq disablement will need to be added.

Make the NVMe and mediatek drivers pass in NULL to page_frag_alloc() and
use the default allocation buckets rather than defining their own.
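
A simplified userspace model of the per-bucket carving, for illustration only. It keeps the downward offset arithmetic used in the diff, (offset - fragsz) & align_mask, but takes the CPU number as an explicit argument instead of using get_cpu_ptr(), and it omits the refcount bias and the refill path entirely. All toy_* names and sizes are invented.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS	4
#define TOY_BUCKET_SIZE	4096

/* Hypothetical per-CPU bucket: a fixed buffer carved from the top down. */
struct toy_frag_cache {
	_Alignas(64) unsigned char mem[TOY_BUCKET_SIZE];
	size_t offset;		/* fragments are handed out below this offset */
	int initialised;
};

static struct toy_frag_cache toy_buckets[TOY_NR_CPUS];

/*
 * Carve fragsz bytes from the bucket belonging to 'cpu'.  align_mask has
 * the low bits clear for the wanted alignment (e.g. ~63UL for 64 bytes).
 * Returns NULL when the bucket is exhausted; the kernel code would refill
 * with a new folio at that point, which this model does not attempt.
 */
static void *toy_frag_alloc(unsigned int cpu, size_t fragsz,
			    unsigned long align_mask)
{
	struct toy_frag_cache *nc;

	assert(cpu < TOY_NR_CPUS);
	nc = &toy_buckets[cpu];
	if (!nc->initialised) {
		nc->offset = TOY_BUCKET_SIZE;
		nc->initialised = 1;
	}

	if (fragsz == 0 || fragsz > nc->offset)
		return NULL;

	nc->offset = (nc->offset - fragsz) & align_mask;
	return nc->mem + nc->offset;
}
```

Because each CPU only ever touches its own bucket, two CPUs allocating at the same time never contend on the same offset; in the kernel, that property additionally depends on preemption being disabled while the bucket is in use.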

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Lorenzo Bianconi <[email protected]>
cc: Felix Fietkau <[email protected]>
cc: John Crispin <[email protected]>
cc: Sean Wang <[email protected]>
cc: Mark Lee <[email protected]>
cc: Keith Busch <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Sagi Grimberg <[email protected]>
cc: Chaitanya Kulkarni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 19 +--
drivers/net/ethernet/mediatek/mtk_wed_wo.h | 2 -
drivers/nvme/host/tcp.c | 19 +--
drivers/nvme/target/tcp.c | 22 +--
include/linux/gfp.h | 17 +-
mm/page_frag_alloc.c | 182 +++++++++++++++------
net/core/skbuff.c | 32 ++--
7 files changed, 164 insertions(+), 129 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 69fba29055e9..859f34447f2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -143,7 +143,7 @@ mtk_wed_wo_queue_refill(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q,
dma_addr_t addr;
void *buf;

- buf = page_frag_alloc(&q->cache, q->buf_size, GFP_ATOMIC);
+ buf = page_frag_alloc(NULL, q->buf_size, GFP_ATOMIC);
if (!buf)
break;

@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
static void
mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
{
- struct page *page;
int i;

for (i = 0; i < q->n_desc; i++) {
@@ -297,20 +296,11 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
skb_free_frag(entry->buf);
entry->buf = NULL;
}
-
- if (!q->cache.va)
- return;
-
- page = virt_to_page(q->cache.va);
- __page_frag_cache_drain(page, q->cache.pagecnt_bias);
- memset(&q->cache, 0, sizeof(q->cache));
}

static void
mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
{
- struct page *page;
-
for (;;) {
void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);

@@ -319,13 +309,6 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)

skb_free_frag(buf);
}
-
- if (!q->cache.va)
- return;
-
- page = virt_to_page(q->cache.va);
- __page_frag_cache_drain(page, q->cache.pagecnt_bias);
- memset(&q->cache, 0, sizeof(q->cache));
}

static void
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.h b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
index dbcf42ce9173..6f940db67fb8 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.h
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
@@ -210,8 +210,6 @@ struct mtk_wed_wo_queue_entry {
struct mtk_wed_wo_queue {
struct mtk_wed_wo_queue_regs regs;

- struct page_frag_cache cache;
-
struct mtk_wed_wo_queue_desc *desc;
dma_addr_t desc_dma;

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7723a4989524..fa32969b532f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -147,8 +147,6 @@ struct nvme_tcp_queue {
__le32 exp_ddgst;
__le32 recv_ddgst;

- struct page_frag_cache pf_cache;
-
void (*state_change)(struct sock *);
void (*data_ready)(struct sock *);
void (*write_space)(struct sock *);
@@ -470,9 +468,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
u8 hdgst = nvme_tcp_hdgst_len(queue);

- req->pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
- GFP_KERNEL | __GFP_ZERO);
+ req->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!req->pdu)
return -ENOMEM;

@@ -1288,9 +1285,8 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
struct nvme_tcp_request *async = &ctrl->async_req;
u8 hdgst = nvme_tcp_hdgst_len(queue);

- async->pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
- GFP_KERNEL | __GFP_ZERO);
+ async->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!async->pdu)
return -ENOMEM;

@@ -1300,7 +1296,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)

static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
{
- struct page *page;
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
unsigned int noreclaim_flag;
@@ -1311,12 +1306,6 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
if (queue->hdr_digest || queue->data_digest)
nvme_tcp_free_crypto(queue);

- if (queue->pf_cache.va) {
- page = virt_to_head_page(queue->pf_cache.va);
- __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
- queue->pf_cache.va = NULL;
- }
-
noreclaim_flag = memalloc_noreclaim_save();
sock_release(queue->sock);
memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 66e8f9fd0ca7..d6cc557cc539 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -143,8 +143,6 @@ struct nvmet_tcp_queue {

struct nvmet_tcp_cmd connect;

- struct page_frag_cache pf_cache;
-
void (*data_ready)(struct sock *);
void (*state_change)(struct sock *);
void (*write_space)(struct sock *);
@@ -1312,25 +1310,25 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
c->queue = queue;
c->req.port = queue->port->nport;

- c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+ c->cmd_pdu = page_frag_alloc(NULL, sizeof(*c->cmd_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!c->cmd_pdu)
return -ENOMEM;
c->req.cmd = &c->cmd_pdu->cmd;

- c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+ c->rsp_pdu = page_frag_alloc(NULL, sizeof(*c->rsp_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!c->rsp_pdu)
goto out_free_cmd;
c->req.cqe = &c->rsp_pdu->cqe;

- c->data_pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+ c->data_pdu = page_frag_alloc(NULL, sizeof(*c->data_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!c->data_pdu)
goto out_free_rsp;

- c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
- sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+ c->r2t_pdu = page_frag_alloc(NULL, sizeof(*c->r2t_pdu) + hdgst,
+ GFP_KERNEL | __GFP_ZERO);
if (!c->r2t_pdu)
goto out_free_data;

@@ -1438,7 +1436,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)

static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
- struct page *page;
struct nvmet_tcp_queue *queue =
container_of(w, struct nvmet_tcp_queue, release_work);

@@ -1460,9 +1457,6 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
if (queue->hdr_digest || queue->data_digest)
nvmet_tcp_free_crypto(queue);
ida_free(&nvmet_tcp_queue_ida, queue->idx);
-
- page = virt_to_head_page(queue->pf_cache.va);
- __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
kfree(queue);
}

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..b208ca315882 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -304,14 +304,17 @@ extern void free_pages(unsigned long addr, unsigned int order);

struct page_frag_cache;
extern void __page_frag_cache_drain(struct page *page, unsigned int count);
-extern void *page_frag_alloc_align(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align_mask);
-
-static inline void *page_frag_alloc(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask)
+extern void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+ size_t fragsz, gfp_t gfp,
+ unsigned long align_mask);
+extern void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+ const void *p, size_t fragsz, gfp_t gfp,
+ unsigned long align_mask);
+
+static inline void *page_frag_alloc(struct page_frag_cache __percpu *frag_cache,
+ size_t fragsz, gfp_t gfp)
{
- return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+ return page_frag_alloc_align(frag_cache, fragsz, gfp, ULONG_MAX);
}

extern void page_frag_free(void *addr);
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index c3792b68ce32..7844398afe26 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,25 +16,23 @@
#include <linux/init.h>
#include <linux/mm.h>

+static DEFINE_PER_CPU(struct page_frag_cache, page_frag_default_allocator);
+
/*
* Allocate a new folio for the frag cache.
*/
-static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
- gfp_t gfp_mask)
+static struct folio *page_frag_cache_refill(gfp_t gfp)
{
- struct folio *folio = NULL;
- gfp_t gfp = gfp_mask;
+ struct folio *folio;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
- folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
+ folio = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
+ PAGE_FRAG_CACHE_MAX_ORDER);
+ if (folio)
+ return folio;
#endif
- if (unlikely(!folio))
- folio = folio_alloc(gfp, 0);

- if (folio)
- nc->folio = folio;
- return folio;
+ return folio_alloc(gfp, 0);
}

void __page_frag_cache_drain(struct page *page, unsigned int count)
@@ -47,41 +45,68 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
}
EXPORT_SYMBOL(__page_frag_cache_drain);

-void *page_frag_alloc_align(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align_mask)
+/**
+ * page_frag_alloc_align - Allocate some memory for use in zerocopy
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The size of the fragment desired
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data. Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and fragments are broken off
+ * sequentially and returned to the caller with a ref until the folio no longer
+ * has enough spare space - at which point the allocator's ref is dropped and a
+ * new folio is allocated. The folio remains in existence until the last ref
+ * held by, say, an sk_buff is discarded and then the page is returned to the
+ * page allocator.
+ *
+ * Returns a pointer to the memory on success and NULL on allocation
+ * failure.
+ *
+ * The allocated memory should be disposed of with folio_put().
+ */
+void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+ size_t fragsz, gfp_t gfp, unsigned long align_mask)
{
- struct folio *folio = nc->folio;
+ struct page_frag_cache *nc;
+ struct folio *folio, *spare = NULL;
size_t offset;
+ void *p;

- if (unlikely(!folio)) {
-refill:
- folio = page_frag_cache_refill(nc, gfp_mask);
- if (!folio)
- return NULL;
-
- /* Even if we own the page, we do not use atomic_set().
- * This would break get_page_unless_zero() users.
- */
- folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+ if (!frag_cache)
+ frag_cache = &page_frag_default_allocator;
+ if (WARN_ON_ONCE(fragsz == 0))
+ fragsz = 1;
+ align_mask &= ~3UL;

- /* reset page count bias and offset to start of new frag */
- nc->pfmemalloc = folio_is_pfmemalloc(folio);
- nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = folio_size(folio);
+ nc = get_cpu_ptr(frag_cache);
+reload:
+ folio = nc->folio;
+ offset = nc->offset;
+try_again:
+
+ /* Make the allocation if there's sufficient space. */
+ if (fragsz <= offset) {
+ nc->pagecnt_bias--;
+ offset = (offset - fragsz) & align_mask;
+ nc->offset = offset;
+ p = folio_address(folio) + offset;
+ put_cpu_ptr(frag_cache);
+ if (spare)
+ folio_put(spare);
+ return p;
}

- offset = nc->offset;
- if (unlikely(fragsz > offset)) {
- /* Reuse the folio if everyone we gave it to has finished with it. */
- if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
- nc->folio = NULL;
+ /* Insufficient space - see if we can refurbish the current folio. */
+ if (folio) {
+ if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias))
goto refill;
- }

if (unlikely(nc->pfmemalloc)) {
__folio_put(folio);
- nc->folio = NULL;
goto refill;
}

@@ -91,27 +116,56 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
offset = folio_size(folio);
- if (unlikely(fragsz > offset)) {
- /*
- * The caller is trying to allocate a fragment
- * with fragsz > PAGE_SIZE but the cache isn't big
- * enough to satisfy the request, this may
- * happen in low memory conditions.
- * We don't release the cache page because
- * it could make memory pressure worse
- * so we simply return NULL here.
- */
- nc->offset = offset;
+ if (unlikely(fragsz > offset))
+ goto frag_too_big;
+ goto try_again;
+ }
+
+refill:
+ if (!spare) {
+ nc->folio = NULL;
+ put_cpu_ptr(frag_cache);
+
+ spare = page_frag_cache_refill(gfp);
+ if (!spare)
return NULL;
- }
+
+ nc = get_cpu_ptr(frag_cache);
+ /* We may now be on a different cpu and/or someone else may
+ * have refilled it
+ */
+ nc->pfmemalloc = folio_is_pfmemalloc(spare);
+ if (nc->folio)
+ goto reload;
}

- nc->pagecnt_bias--;
- offset -= fragsz;
- offset &= align_mask;
+ nc->folio = spare;
+ folio = spare;
+ spare = NULL;
+
+ /* Even if we own the page, we do not use atomic_set(). This would
+ * break get_page_unless_zero() users.
+ */
+ folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+ /* Reset page count bias and offset to start of new frag */
+ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = folio_size(folio);
+ goto try_again;
+
+frag_too_big:
+ /*
+ * The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+ * but the cache isn't big enough to satisfy the request, this may
+ * happen in low memory conditions. We don't release the cache page
+ * because it could make memory pressure worse so we simply return NULL
+ * here.
+ */
nc->offset = offset;
-
- return folio_address(folio) + offset;
+ put_cpu_ptr(frag_cache);
+ if (spare)
+ folio_put(spare);
+ return NULL;
}
EXPORT_SYMBOL(page_frag_alloc_align);

@@ -123,3 +177,25 @@ void page_frag_free(void *addr)
folio_put(virt_to_folio(addr));
}
EXPORT_SYMBOL(page_frag_free);
+
+/**
+ * page_frag_memdup - Allocate a page fragment and duplicate some data into it
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @p: The source data to copy
+ * @fragsz: The amount of memory to copy (maximum 1/2 page).
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ */
+void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+ const void *p, size_t fragsz, gfp_t gfp,
+ unsigned long align_mask)
+{
+ void *q;
+
+ q = page_frag_alloc_align(frag_cache, fragsz, gfp, align_mask);
+ if (!q)
+ return q;
+
+ return memcpy(q, p, fragsz);
+}
+EXPORT_SYMBOL(page_frag_memdup);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb7d33b41e71..0506e4cf1ed9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -222,13 +222,13 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
#endif

struct napi_alloc_cache {
- struct page_frag_cache page;
struct page_frag_1k page_small;
unsigned int skb_count;
void *skb_cache[NAPI_SKB_CACHE_SIZE];
};

static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct page_frag_cache, napi_frag_cache);
static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

/* Double check that napi_get_frags() allocates skbs with
@@ -250,11 +250,9 @@ void napi_get_frags_check(struct napi_struct *napi)

void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
{
- struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
fragsz = SKB_DATA_ALIGN(fragsz);

- return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+ return page_frag_alloc_align(&napi_frag_cache, fragsz, GFP_ATOMIC, align_mask);
}
EXPORT_SYMBOL(__napi_alloc_frag_align);

@@ -264,15 +262,12 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)

fragsz = SKB_DATA_ALIGN(fragsz);
if (in_hardirq() || irqs_disabled()) {
- struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
-
- data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+ data = page_frag_alloc_align(&netdev_alloc_cache,
+ fragsz, GFP_ATOMIC, align_mask);
} else {
- struct napi_alloc_cache *nc;
-
local_bh_disable();
- nc = this_cpu_ptr(&napi_alloc_cache);
- data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+ data = page_frag_alloc_align(&napi_frag_cache,
+ fragsz, GFP_ATOMIC, align_mask);
local_bh_enable();
}
return data;
@@ -656,7 +651,6 @@ EXPORT_SYMBOL(__alloc_skb);
struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
gfp_t gfp_mask)
{
- struct page_frag_cache *nc;
struct sk_buff *skb;
bool pfmemalloc;
void *data;
@@ -681,14 +675,12 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
gfp_mask |= __GFP_MEMALLOC;

if (in_hardirq() || irqs_disabled()) {
- nc = this_cpu_ptr(&netdev_alloc_cache);
- data = page_frag_alloc(nc, len, gfp_mask);
- pfmemalloc = nc->pfmemalloc;
+ data = page_frag_alloc(&netdev_alloc_cache, len, gfp_mask);
+ pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
} else {
local_bh_disable();
- nc = this_cpu_ptr(&napi_alloc_cache.page);
- data = page_frag_alloc(nc, len, gfp_mask);
- pfmemalloc = nc->pfmemalloc;
+ data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+ pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
local_bh_enable();
}

@@ -776,8 +768,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
} else {
len = SKB_HEAD_ALIGN(len);

- data = page_frag_alloc(&nc->page, len, gfp_mask);
- pfmemalloc = nc->page.pfmemalloc;
+ data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+ pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
}

if (unlikely(!data))
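
For readers following the allocator change above, here is a minimal userspace sketch of the carving arithmetic only: the offset starts at the end of the backing buffer and moves down, with the fragment start rounded down by align_mask. The struct and function names (frag_model, frag_model_init, frag_model_alloc) are hypothetical and not part of the patch; the refcounting, per-cpu handling and refill path are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the cache state; not the kernel struct. */
struct frag_model {
	size_t size;	/* size of the backing buffer (folio_size() in the patch) */
	size_t offset;	/* lowest offset handed out so far; starts at size */
};

static void frag_model_init(struct frag_model *m, size_t size)
{
	m->size = size;
	m->offset = size;
}

/*
 * Carve fragsz bytes off the top of the remaining space, rounding the start
 * of the fragment down with align_mask.  Returns 0 and stores the fragment
 * offset on success, or -1 if the remaining space is too small (the point at
 * which the patch would try to reuse or replace the folio).
 */
static int frag_model_alloc(struct frag_model *m, size_t fragsz,
			    unsigned long align_mask, size_t *frag_offset)
{
	assert(fragsz > 0);	/* the patch warns and bumps a zero size to 1 */

	if (fragsz > m->offset)
		return -1;

	m->offset = (m->offset - fragsz) & align_mask;
	*frag_offset = m->offset;
	return 0;
}
```

With a 4096-byte buffer and a 64-byte alignment mask (~63UL), two 100-byte requests land at offsets 3968 and 3840. A request larger than the remaining space fails without disturbing the offset.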

2023-03-31 16:13:05

by David Howells

Subject: [PATCH v3 04/55] mm: Move the page fragment allocator from page_alloc.c into its own file

Move the page fragment allocator from page_alloc.c into its own file
preparatory to changing it.

Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
mm/Makefile | 2 +-
mm/page_alloc.c | 126 -----------------------------------------
mm/page_frag_alloc.c | 131 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 132 insertions(+), 127 deletions(-)
create mode 100644 mm/page_frag_alloc.c

diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..4e6dc12b4cbd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
- compaction.o \
+ compaction.o page_frag_alloc.o \
interval_tree.o list_lru.o workingset.o \
debug.o gup.o mmap_lock.o $(mmu-y)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac1fc986af44..c08847308907 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5694,132 +5694,6 @@ void free_pages(unsigned long addr, unsigned int order)

EXPORT_SYMBOL(free_pages);

-/*
- * Page Fragment:
- * An arbitrary-length arbitrary-offset area of memory which resides
- * within a 0 or higher order page. Multiple fragments within that page
- * are individually refcounted, in the page's reference counter.
- *
- * The page_frag functions below provide a simple allocation framework for
- * page fragments. This is used by the network stack and network device
- * drivers to provide a backing region of memory for use as either an
- * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
- */
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
- gfp_t gfp_mask)
-{
- struct page *page = NULL;
- gfp_t gfp = gfp_mask;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
- __GFP_NOMEMALLOC;
- page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- PAGE_FRAG_CACHE_MAX_ORDER);
- nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-#endif
- if (unlikely(!page))
- page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
-
- nc->va = page ? page_address(page) : NULL;
-
- return page;
-}
-
-void __page_frag_cache_drain(struct page *page, unsigned int count)
-{
- VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
- if (page_ref_sub_and_test(page, count))
- free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(__page_frag_cache_drain);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align_mask)
-{
- unsigned int size = PAGE_SIZE;
- struct page *page;
- int offset;
-
- if (unlikely(!nc->va)) {
-refill:
- page = __page_frag_cache_refill(nc, gfp_mask);
- if (!page)
- return NULL;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
- /* Even if we own the page, we do not use atomic_set().
- * This would break get_page_unless_zero() users.
- */
- page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
-
- /* reset page count bias and offset to start of new frag */
- nc->pfmemalloc = page_is_pfmemalloc(page);
- nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = size;
- }
-
- offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
- page = virt_to_page(nc->va);
-
- if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
- goto refill;
-
- if (unlikely(nc->pfmemalloc)) {
- free_the_page(page, compound_order(page));
- goto refill;
- }
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
- /* OK, page count is 0, we can safely set it */
- set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
-
- /* reset page count bias and offset to start of new frag */
- nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = size - fragsz;
- if (unlikely(offset < 0)) {
- /*
- * The caller is trying to allocate a fragment
- * with fragsz > PAGE_SIZE but the cache isn't big
- * enough to satisfy the request, this may
- * happen in low memory conditions.
- * We don't release the cache page because
- * it could make memory pressure worse
- * so we simply return NULL here.
- */
- return NULL;
- }
- }
-
- nc->pagecnt_bias--;
- offset &= align_mask;
- nc->offset = offset;
-
- return nc->va + offset;
-}
-EXPORT_SYMBOL(page_frag_alloc_align);
-
-/*
- * Frees a page fragment allocated out of either a compound or order 0 page.
- */
-void page_frag_free(void *addr)
-{
- struct page *page = virt_to_head_page(addr);
-
- if (unlikely(put_page_testzero(page)))
- free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(page_frag_free);
-
static void *make_alloc_exact(unsigned long addr, unsigned int order,
size_t size)
{
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
new file mode 100644
index 000000000000..bee95824ef8f
--- /dev/null
+++ b/mm/page_frag_alloc.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Page fragment allocator
+ *
+ * Page Fragment:
+ * An arbitrary-length arbitrary-offset area of memory which resides within a
+ * 0 or higher order page. Multiple fragments within that page are
+ * individually refcounted, in the page's reference counter.
+ *
+ * The page_frag functions provide a simple allocation framework for page
+ * fragments. This is used by the network stack and network device drivers to
+ * provide a backing region of memory for use as either an sk_buff->head, or to
+ * be used in the "frags" portion of skb_shared_info.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+ gfp_t gfp_mask)
+{
+ struct page *page = NULL;
+ gfp_t gfp = gfp_mask;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
+ __GFP_NOMEMALLOC;
+ page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
+ PAGE_FRAG_CACHE_MAX_ORDER);
+ nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+#endif
+ if (unlikely(!page))
+ page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+ nc->va = page ? page_address(page) : NULL;
+
+ return page;
+}
+
+void __page_frag_cache_drain(struct page *page, unsigned int count)
+{
+ VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+ if (page_ref_sub_and_test(page, count - 1))
+ __free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(__page_frag_cache_drain);
+
+void *page_frag_alloc_align(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask,
+ unsigned int align_mask)
+{
+ unsigned int size = PAGE_SIZE;
+ struct page *page;
+ int offset;
+
+ if (unlikely(!nc->va)) {
+refill:
+ page = __page_frag_cache_refill(nc, gfp_mask);
+ if (!page)
+ return NULL;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ /* if size can vary use size else just use PAGE_SIZE */
+ size = nc->size;
+#endif
+ /* Even if we own the page, we do not use atomic_set().
+ * This would break get_page_unless_zero() users.
+ */
+ page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+
+ /* reset page count bias and offset to start of new frag */
+ nc->pfmemalloc = page_is_pfmemalloc(page);
+ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ nc->offset = size;
+ }
+
+ offset = nc->offset - fragsz;
+ if (unlikely(offset < 0)) {
+ page = virt_to_page(nc->va);
+
+ if (page_ref_count(page) != nc->pagecnt_bias)
+ goto refill;
+ if (unlikely(nc->pfmemalloc)) {
+ page_ref_sub(page, nc->pagecnt_bias - 1);
+ __free_pages(page, compound_order(page));
+ goto refill;
+ }
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ /* if size can vary use size else just use PAGE_SIZE */
+ size = nc->size;
+#endif
+ /* OK, page count is 0, we can safely set it */
+ set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+ /* reset page count bias and offset to start of new frag */
+ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+ offset = size - fragsz;
+ if (unlikely(offset < 0)) {
+ /*
+ * The caller is trying to allocate a fragment
+ * with fragsz > PAGE_SIZE but the cache isn't big
+ * enough to satisfy the request, this may
+ * happen in low memory conditions.
+ * We don't release the cache page because
+ * it could make memory pressure worse
+ * so we simply return NULL here.
+ */
+ return NULL;
+ }
+ }
+
+ nc->pagecnt_bias--;
+ offset &= align_mask;
+ nc->offset = offset;
+
+ return nc->va + offset;
+}
+EXPORT_SYMBOL(page_frag_alloc_align);
+
+/*
+ * Frees a page fragment allocated out of either a compound or order 0 page.
+ */
+void page_frag_free(void *addr)
+{
+ struct page *page = virt_to_head_page(addr);
+
+ __free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(page_frag_free);
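
The pagecnt_bias bookkeeping in the moved code can be hard to follow from the diff alone, so here is a small userspace model of it, written under stated assumptions. MODEL_BIAS_MAX is an illustrative stand-in for PAGE_FRAG_CACHE_MAX_SIZE, and the bias_model_* names are hypothetical. The model only shows why "refcount equals bias" means every fragment handed out has been released.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for PAGE_FRAG_CACHE_MAX_SIZE; value is an assumption. */
#define MODEL_BIAS_MAX 32768L

struct bias_model {
	long refcount;		/* models the page's reference count */
	long pagecnt_bias;	/* references still owned by the cache */
};

/* A fresh page has one reference; the cache takes a large batch up front. */
static void bias_model_refill(struct bias_model *m)
{
	m->refcount = 1 + MODEL_BIAS_MAX;
	m->pagecnt_bias = MODEL_BIAS_MAX + 1;
}

/* Handing out a fragment transfers one pre-taken reference to the caller. */
static void bias_model_alloc(struct bias_model *m)
{
	assert(m->pagecnt_bias > 0);
	m->pagecnt_bias--;
}

/* A consumer releasing its fragment drops one real reference. */
static void bias_model_free(struct bias_model *m)
{
	assert(m->refcount > 0);
	m->refcount--;
}

/* The page can be recycled only if every remaining ref belongs to the cache. */
static bool bias_model_can_reuse(const struct bias_model *m)
{
	return m->refcount == m->pagecnt_bias;
}
```

The point of the batch is that allocation touches only the cache-local bias rather than the shared atomic count. The real count is consulted only when the page runs out of space.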

2023-03-31 16:13:06

by David Howells

Subject: [PATCH v3 12/55] tls: Inline do_tcp_sendpages()

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/tls.h | 2 +-
net/tls/tls_main.c | 24 +++++++++++++++---------
2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 154949c7b0c8..d31521c36a84 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -256,7 +256,7 @@ struct tls_context {
struct scatterlist *partially_sent_record;
u16 partially_sent_offset;

- bool in_tcp_sendpages;
+ bool splicing_pages;
bool pending_open_record_frags;

struct mutex tx_lock; /* protects partially_sent_* fields and
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 3735cb00905d..35b2f7ee2fa3 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -124,7 +124,10 @@ int tls_push_sg(struct sock *sk,
u16 first_offset,
int flags)
{
- int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SENDPAGE_NOTLAST | MSG_SPLICE_PAGES | flags,
+ };
int ret = 0;
struct page *p;
size_t size;
@@ -133,16 +136,19 @@ int tls_push_sg(struct sock *sk,
size = sg->length - offset;
offset += sg->offset;

- ctx->in_tcp_sendpages = true;
+ ctx->splicing_pages = true;
while (1) {
if (sg_is_last(sg))
- sendpage_flags = flags;
+ msg.msg_flags = flags;

/* is sending application-limited? */
tcp_rate_check_app_limited(sk);
p = sg_page(sg);
retry:
- ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+ bvec_set_page(&bvec, p, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+ ret = tcp_sendmsg_locked(sk, &msg, size);

if (ret != size) {
if (ret > 0) {
@@ -154,7 +160,7 @@ int tls_push_sg(struct sock *sk,
offset -= sg->offset;
ctx->partially_sent_offset = offset;
ctx->partially_sent_record = (void *)sg;
- ctx->in_tcp_sendpages = false;
+ ctx->splicing_pages = false;
return ret;
}

@@ -168,7 +174,7 @@ int tls_push_sg(struct sock *sk,
size = sg->length;
}

- ctx->in_tcp_sendpages = false;
+ ctx->splicing_pages = false;

return 0;
}
@@ -246,11 +252,11 @@ static void tls_write_space(struct sock *sk)
{
struct tls_context *ctx = tls_get_ctx(sk);

- /* If in_tcp_sendpages call lower protocol write space handler
+ /* If splicing_pages call lower protocol write space handler
* to ensure we wake up any waiting operations there. For example
- * if do_tcp_sendpages where to call sk_wait_event.
+ * if splicing pages were to call sk_wait_event.
*/
- if (ctx->in_tcp_sendpages) {
+ if (ctx->splicing_pages) {
ctx->sk_write_space(sk);
return;
}
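
The short-send handling in tls_push_sg() (advance on a partial send, record the resume point on failure) can be sketched in userspace as below. This is a model under assumptions, not the kernel code: send_fn, push_segment and the capped_sink test double are hypothetical names, and -11 merely stands in for -EAGAIN.

```c
#include <stddef.h>

/* Hypothetical transport: returns bytes accepted (> 0) or a negative error. */
typedef long (*send_fn)(void *ctx, size_t offset, size_t len);

/*
 * Push one segment of size bytes starting at offset.  On a short send,
 * advance and retry; on failure, record where to resume and return the
 * error.  Returns 0 once the whole segment has been accepted.
 */
static long push_segment(send_fn send, void *ctx, size_t offset, size_t size,
			 size_t *resume_offset)
{
	while (size > 0) {
		long ret = send(ctx, offset, size);

		if (ret <= 0) {
			*resume_offset = offset;
			/* Treat a zero-byte send as an error to avoid spinning. */
			return ret ? ret : -1;
		}
		offset += (size_t)ret;
		size -= (size_t)ret;
	}
	return 0;
}

/* Test double: accepts at most cap bytes per call until budget runs out. */
struct capped_sink {
	size_t cap;
	size_t budget;
	unsigned int calls;
};

static long capped_send(void *ctx, size_t offset, size_t len)
{
	struct capped_sink *s = ctx;
	size_t n = len < s->cap ? len : s->cap;

	(void)offset;
	s->calls++;
	if (s->budget == 0)
		return -11;	/* stands in for -EAGAIN */
	if (n > s->budget)
		n = s->budget;
	s->budget -= n;
	return (long)n;
}
```

Sending 8 bytes through a sink capped at 3 bytes per call takes three calls. If the sink runs dry after 6 bytes of a segment that began at offset 10, the recorded resume offset is 16.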

2023-03-31 16:13:30

by David Howells

Subject: [PATCH v3 13/55] siw: Inline do_tcp_sendpages()

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 05052b49107f..fa5de40d85d5 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
}

/*
- * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
*
* Using sendpage to push page by page appears to be less efficient
* than using sendmsg, even if data are copied.
@@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
size_t size)
{
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
+ MSG_SPLICE_PAGES),
+ };
struct sock *sk = s->sk;
- int i = 0, rv = 0, sent = 0,
- flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
+ int i = 0, rv = 0, sent = 0;

while (size) {
size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);

if (size + offset <= PAGE_SIZE)
- flags = MSG_MORE | MSG_DONTWAIT;
+ msg.msg_flags = MSG_MORE | MSG_DONTWAIT;

tcp_rate_check_app_limited(sk);
+ bvec_set_page(&bvec, page[i], bytes, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, bytes);
+
try_page_again:
lock_sock(sk);
- rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+ rv = tcp_sendmsg_locked(sk, &msg, bytes);
release_sock(sk);

if (rv > 0) {
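
As a reading aid for the loop in siw_tcp_sendpages(), the per-page chunking can be modelled in userspace as follows. MODEL_PAGE_SIZE is an illustrative constant, and chunk_in_page(), is_last_chunk() and count_chunks() are hypothetical helpers. The model assumes that only the first page has a non-zero starting offset.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* illustrative; the kernel uses PAGE_SIZE */

/*
 * Given the offset into the current page and the bytes still to send,
 * return how many bytes fit in this page (mirrors the min_t() in the loop).
 */
static size_t chunk_in_page(size_t offset, size_t remaining)
{
	size_t room;

	assert(offset < MODEL_PAGE_SIZE);
	room = MODEL_PAGE_SIZE - offset;
	return remaining < room ? remaining : room;
}

/* True when the rest of the transfer fits in the current page. */
static int is_last_chunk(size_t offset, size_t remaining)
{
	return offset + remaining <= MODEL_PAGE_SIZE;
}

/* Walk a transfer, count the chunks and record the final chunk's length. */
static unsigned int count_chunks(size_t offset, size_t size, size_t *last_len)
{
	unsigned int n = 0;

	while (size) {
		size_t bytes = chunk_in_page(offset, size);

		*last_len = bytes;
		size -= bytes;
		offset = 0;	/* later pages start at their beginning */
		n++;
	}
	return n;
}
```

A 10000-byte transfer starting 100 bytes into the first page splits into chunks of 3996, 4096 and 1908 bytes. Each send should describe only the current chunk.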

2023-03-31 16:13:34

by David Howells

Subject: [PATCH v3 11/55] espintcp: Inline do_tcp_sendpages()

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <[email protected]>
cc: Steffen Klassert <[email protected]>
cc: Herbert Xu <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/xfrm/espintcp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 872b80188e83..3504925babdb 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,14 +205,16 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
static int espintcp_sendskmsg_locked(struct sock *sk,
struct espintcp_msg *emsg, int flags)
{
+ struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct sk_msg *skmsg = &emsg->skmsg;
struct scatterlist *sg;
int done = 0;
int ret;

- flags |= MSG_SENDPAGE_NOTLAST;
+ msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
sg = &skmsg->sg.data[skmsg->sg.start];
do {
+ struct bio_vec bvec;
size_t size = sg->length - emsg->offset;
int offset = sg->offset + emsg->offset;
struct page *p;
@@ -220,11 +222,13 @@ static int espintcp_sendskmsg_locked(struct sock *sk,
emsg->offset = 0;

if (sg_is_last(sg))
- flags &= ~MSG_SENDPAGE_NOTLAST;
+ msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;

p = sg_page(sg);
retry:
- ret = do_tcp_sendpages(sk, p, offset, size, flags);
+ bvec_set_page(&bvec, p, size, offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ ret = tcp_sendmsg_locked(sk, &msghdr, size);
if (ret < 0) {
emsg->offset = offset - sg->offset;
skmsg->sg.start += done;
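
The flag handling in espintcp_sendskmsg_locked() boils down to: always request splicing, and mark every element except the last as "more to come". A minimal sketch of that rule follows. The EX_MSG_* bit values are illustrative placeholders rather than the kernel's definitions, and flags_for_element() is a hypothetical helper.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real ones live in include/linux/socket.h. */
#define EX_MSG_MORE             0x0001
#define EX_MSG_SENDPAGE_NOTLAST 0x0002
#define EX_MSG_SPLICE_PAGES     0x0004

/*
 * Compute the flags for one scatterlist element: splice is always requested,
 * and NOTLAST is cleared only for the final element.
 */
static int flags_for_element(int base_flags, bool is_last)
{
	int flags = base_flags | EX_MSG_SPLICE_PAGES | EX_MSG_SENDPAGE_NOTLAST;

	if (is_last)
		flags &= ~EX_MSG_SENDPAGE_NOTLAST;
	return flags;
}
```

Clearing a single bit, rather than reassigning the whole flag word, keeps the splice request in place for the final element.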

2023-03-31 16:13:34

by David Howells

Subject: [PATCH v3 14/55] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()

Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().

Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/tcp.h | 2 --
net/ipv4/tcp.c | 21 +++++++--------------
2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..844bc8e6a714 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -333,8 +333,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
int size_goal);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index edcf3a60c1b0..a8f8ccaed10e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,12 +971,17 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}

-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
+ size_t size, int flags)
{
struct bio_vec bvec;
struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

+ if (!(sk->sk_route_caps & NETIF_F_SG))
+ return sock_no_sendpage_locked(sk, page, offset, size, flags);
+
+ tcp_rate_check_app_limited(sk); /* is sending application-limited? */
+
bvec_set_page(&bvec, page, size, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

@@ -985,18 +990,6 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,

return tcp_sendmsg_locked(sk, &msg, size);
}
-EXPORT_SYMBOL_GPL(do_tcp_sendpages);
-
-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- if (!(sk->sk_route_caps & NETIF_F_SG))
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
- tcp_rate_check_app_limited(sk); /* is sending application-limited? */
-
- return do_tcp_sendpages(sk, page, offset, size, flags);
-}
EXPORT_SYMBOL_GPL(tcp_sendpage_locked);

int tcp_sendpage(struct sock *sk, struct page *page, int offset,

2023-03-31 16:13:40

by David Howells

Subject: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/ip_output.c | 102 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4e4e308c3230..e2eaba817c1f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -956,6 +956,79 @@ csum_page(struct page *page, int offset, int copy)
return csum;
}

+/*
+ * Allocate a packet for MSG_SPLICE_PAGES.
+ */
+static int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
+ unsigned int fragheaderlen, unsigned int maxfraglen,
+ unsigned int hh_len)
+{
+ struct sk_buff *skb_prev = *pskb, *skb;
+ unsigned int fraggap = skb_prev->len - maxfraglen;
+ unsigned int alloclen = fragheaderlen + hh_len + fraggap + 15;
+
+ skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
+ if (unlikely(!skb))
+ return -ENOBUFS;
+
+ /* Fill in the control structures */
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->csum = 0;
+ skb_reserve(skb, hh_len);
+
+ /* Find where to start putting bytes. */
+ skb_put(skb, fragheaderlen + fraggap);
+ skb_reset_network_header(skb);
+ skb->transport_header = skb->network_header + fragheaderlen;
+ if (fraggap) {
+ skb->csum = skb_copy_and_csum_bits(skb_prev, maxfraglen,
+ skb_transport_header(skb),
+ fraggap);
+ skb_prev->csum = csum_sub(skb_prev->csum, skb->csum);
+ pskb_trim_unique(skb_prev, maxfraglen);
+ }
+
+ /* Put the packet on the pending queue. */
+ __skb_queue_tail(&sk->sk_write_queue, skb);
+ *pskb = skb;
+ return 0;
+}
+
+/*
+ * Add (or copy) data pages for MSG_SPLICE_PAGES.
+ */
+static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
+ void *from, int *pcopy)
+{
+ struct msghdr *msg = from;
+ struct page *page = NULL, **pages = &page;
+ ssize_t copy = *pcopy;
+ size_t off;
+ int err;
+
+ copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
+ if (copy <= 0)
+ return copy ?: -EIO;
+
+ err = skb_append_pagefrags(skb, page, off, copy);
+ if (err < 0) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ return err;
+ }
+
+ if (skb->ip_summed == CHECKSUM_NONE) {
+ __wsum csum;
+
+ csum = csum_page(page, off, copy);
+ skb->csum = csum_block_add(skb->csum, csum, skb->len);
+ }
+
+ skb_len_add(skb, copy);
+ refcount_add(copy, &sk->sk_wmem_alloc);
+ *pcopy = copy;
+ return 0;
+}
+
static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
struct sk_buff_head *queue,
@@ -977,7 +1050,7 @@ static int __ip_append_data(struct sock *sk,
int err;
int offset = 0;
bool zc = false;
- unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
+ unsigned int maxfraglen, fragheaderlen, maxnonfragsize, initial_length;
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
@@ -1017,6 +1090,7 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;

+ initial_length = length;
if ((flags & MSG_ZEROCOPY) && length) {
struct msghdr *msg = from;

@@ -1047,6 +1121,14 @@ static int __ip_append_data(struct sock *sk,
skb_zcopy_set(skb, uarg, &extra_uref);
}
}
+ } else if ((flags & MSG_SPLICE_PAGES) && length) {
+ if (inet->hdrincl)
+ return -EPERM;
+ if (rt->dst.dev->features & NETIF_F_SG)
+ /* We need an empty buffer to attach stuff to */
+ initial_length = transhdrlen;
+ else
+ flags &= ~MSG_SPLICE_PAGES;
}

cork->length += length;
@@ -1074,6 +1156,16 @@ static int __ip_append_data(struct sock *sk,
unsigned int alloclen, alloc_extra;
unsigned int pagedlen;
struct sk_buff *skb_prev;
+
+ if (unlikely(flags & MSG_SPLICE_PAGES)) {
+ err = __ip_splice_alloc(sk, &skb, fragheaderlen,
+ maxfraglen, hh_len);
+ if (err < 0)
+ goto error;
+ continue;
+ }
+ initial_length = length;
+
alloc_new_skb:
skb_prev = skb;
if (skb_prev)
@@ -1085,7 +1177,7 @@ static int __ip_append_data(struct sock *sk,
* If remaining data exceeds the mtu,
* we know we need more fragment(s).
*/
- datalen = length + fraggap;
+ datalen = initial_length + fraggap;
if (datalen > mtu - fragheaderlen)
datalen = maxfraglen - fragheaderlen;
fraglen = datalen + fragheaderlen;
@@ -1099,7 +1191,7 @@ static int __ip_append_data(struct sock *sk,
* because we have no idea what fragment will be
* the last.
*/
- if (datalen == length + fraggap)
+ if (datalen == initial_length + fraggap)
alloc_extra += rt->dst.trailer_len;

if ((flags & MSG_MORE) &&
@@ -1206,6 +1298,10 @@ static int __ip_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
+ } else if (flags & MSG_SPLICE_PAGES) {
+ err = __ip_splice_pages(sk, skb, from, &copy);
+ if (err < 0)
+ goto error;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;


2023-03-31 16:13:49

by David Howells

Subject: [PATCH v3 10/55] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it. This is part of replacing ->sendpage() with a call to
sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ipv4/tcp_bpf.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cf26d65ca389..7f17134637eb 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -72,11 +72,13 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
{
bool apply = apply_bytes;
struct scatterlist *sge;
+ struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
struct page *page;
int size, ret = 0;
u32 off;

while (1) {
+ struct bio_vec bvec;
bool has_tx_ulp;

sge = sk_msg_elem(msg, msg->sg.start);
@@ -88,16 +90,18 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
tcp_rate_check_app_limited(sk);
retry:
has_tx_ulp = tls_sw_has_ctx_tx(sk);
- if (has_tx_ulp) {
- flags |= MSG_SENDPAGE_NOPOLICY;
- ret = kernel_sendpage_locked(sk,
- page, off, size, flags);
- } else {
- ret = do_tcp_sendpages(sk, page, off, size, flags);
- }
+ if (has_tx_ulp)
+ msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;

+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msghdr.msg_flags |= MSG_MORE;
+
+ bvec_set_page(&bvec, page, size, off);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ ret = tcp_sendmsg_locked(sk, &msghdr, size);
if (ret <= 0)
return ret;
+
if (apply)
apply_bytes -= ret;
msg->sg.size -= ret;
@@ -398,7 +402,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
long timeo;
int flags;

- /* Don't let internal do_tcp_sendpages() flags through */
+ /* Don't let internal sendpage flags through */
flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
flags |= MSG_NO_SHARED_FRAGS;


2023-03-31 16:13:53

by David Howells

Subject: [PATCH v3 17/55] ip6, udp6: Support MSG_SPLICE_PAGES

Make IP6/UDP6 sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible, copying the data if not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
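
The setup step this patch adds to __ip6_append_data() can be sketched in
userspace as below. This is only a model under stated assumptions:
demo_splice_setup(), struct demo_append_state and DEMO_MSG_SPLICE_PAGES are
hypothetical names, the flag value is made up, and the MSG_ZEROCOPY branch
that precedes this check in the real function is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's MSG_SPLICE_PAGES bit. */
#define DEMO_MSG_SPLICE_PAGES 0x1u

/* The two pieces of state the setup step may change. */
struct demo_append_state {
	unsigned int flags;
	unsigned int initial_length;
};

/*
 * Decide whether splicing stays enabled for this send.
 * Returns 0 on success or -EPERM for header-included sockets.
 */
static int demo_splice_setup(struct demo_append_state *st,
			     unsigned int length, unsigned int transhdrlen,
			     bool hdrincl, bool dev_has_sg)
{
	assert(st != NULL);

	st->initial_length = length;
	if ((st->flags & DEMO_MSG_SPLICE_PAGES) && length) {
		if (hdrincl)
			return -EPERM;
		if (dev_has_sg)
			/* Only the transport header goes in the linear part. */
			st->initial_length = transhdrlen;
		else
			/* No scatter-gather: fall back to plain copying. */
			st->flags &= ~DEMO_MSG_SPLICE_PAGES;
	}
	return 0;
}
```

With scatter-gather available the model keeps the flag and shrinks the
initial buffer to the transport header; without it the flag is dropped and
the ordinary copy path handles the whole length.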

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/ip.h | 4 ++++
net/ipv4/ip_output.c | 11 ++++++-----
net/ipv6/ip6_output.c | 28 +++++++++++++++++++++++++---
3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..e27d2ceffcfa 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -211,6 +211,10 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb);
int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
__u8 tos);
void ip_init(void);
+int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
+ unsigned int fragheaderlen, unsigned int maxfraglen,
+ unsigned int hh_len);
+int __ip_splice_pages(struct sock *sk, struct sk_buff *skb, void *from, int *pcopy);
int ip_append_data(struct sock *sk, struct flowi4 *fl4,
int getfrag(void *from, char *to, int offset, int len,
int odd, struct sk_buff *skb),
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 41a954ac9e1a..fa2546d944bc 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -959,9 +959,9 @@ csum_page(struct page *page, int offset, int copy)
/*
* Allocate a packet for MSG_SPLICE_PAGES.
*/
-static int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
- unsigned int fragheaderlen, unsigned int maxfraglen,
- unsigned int hh_len)
+int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
+ unsigned int fragheaderlen, unsigned int maxfraglen,
+ unsigned int hh_len)
{
struct sk_buff *skb_prev = *pskb, *skb;
unsigned int fraggap = skb_prev->len - maxfraglen;
@@ -993,12 +993,12 @@ static int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
*pskb = skb;
return 0;
}
+EXPORT_SYMBOL_GPL(__ip_splice_alloc);

/*
* Add (or copy) data pages for MSG_SPLICE_PAGES.
*/
-static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
- void *from, int *pcopy)
+int __ip_splice_pages(struct sock *sk, struct sk_buff *skb, void *from, int *pcopy)
{
struct msghdr *msg = from;
struct page *page = NULL, **pages = &page;
@@ -1047,6 +1047,7 @@ static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
*pcopy = copy;
return 0;
}
+EXPORT_SYMBOL_GPL(__ip_splice_pages);

static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c314fdde0097..c95d034cb45a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1486,7 +1486,7 @@ static int __ip6_append_data(struct sock *sk,
struct rt6_info *rt = (struct rt6_info *)cork->dst;
struct ipv6_txoptions *opt = v6_cork->opt;
int csummode = CHECKSUM_NONE;
- unsigned int maxnonfragsize, headersize;
+ unsigned int maxnonfragsize, headersize, initial_length;
unsigned int wmem_alloc_delta = 0;
bool paged, extra_uref = false;

@@ -1559,6 +1559,7 @@ static int __ip6_append_data(struct sock *sk,
rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
csummode = CHECKSUM_PARTIAL;

+ initial_length = length;
if ((flags & MSG_ZEROCOPY) && length) {
struct msghdr *msg = from;

@@ -1589,6 +1590,14 @@ static int __ip6_append_data(struct sock *sk,
skb_zcopy_set(skb, uarg, &extra_uref);
}
}
+ } else if ((flags & MSG_SPLICE_PAGES) && length) {
+ if (inet_sk(sk)->hdrincl)
+ return -EPERM;
+ if (rt->dst.dev->features & NETIF_F_SG)
+ /* We need an empty buffer to attach stuff to */
+ initial_length = transhdrlen;
+ else
+ flags &= ~MSG_SPLICE_PAGES;
}

/*
@@ -1624,6 +1633,15 @@ static int __ip6_append_data(struct sock *sk,
unsigned int fraggap;
unsigned int alloclen, alloc_extra;
unsigned int pagedlen;
+
+ if (unlikely(flags & MSG_SPLICE_PAGES)) {
+ err = __ip_splice_alloc(sk, &skb, fragheaderlen,
+ maxfraglen, hh_len);
+ if (err < 0)
+ goto error;
+ continue;
+ }
+ initial_length = length;
alloc_new_skb:
/* There's no room in the current skb */
if (skb)
@@ -1642,7 +1660,7 @@ static int __ip6_append_data(struct sock *sk,
* If remaining data exceeds the mtu,
* we know we need more fragment(s).
*/
- datalen = length + fraggap;
+ datalen = initial_length + fraggap;

if (datalen > (cork->length <= mtu && !(cork->flags & IPCORK_ALLFRAG) ? mtu : maxfraglen) - fragheaderlen)
datalen = maxfraglen - fragheaderlen - rt->dst.trailer_len;
@@ -1672,7 +1690,7 @@ static int __ip6_append_data(struct sock *sk,
}
alloclen += alloc_extra;

- if (datalen != length + fraggap) {
+ if (datalen != initial_length + fraggap) {
/*
* this is not the last fragment, the trailer
* space is regarded as data space.
@@ -1778,6 +1796,10 @@ static int __ip6_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
+ } else if (flags & MSG_SPLICE_PAGES) {
+ err = __ip_splice_pages(sk, skb, from, &copy);
+ if (err < 0)
+ goto error;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;


2023-03-31 16:14:28

by David Howells

Subject: [PATCH v3 08/55] tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
tcp_sendmsg() copy it.
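
A rough userspace model of the splice-or-copy decision follows. It assumes
a toy struct demo_page with an explicit reference count and slab marker;
demo_sendpage_ok() and demo_splice_or_copy() are hypothetical names, and
calloc() stands in for the page fragment allocator used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy page: just enough state to model the check. */
struct demo_page {
	int refcount;
	bool slab;
	unsigned char data[64];
};

/* Same shape as the kernel test: not a slab page, and refcount >= 1. */
static bool demo_sendpage_ok(const struct demo_page *page)
{
	return !page->slab && page->refcount >= 1;
}

/*
 * Return the page to attach to the buffer, carrying one reference that
 * the buffer now owns. A spliceable page is shared with an extra
 * reference; anything else is duplicated. Returns NULL if the
 * duplicate cannot be allocated.
 */
static struct demo_page *demo_splice_or_copy(struct demo_page *page)
{
	struct demo_page *copy;

	if (demo_sendpage_ok(page)) {
		page->refcount++;
		return page;
	}

	copy = calloc(1, sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy->data, page->data, sizeof(copy->data));
	copy->refcount = 1;	/* the buffer holds the only reference */
	return copy;
}
```

The point of the model is the reference accounting: a shared page gains a
reference, while a freshly made copy already carries the one reference the
buffer needs, so no additional get is taken for it.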

Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 910b327c236e..6ef0518eb706 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1417,10 +1417,10 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto do_error;
copy = err;
} else if (zc == 2) {
- /* Splice in data. */
+ /* Splice in data if we can; copy if we can't. */
struct page *page = NULL, **pages = &page;
size_t off = 0, part;
- bool can_coalesce;
+ bool can_coalesce, put = false;
int i = skb_shinfo(skb)->nr_frags;

copy = iov_iter_extract_pages(&msg->msg_iter, &pages,
@@ -1447,12 +1447,34 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto wait_for_space;
copy = part;

+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;
+
+ q = page_frag_memdup(NULL, p + off, copy,
+ sk->sk_allocation, ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ err = copy ?: -ENOMEM;
+ goto do_error;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ can_coalesce = false;
+ }
+
if (can_coalesce) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
} else {
- get_page(page);
+ if (!put)
+ get_page(page);
+ put = false;
skb_fill_page_desc_noacc(skb, i, page, off, copy);
}
+ if (put)
+ put_page(page);
page = NULL;

if (!(flags & MSG_NO_SHARED_FRAGS))

2023-03-31 16:14:32

by David Howells

Subject: [PATCH v3 09/55] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES

Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself. do_tcp_sendpages() can then be
inlined in subsequent patches into its callers.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
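
As an illustration of the conversion, the sketch below models wrapping one
(page, offset, size) triple in a single-segment message and translating
the sendpage-style flag. All demo_* names and flag values are hypothetical
stand-ins for the kernel's bio_vec, msghdr and MSG_* definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flag bits; values are made up. */
enum {
	DEMO_MSG_MORE             = 1u << 0,
	DEMO_MSG_SPLICE_PAGES     = 1u << 1,
	DEMO_MSG_SENDPAGE_NOTLAST = 1u << 2,
};

/* One (page, length, offset) segment, loosely modelled on struct bio_vec. */
struct demo_bvec {
	const void *page;
	size_t len;
	size_t offset;
};

/* A message whose payload is described by an array of segments. */
struct demo_msghdr {
	unsigned int flags;
	const struct demo_bvec *segs;
	size_t nr_segs;
	size_t count;
};

/* Describe a single page fragment as a one-segment message. */
static void demo_wrap_page(struct demo_msghdr *msg, struct demo_bvec *bv,
			   const void *page, size_t offset, size_t size,
			   unsigned int flags)
{
	assert(msg != NULL && bv != NULL);

	bv->page = page;
	bv->len = size;
	bv->offset = offset;

	msg->flags = flags | DEMO_MSG_SPLICE_PAGES;
	if (flags & DEMO_MSG_SENDPAGE_NOTLAST)
		msg->flags |= DEMO_MSG_MORE;	/* more data will follow */

	msg->segs = bv;
	msg->nr_segs = 1;
	msg->count = size;
}
```

Because the payload is an array of segments rather than a single page, the
same message shape extends to several segments per call, which is what the
one-page sendpage interface could not express.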

Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 158 +++----------------------------------------------
1 file changed, 7 insertions(+), 151 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6ef0518eb706..edcf3a60c1b0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,163 +971,19 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}

-static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
- struct page *page, int offset, size_t *size)
-{
- struct sk_buff *skb = tcp_write_queue_tail(sk);
- struct tcp_sock *tp = tcp_sk(sk);
- bool can_coalesce;
- int copy, i;
-
- if (!skb || (copy = size_goal - skb->len) <= 0 ||
- !tcp_skb_can_collapse_to(skb)) {
-new_segment:
- if (!sk_stream_memory_free(sk))
- return NULL;
-
- skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
- tcp_rtx_and_write_queues_empty(sk));
- if (!skb)
- return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
- skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
- tcp_skb_entail(sk, skb);
- copy = size_goal;
- }
-
- if (copy > *size)
- copy = *size;
-
- i = skb_shinfo(skb)->nr_frags;
- can_coalesce = skb_can_coalesce(skb, i, page, offset);
- if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
- tcp_mark_push(tp, skb);
- goto new_segment;
- }
- if (tcp_downgrade_zcopy_pure(sk, skb))
- return NULL;
-
- copy = tcp_wmem_schedule(sk, copy);
- if (!copy)
- return NULL;
-
- if (can_coalesce) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
- } else {
- get_page(page);
- skb_fill_page_desc_noacc(skb, i, page, offset, copy);
- }
-
- if (!(flags & MSG_NO_SHARED_FRAGS))
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
- skb->len += copy;
- skb->data_len += copy;
- skb->truesize += copy;
- sk_wmem_queued_add(sk, copy);
- sk_mem_charge(sk, copy);
- WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
- TCP_SKB_CB(skb)->end_seq += copy;
- tcp_skb_pcount_set(skb, 0);
-
- *size = copy;
- return skb;
-}
-
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct tcp_sock *tp = tcp_sk(sk);
- int mss_now, size_goal;
- int err;
- ssize_t copied;
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-
- if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- WARN_ONCE(!sendpage_ok(page),
- "page must not be a Slab one and have page_count > 0"))
- return -EINVAL;
-
- /* Wait for a connection to finish. One exception is TCP Fast Open
- * (passive side) where data is allowed to be sent before a connection
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
- !tcp_passive_fastopen(sk)) {
- err = sk_stream_wait_connect(sk, &timeo);
- if (err != 0)
- goto out_err;
- }
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

- mss_now = tcp_send_mss(sk, &size_goal, flags);
- copied = 0;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- err = -EPIPE;
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
- while (size > 0) {
- struct sk_buff *skb;
- size_t copy = size;
-
- skb = tcp_build_frag(sk, size_goal, flags, page, offset, &copy);
- if (!skb)
- goto wait_for_space;
-
- if (!copied)
- TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
- copied += copy;
- offset += copy;
- size -= copy;
- if (!size)
- goto out;
-
- if (skb->len < size_goal || (flags & MSG_OOB))
- continue;
-
- if (forced_push(tp)) {
- tcp_mark_push(tp, skb);
- __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
- } else if (skb == tcp_send_head(sk))
- tcp_push_one(sk, mss_now);
- continue;
-
-wait_for_space:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- tcp_push(sk, flags & ~MSG_MORE, mss_now,
- TCP_NAGLE_PUSH, size_goal);
-
- err = sk_stream_wait_memory(sk, &timeo);
- if (err != 0)
- goto do_error;
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
- }
-
-out:
- if (copied) {
- tcp_tx_timestamp(sk, sk->sk_tsflags);
- if (!(flags & MSG_SENDPAGE_NOTLAST))
- tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
- }
- return copied;
-
-do_error:
- tcp_remove_empty_skb(sk);
- if (copied)
- goto out;
-out_err:
- /* make sure we wake any epoll edge trigger waiter */
- if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
- sk->sk_write_space(sk);
- tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
- }
- return sk_stream_error(sk, flags, err);
+ return tcp_sendmsg_locked(sk, &msg, size);
}
EXPORT_SYMBOL_GPL(do_tcp_sendpages);


2023-03-31 16:14:54

by David Howells

Subject: [PATCH v3 07/55] tcp: Support MSG_SPLICE_PAGES

Make TCP's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
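
The patch turns the boolean zc into a three-way mode. A simplified model
of how the mode is chosen is sketched below; demo_pick_mode() and the enum
names are hypothetical, and want_zerocopy collapses the msg_ubuf and
SOCK_ZEROCOPY handling of the real function into a single input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three data paths the send loop chooses between. */
enum demo_zc_mode {
	DEMO_ZC_COPY     = 0,	/* copy into page fragments */
	DEMO_ZC_ZEROCOPY = 1,	/* MSG_ZEROCOPY-style pinning */
	DEMO_ZC_SPLICE   = 2,	/* MSG_SPLICE_PAGES-style splicing */
};

/*
 * Pick the data path. Zerocopy is considered first; splicing is only
 * considered when zerocopy was not requested. Both need scatter-gather
 * support and a non-empty payload, otherwise the copy path is used.
 */
static enum demo_zc_mode demo_pick_mode(bool want_zerocopy, bool want_splice,
					bool dev_has_sg, size_t size)
{
	if (want_zerocopy && size)
		return dev_has_sg ? DEMO_ZC_ZEROCOPY : DEMO_ZC_COPY;
	if (want_splice && size)
		return dev_has_sg ? DEMO_ZC_SPLICE : DEMO_ZC_COPY;
	return DEMO_ZC_COPY;
}
```

In this model a splice request on a route without scatter-gather quietly
degrades to copying rather than failing, matching the way zc is left at 0
in the hunk below.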

Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/tcp.c | 67 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..910b327c236e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
int flags, err, copied = 0;
int mss_now = 0, size_goal, copied_syn = 0;
int process_backlog = 0;
- bool zc = false;
+ int zc = 0;
long timeo;

flags = msg->msg_flags;
@@ -1231,17 +1231,22 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (msg->msg_ubuf) {
uarg = msg->msg_ubuf;
net_zcopy_get(uarg);
- zc = sk->sk_route_caps & NETIF_F_SG;
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 1;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg) {
err = -ENOBUFS;
goto out_err;
}
- zc = sk->sk_route_caps & NETIF_F_SG;
- if (!zc)
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 1;
+ else
uarg_to_msgzc(uarg)->zerocopy = 0;
}
+ } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
+ if (sk->sk_route_caps & NETIF_F_SG)
+ zc = 2;
}

if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1304,7 +1309,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto do_error;

while (msg_data_left(msg)) {
- int copy = 0;
+ ssize_t copy = 0;

skb = tcp_write_queue_tail(sk);
if (skb)
@@ -1345,7 +1350,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);

- if (!zc) {
+ if (zc == 0) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1395,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
page_ref_inc(pfrag->page);
}
pfrag->offset += copy;
- } else {
+ } else if (zc == 1) {
/* First append to a fragless skb builds initial
* pure zerocopy skb
*/
@@ -1411,6 +1416,54 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (err < 0)
goto do_error;
copy = err;
+ } else if (zc == 2) {
+ /* Splice in data. */
+ struct page *page = NULL, **pages = &page;
+ size_t off = 0, part;
+ bool can_coalesce;
+ int i = skb_shinfo(skb)->nr_frags;
+
+ copy = iov_iter_extract_pages(&msg->msg_iter, &pages,
+ copy, 1, 0, &off);
+ if (copy <= 0) {
+ err = copy ?: -EIO;
+ goto do_error;
+ }
+
+ can_coalesce = skb_can_coalesce(skb, i, page, off);
+ if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+ tcp_mark_push(tp, skb);
+ iov_iter_revert(&msg->msg_iter, copy);
+ goto new_segment;
+ }
+ if (tcp_downgrade_zcopy_pure(sk, skb)) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ goto wait_for_space;
+ }
+
+ part = tcp_wmem_schedule(sk, copy);
+ iov_iter_revert(&msg->msg_iter, copy - part);
+ if (!part)
+ goto wait_for_space;
+ copy = part;
+
+ if (can_coalesce) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ } else {
+ get_page(page);
+ skb_fill_page_desc_noacc(skb, i, page, off, copy);
+ }
+ page = NULL;
+
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk_wmem_queued_add(sk, copy);
+ sk_mem_charge(sk, copy);
+
}

if (!copied)

2023-03-31 16:15:04

by David Howells

Subject: [PATCH v3 22/55] crypto: af_alg: Use netfs_extract_iter_to_sg() to create scatterlists

Use netfs_extract_iter_to_sg() to decant the destination iterator into a
scatterlist in af_alg_get_rsgl(). af_alg_make_sg() can then be removed.

[!] Note that if this fits, netfs_extract_iter_to_sg() should move to core
code.
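
For reference, the per-page arithmetic that the removed af_alg_make_sg()
performed can be modelled in userspace as below. This is a sketch of that
arithmetic only, not of how netfs_extract_iter_to_sg() is implemented;
demo_fill_sg(), struct demo_sg and DEMO_PAGE_SIZE are hypothetical names,
and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size for the model */

/* One scatterlist-like entry: which page, where in it, how much. */
struct demo_sg {
	size_t page_index;
	size_t offset;
	size_t length;
};

/*
 * Split a buffer that starts 'off' bytes into its first page and runs
 * for 'len' bytes into per-page entries. Returns the number of entries
 * written, at most max_ents.
 */
static size_t demo_fill_sg(struct demo_sg *sg, size_t max_ents,
			   size_t off, size_t len)
{
	size_t n = 0;

	assert(off < DEMO_PAGE_SIZE);

	while (len > 0 && n < max_ents) {
		size_t plen = DEMO_PAGE_SIZE - off;

		if (plen > len)
			plen = len;
		sg[n].page_index = n;
		sg[n].offset = off;
		sg[n].length = plen;

		off = 0;	/* only the first page starts mid-page */
		len -= plen;
		n++;
	}
	return n;
}
```

For example, a 5000-byte buffer starting at offset 4000 yields three
entries of 96, 4096 and 808 bytes.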

Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/af_alg.c | 55 ++++++++++-------------------------------
crypto/algif_aead.c | 12 ++++-----
crypto/algif_hash.c | 18 ++++++++++----
crypto/algif_skcipher.c | 2 +-
include/crypto/if_alg.h | 6 ++---
5 files changed, 35 insertions(+), 58 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 7caff10df643..1dafd088ad45 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -22,6 +22,7 @@
#include <linux/sched/signal.h>
#include <linux/security.h>
#include <linux/string.h>
+#include <linux/netfs.h>
#include <keys/user-type.h>
#include <keys/trusted-type.h>
#include <keys/encrypted-type.h>
@@ -531,45 +532,11 @@ static const struct net_proto_family alg_family = {
.owner = THIS_MODULE,
};

-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len)
-{
- struct page **pages = sgl->pages;
- size_t off;
- ssize_t n;
- int npages, i;
-
- n = iov_iter_extract_pages(iter, &pages, len, ALG_MAX_PAGES, 0, &off);
- if (n < 0)
- return n;
-
- sgl->need_unpin = iov_iter_extract_will_pin(iter);
-
- npages = DIV_ROUND_UP(off + n, PAGE_SIZE);
- if (WARN_ON(npages == 0))
- return -EINVAL;
- /* Add one extra for linking */
- sg_init_table(sgl->sg, npages + 1);
-
- for (i = 0, len = n; i < npages; i++) {
- int plen = min_t(int, len, PAGE_SIZE - off);
-
- sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);
-
- off = 0;
- len -= plen;
- }
- sg_mark_end(sgl->sg + npages - 1);
- sgl->npages = npages;
-
- return n;
-}
-EXPORT_SYMBOL_GPL(af_alg_make_sg);
-
static void af_alg_link_sg(struct af_alg_sgl *sgl_prev,
struct af_alg_sgl *sgl_new)
{
- sg_unmark_end(sgl_prev->sg + sgl_prev->npages - 1);
- sg_chain(sgl_prev->sg, sgl_prev->npages + 1, sgl_new->sg);
+ sg_unmark_end(sgl_prev->sgt.sgl + sgl_prev->sgt.nents - 1);
+ sg_chain(sgl_prev->sgt.sgl, sgl_prev->sgt.nents + 1, sgl_new->sgt.sgl);
}

void af_alg_free_sg(struct af_alg_sgl *sgl)
@@ -577,8 +544,8 @@ void af_alg_free_sg(struct af_alg_sgl *sgl)
int i;

if (sgl->need_unpin)
- for (i = 0; i < sgl->npages; i++)
- unpin_user_page(sgl->pages[i]);
+ for (i = 0; i < sgl->sgt.nents; i++)
+ unpin_user_page(sg_page(&sgl->sgt.sgl[i]));
}
EXPORT_SYMBOL_GPL(af_alg_free_sg);

@@ -1292,8 +1259,8 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,

while (maxsize > len && msg_data_left(msg)) {
struct af_alg_rsgl *rsgl;
+ ssize_t err;
size_t seglen;
- int err;

/* limit the amount of readable buffers */
if (!af_alg_readable(sk))
@@ -1310,16 +1277,20 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
return -ENOMEM;
}

- rsgl->sgl.npages = 0;
+ rsgl->sgl.sgt.sgl = rsgl->sgl.sgl;
+ rsgl->sgl.sgt.nents = 0;
+ rsgl->sgl.sgt.orig_nents = 0;
list_add_tail(&rsgl->list, &areq->rsgl_list);

- /* make one iovec available as scatterlist */
- err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen);
+ err = netfs_extract_iter_to_sg(&msg->msg_iter, seglen,
+ &rsgl->sgl.sgt, ALG_MAX_PAGES, 0);
if (err < 0) {
rsgl->sg_num_bytes = 0;
return err;
}

+ rsgl->sgl.need_unpin = iov_iter_extract_will_pin(&msg->msg_iter);
+
/* chain the new scatterlist with previous one */
if (areq->last_rsgl)
af_alg_link_sg(&areq->last_rsgl->sgl, &rsgl->sgl);
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 42493b4d8ce4..f6aa3856d8d5 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -210,7 +210,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
*/

/* Use the RX SGL as source (and destination) for crypto op. */
- rsgl_src = areq->first_rsgl.sgl.sg;
+ rsgl_src = areq->first_rsgl.sgl.sgt.sgl;

if (ctx->enc) {
/*
@@ -224,7 +224,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
* RX SGL: AAD || PT || Tag
*/
err = crypto_aead_copy_sgl(null_tfm, tsgl_src,
- areq->first_rsgl.sgl.sg, processed);
+ areq->first_rsgl.sgl.sgt.sgl, processed);
if (err)
goto free;
af_alg_pull_tsgl(sk, processed, NULL, 0);
@@ -242,7 +242,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,

/* Copy AAD || CT to RX SGL buffer for in-place operation. */
err = crypto_aead_copy_sgl(null_tfm, tsgl_src,
- areq->first_rsgl.sgl.sg, outlen);
+ areq->first_rsgl.sgl.sgt.sgl, outlen);
if (err)
goto free;

@@ -268,8 +268,8 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
/* RX SGL present */
struct af_alg_sgl *sgl_prev = &areq->last_rsgl->sgl;

- sg_unmark_end(sgl_prev->sg + sgl_prev->npages - 1);
- sg_chain(sgl_prev->sg, sgl_prev->npages + 1,
+ sg_unmark_end(sgl_prev->sgt.sgl + sgl_prev->sgt.nents - 1);
+ sg_chain(sgl_prev->sgt.sgl, sgl_prev->sgt.nents + 1,
areq->tsgl);
} else
/* no RX SGL present (e.g. authentication only) */
@@ -278,7 +278,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,

/* Initialize the crypto operation */
aead_request_set_crypt(&areq->cra_u.aead_req, rsgl_src,
- areq->first_rsgl.sgl.sg, used, ctx->iv);
+ areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);
aead_request_set_ad(&areq->cra_u.aead_req, ctx->aead_assoclen);
aead_request_set_tfm(&areq->cra_u.aead_req, tfm);

diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 1d017ec5c63c..f051fa624bd7 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -14,6 +14,7 @@
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/net.h>
+#include <linux/netfs.h>
#include <net/sock.h>

struct hash_ctx {
@@ -91,13 +92,20 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
if (len > limit)
len = limit;

- len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len);
+ ctx->sgl.sgt.sgl = ctx->sgl.sgl;
+ ctx->sgl.sgt.nents = 0;
+ ctx->sgl.sgt.orig_nents = 0;
+
+ len = netfs_extract_iter_to_sg(&msg->msg_iter, len,
+ &ctx->sgl.sgt, ALG_MAX_PAGES, 0);
if (len < 0) {
err = copied ? 0 : len;
goto unlock;
}

- ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, NULL, len);
+ ctx->sgl.need_unpin = iov_iter_extract_will_pin(&msg->msg_iter);
+
+ ahash_request_set_crypt(&ctx->req, ctx->sgl.sgt.sgl, NULL, len);

err = crypto_wait_req(crypto_ahash_update(&ctx->req),
&ctx->wait);
@@ -141,8 +149,8 @@ static ssize_t hash_sendpage(struct socket *sock, struct page *page,
flags |= MSG_MORE;

lock_sock(sk);
- sg_init_table(ctx->sgl.sg, 1);
- sg_set_page(ctx->sgl.sg, page, size, offset);
+ sg_init_table(ctx->sgl.sgl, 1);
+ sg_set_page(ctx->sgl.sgl, page, size, offset);

if (!(flags & MSG_MORE)) {
err = hash_alloc_result(sk, ctx);
@@ -151,7 +159,7 @@ static ssize_t hash_sendpage(struct socket *sock, struct page *page,
} else if (!ctx->more)
hash_free_result(sk, ctx);

- ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, ctx->result, size);
+ ahash_request_set_crypt(&ctx->req, ctx->sgl.sgl, ctx->result, size);

if (!(flags & MSG_MORE)) {
if (ctx->more)
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index ee8890ee8f33..a251cd6bd5b9 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -105,7 +105,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
/* Initialize the crypto operation */
skcipher_request_set_tfm(&areq->cra_u.skcipher_req, tfm);
skcipher_request_set_crypt(&areq->cra_u.skcipher_req, areq->tsgl,
- areq->first_rsgl.sgl.sg, len, ctx->iv);
+ areq->first_rsgl.sgl.sgt.sgl, len, ctx->iv);

if (msg->msg_iocb && !is_sync_kiocb(msg->msg_iocb)) {
/* AIO operation */
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 46494b33f5bc..34224e77f5a2 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -56,9 +56,8 @@ struct af_alg_type {
};

struct af_alg_sgl {
- struct scatterlist sg[ALG_MAX_PAGES + 1];
- struct page *pages[ALG_MAX_PAGES];
- unsigned int npages;
+ struct sg_table sgt;
+ struct scatterlist sgl[ALG_MAX_PAGES + 1];
bool need_unpin;
};

@@ -164,7 +163,6 @@ int af_alg_release(struct socket *sock);
void af_alg_release_parent(struct sock *sk);
int af_alg_accept(struct sock *sk, struct socket *newsock, bool kern);

-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len);
void af_alg_free_sg(struct af_alg_sgl *sgl);

static inline struct alg_sock *alg_sk(struct sock *sk)

2023-03-31 16:15:24

by David Howells

Subject: [PATCH v3 18/55] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES

Convert udp_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/udp.c | 50 +++++++++-----------------------------------------
1 file changed, 9 insertions(+), 41 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..097feb92e215 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1332,52 +1332,20 @@ EXPORT_SYMBOL(udp_sendmsg);
int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct inet_sock *inet = inet_sk(sk);
- struct udp_sock *up = udp_sk(sk);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE
+ };
int ret;

- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

- if (!up->pending) {
- struct msghdr msg = { .msg_flags = flags|MSG_MORE };
-
- /* Call udp_sendmsg to specify destination address which
- * sendpage interface can't pass.
- * This will succeed only when the socket is connected.
- */
- ret = udp_sendmsg(sk, &msg, 0);
- if (ret < 0)
- return ret;
- }
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

lock_sock(sk);
-
- if (unlikely(!up->pending)) {
- release_sock(sk);
-
- net_dbg_ratelimited("cork failed\n");
- return -EINVAL;
- }
-
- ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
- page, offset, size, flags);
- if (ret == -EOPNOTSUPP) {
- release_sock(sk);
- return sock_no_sendpage(sk->sk_socket, page, offset,
- size, flags);
- }
- if (ret < 0) {
- udp_flush_pending_frames(sk);
- goto out;
- }
-
- up->len += size;
- if (!(READ_ONCE(up->corkflag) || (flags&MSG_MORE)))
- ret = udp_push_pending_frames(sk);
- if (!ret)
- ret = size;
-out:
+ ret = udp_sendmsg(sk, &msg, size);
release_sock(sk);
return ret;
}

2023-03-31 16:15:30

by David Howells

Subject: [PATCH v3 16/55] ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
__ip_append_data() copy it.

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/ipv4/ip_output.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e2eaba817c1f..41a954ac9e1a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1004,13 +1004,32 @@ static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
struct page *page = NULL, **pages = &page;
ssize_t copy = *pcopy;
size_t off;
+ bool put = false;
int err;

copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
if (copy <= 0)
return copy ?: -EIO;

+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;
+
+ q = page_frag_memdup(NULL, p + off, copy,
+ sk->sk_allocation, ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ return -ENOMEM;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ }
+
err = skb_append_pagefrags(skb, page, off, copy);
+ if (put)
+ put_page(page);
if (err < 0) {
iov_iter_revert(&msg->msg_iter, copy);
return err;

2023-03-31 16:15:35

by David Howells

Subject: [PATCH v3 20/55] af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
unix_extract_bvec_to_skb() copy it.
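
One detail of this change is the return convention: the byte count already
attached is reported in preference to a later error, and the error is returned
only if nothing was attached. A minimal userspace sketch of that convention
follows; model_append_parts() and its room parameter are hypothetical, and the
sketch uses the standard conditional operator rather than the GNU `?:`
shorthand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Try to append @nr parts to a buffer with @room bytes of space left.
 * Stops at the first part that does not fit.
 * Returns the number of bytes appended if any, otherwise the error
 * (or 0 when there was nothing to do).
 */
static ssize_t model_append_parts(const size_t *parts, size_t nr, size_t room)
{
	ssize_t spliced = 0, ret = 0;
	size_t i;

	assert(parts || nr == 0);
	for (i = 0; i < nr; i++) {
		if (parts[i] > room) {
			ret = -EMSGSIZE;
			break;
		}
		room -= parts[i];
		spliced += (ssize_t)parts[i];
	}

	/* Partial progress wins over the error. */
	return spliced ? spliced : ret;
}
```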

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/unix/af_unix.c | 44 +++++++++++++++++++++++++++++++++-----------
1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index a9ad97f3c57f..88b91005567e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2154,12 +2154,12 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
/*
* Extract pages from an iterator and add them to the socket buffer.
*/
-static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
- struct iov_iter *iter, ssize_t maxsize)
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb, struct iov_iter *iter,
+ ssize_t maxsize, gfp_t gfp)
{
struct page *pages[8], **ppages = pages;
unsigned int i, nr;
- ssize_t ret = 0;
+ ssize_t spliced = 0, ret = 0;

while (iter->count > 0) {
size_t off, len;
@@ -2171,31 +2171,52 @@ static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,

len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off);
if (len <= 0) {
- if (!ret)
- ret = len ?: -EIO;
+ ret = len ?: -EIO;
break;
}

i = 0;
do {
+ struct page *page = pages[i++];
size_t part = min_t(size_t, PAGE_SIZE - off, len);
+ bool put = false;
+
+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;
+
+ q = page_frag_memdup(NULL, p + off, part, gfp,
+ ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(iter, len);
+ ret = -ENOMEM;
+ goto out;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ }

- if (skb_append_pagefrags(skb, pages[i++], off, part) < 0) {
- if (!ret)
- ret = -EMSGSIZE;
+ ret = skb_append_pagefrags(skb, page, off, part);
+ if (put)
+ put_page(page);
+ if (ret < 0) {
+ iov_iter_revert(iter, len);
goto out;
}
off = 0;
- ret += part;
+ spliced += part;
maxsize -= part;
len -= part;
} while (len > 0);
+
if (maxsize <= 0)
break;
}

out:
- return ret;
+ return spliced ?: ret;
}

static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
@@ -2272,7 +2293,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
fds_sent = true;

if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
- size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+ size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size,
+ sk->sk_allocation);
skb->data_len += size;
skb->len += size;
skb->truesize += size;

2023-03-31 16:15:40

by David Howells

Subject: [PATCH v3 24/55] crypto: af_alg: Support MSG_SPLICE_PAGES

Make AF_ALG sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

[!] Note that this makes use of netfs_extract_iter_to_sg() from netfslib.
This probably needs moving to core code somewhere.

Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/Kconfig | 1 +
crypto/af_alg.c | 28 ++++++++++++++++++++++++++--
crypto/algif_aead.c | 22 +++++++++++-----------
crypto/algif_skcipher.c | 8 ++++----
4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 9c86f7045157..8c04ecbb4395 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1297,6 +1297,7 @@ menu "Userspace interface"

config CRYPTO_USER_API
tristate
+ select NETFS_SUPPORT # for netfs_extract_iter_to_sg()

config CRYPTO_USER_API_HASH
tristate "Hash algorithms"
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 483821e310e9..3088ab298632 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -941,6 +941,10 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
bool init = false;
int err = 0;

+ if ((msg->msg_flags & MSG_SPLICE_PAGES) &&
+ !iov_iter_is_bvec(&msg->msg_iter))
+ return -EINVAL;
+
if (msg->msg_controllen) {
err = af_alg_cmsg_send(msg, &con);
if (err)
@@ -986,7 +990,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
while (size) {
struct scatterlist *sg;
size_t len = size;
- size_t plen;
+ ssize_t plen;

/* use the existing memory in an allocated page */
if (ctx->merge) {
@@ -1031,7 +1035,27 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
if (sgl->cur)
sg_unmark_end(sg + sgl->cur - 1);

- if (1 /* TODO check MSG_SPLICE_PAGES */) {
+ if (msg->msg_flags & MSG_SPLICE_PAGES) {
+ struct sg_table sgtable = {
+ .sgl = sg,
+ .nents = sgl->cur,
+ .orig_nents = sgl->cur,
+ };
+
+ plen = netfs_extract_iter_to_sg(&msg->msg_iter, len,
+ &sgtable, MAX_SGL_ENTS, 0);
+ if (plen < 0) {
+ err = plen;
+ goto unlock;
+ }
+
+ for (; sgl->cur < sgtable.nents; sgl->cur++)
+ get_page(sg_page(&sg[sgl->cur]));
+ len -= plen;
+ ctx->used += plen;
+ copied += plen;
+ size -= plen;
+ } else {
do {
struct page *pg;
unsigned int i = sgl->cur;
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index f6aa3856d8d5..b16111a3025a 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -9,8 +9,8 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
+ * filled by user space with the data submitted via sendmsg. Filling up
+ * the TX SGL does not cause a crypto operation -- the data will only be
* tracked by the kernel. Upon receipt of one recvmsg call, the caller must
* provide a buffer which is tracked with the RX SGL.
*
@@ -113,19 +113,19 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
}

/*
- * Data length provided by caller via sendmsg/sendpage that has not
- * yet been processed.
+ * Data length provided by caller via sendmsg that has not yet been
+ * processed.
*/
used = ctx->used;

/*
- * Make sure sufficient data is present -- note, the same check is
- * also present in sendmsg/sendpage. The checks in sendpage/sendmsg
- * shall provide an information to the data sender that something is
- * wrong, but they are irrelevant to maintain the kernel integrity.
- * We need this check here too in case user space decides to not honor
- * the error message in sendmsg/sendpage and still call recvmsg. This
- * check here protects the kernel integrity.
+ * Make sure sufficient data is present -- note, the same check is also
+ * present in sendmsg. The checks in sendmsg shall provide an
+ * information to the data sender that something is wrong, but they are
+ * irrelevant to maintain the kernel integrity. We need this check
+ * here too in case user space decides to not honor the error message
+ * in sendmsg and still call recvmsg. This check here protects the
+ * kernel integrity.
*/
if (!aead_sufficient_data(sk))
return -EINVAL;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index a251cd6bd5b9..b1f321b9f846 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -9,10 +9,10 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
- * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
- * provide a buffer which is tracked with the RX SGL.
+ * filled by user space with the data submitted via sendmsg. Filling up the TX
+ * SGL does not cause a crypto operation -- the data will only be tracked by
+ * the kernel. Upon receipt of one recvmsg call, the caller must provide a
+ * buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed

2023-03-31 16:15:51

by David Howells

Subject: [PATCH v3 19/55] af_unix: Support MSG_SPLICE_PAGES

Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the
source iterator if possible and copying the data in otherwise.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
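
The inner loop of the extraction helper splits one extracted run into per-page
parts: the first part runs from the starting offset to the end of its page and
later parts start at offset zero. A small userspace sketch of that arithmetic,
assuming a 4096-byte page (MODEL_PAGE_SIZE and model_split_run() are
hypothetical names for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for this sketch */

/*
 * Split a run of @len bytes starting @off bytes into its first page into
 * per-page part lengths. Writes at most @max_parts entries to @parts and
 * returns how many were written.
 */
static size_t model_split_run(size_t off, size_t len,
			      size_t *parts, size_t max_parts)
{
	size_t n = 0;

	assert(off < MODEL_PAGE_SIZE);
	while (len > 0 && n < max_parts) {
		size_t part = MODEL_PAGE_SIZE - off;

		if (part > len)
			part = len;
		parts[n++] = part;
		off = 0; /* only the first page starts mid-page */
		len -= part;
	}
	return n;
}
```

For example, a 9000-byte run starting 100 bytes into a page splits into parts
of 3996, 4096 and 908 bytes.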

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/unix/af_unix.c | 93 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 77 insertions(+), 16 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 347122c3575e..a9ad97f3c57f 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2151,6 +2151,53 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
}
#endif

+/*
+ * Extract pages from an iterator and add them to the socket buffer.
+ */
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
+ struct iov_iter *iter, ssize_t maxsize)
+{
+ struct page *pages[8], **ppages = pages;
+ unsigned int i, nr;
+ ssize_t ret = 0;
+
+ while (iter->count > 0) {
+ size_t off, len;
+
+ nr = min_t(size_t, MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags,
+ ARRAY_SIZE(pages));
+ if (nr == 0)
+ break;
+
+ len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off);
+ if (len <= 0) {
+ if (!ret)
+ ret = len ?: -EIO;
+ break;
+ }
+
+ i = 0;
+ do {
+ size_t part = min_t(size_t, PAGE_SIZE - off, len);
+
+ if (skb_append_pagefrags(skb, pages[i++], off, part) < 0) {
+ if (!ret)
+ ret = -EMSGSIZE;
+ goto out;
+ }
+ off = 0;
+ ret += part;
+ maxsize -= part;
+ len -= part;
+ } while (len > 0);
+ if (maxsize <= 0)
+ break;
+ }
+
+out:
+ return ret;
+}
+
static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
@@ -2194,19 +2241,25 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
while (sent < len) {
size = len - sent;

- /* Keep two messages in the pipe so it schedules better */
- size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ skb = sock_alloc_send_pskb(sk, 0, 0,
+ msg->msg_flags & MSG_DONTWAIT,
+ &err, 0);
+ } else {
+ /* Keep two messages in the pipe so it schedules better */
+ size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);

- /* allow fallback to order-0 allocations */
- size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+ /* allow fallback to order-0 allocations */
+ size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);

- data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+ data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));

- data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+ data_len = min_t(size_t, size, PAGE_ALIGN(data_len));

- skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
- msg->msg_flags & MSG_DONTWAIT, &err,
- get_order(UNIX_SKB_FRAGS_SZ));
+ skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+ msg->msg_flags & MSG_DONTWAIT, &err,
+ get_order(UNIX_SKB_FRAGS_SZ));
+ }
if (!skb)
goto out_err;

@@ -2218,13 +2271,21 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
}
fds_sent = true;

- skb_put(skb, size - data_len);
- skb->data_len = data_len;
- skb->len = size;
- err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
- if (err) {
- kfree_skb(skb);
- goto out_err;
+ if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+ size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+ skb->data_len += size;
+ skb->len += size;
+ skb->truesize += size;
+ refcount_add(size, &sk->sk_wmem_alloc);
+ } else {
+ skb_put(skb, size - data_len);
+ skb->data_len = data_len;
+ skb->len = size;
+ err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+ if (err) {
+ kfree_skb(skb);
+ goto out_err;
+ }
}

unix_state_lock(other);

2023-03-31 16:16:57

by David Howells

Subject: [PATCH v3 27/55] tls/device: Support MSG_SPLICE_PAGES

Make TLS's device sendmsg() support MSG_SPLICE_PAGES. This causes pages to
be spliced from the source iterator if possible and the data to be copied
if not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_device.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 6c593788dc25..f5c3b56ac1ce 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -508,7 +508,30 @@ static int tls_push_data(struct sock *sk,
zc_pfrag.offset = iter_offset.offset;
zc_pfrag.size = copy;
tls_append_frag(record, &zc_pfrag, copy);
+ } else if (copy && (flags & MSG_SPLICE_PAGES)) {
+ struct page_frag zc_pfrag;
+ struct page **pages = &zc_pfrag.page;
+ size_t off;
+
+ rc = iov_iter_extract_pages(iter_offset.msg_iter, &pages,
+ copy, 1, 0, &off);
+ if (rc <= 0) {
+ if (rc == 0)
+ rc = -EIO;
+ goto handle_error;
+ }
+ copy = rc;
+
+ if (!sendpage_ok(zc_pfrag.page)) {
+ iov_iter_revert(iter_offset.msg_iter, copy);
+ goto no_zcopy_this_page;
+ }
+
+ zc_pfrag.offset = off;
+ zc_pfrag.size = copy;
+ tls_append_frag(record, &zc_pfrag, copy);
} else if (copy) {
+no_zcopy_this_page:
copy = min_t(size_t, copy, pfrag->size - pfrag->offset);

rc = tls_device_copy_data(page_address(pfrag->page) +
@@ -571,6 +594,9 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
union tls_iter_offset iter;
int rc;

+ if (!tls_ctx->zerocopy_sendfile)
+ msg->msg_flags &= ~MSG_SPLICE_PAGES;
+
mutex_lock(&tls_ctx->tx_lock);
lock_sock(sk);


2023-03-31 16:17:01

by David Howells

Subject: [PATCH v3 28/55] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

Convert tls_device_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself. With that, the tls_iter_offset
union is no longer necessary and can be replaced with an iov_iter pointer
and the zc_page argument to tls_push_data() can also be removed.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
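
The shape of the converted sendpage wrapper can be modelled in userspace
roughly as follows. This is only a sketch: the MODEL_MSG_* bit values are made
up for illustration and are not the kernel's, and struct model_bvec and struct
model_msghdr are hypothetical stand-ins for the bio_vec and msghdr that the
real wrapper fills in.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative bit values for this sketch only; not the kernel's values. */
#define MODEL_MSG_MORE             0x1u
#define MODEL_MSG_OOB              0x2u
#define MODEL_MSG_SENDPAGE_NOTLAST 0x4u
#define MODEL_MSG_SPLICE_PAGES     0x8u

struct model_bvec {
	const void *page;
	size_t offset;
	size_t len;
};

struct model_msghdr {
	unsigned int flags;
	struct model_bvec bvec; /* single-segment source */
	size_t count;           /* bytes left to send */
};

/*
 * Describe one page as a single-segment message flagged for splicing.
 * "Not the last page" is translated into "more data follows".
 * Out-of-band sends are refused, as in the wrapper.
 */
static int model_sendpage_to_msg(const void *page, size_t offset, size_t size,
				 unsigned int flags, struct model_msghdr *msg)
{
	assert(msg);
	if (flags & MODEL_MSG_OOB)
		return -EOPNOTSUPP;

	msg->flags = flags | MODEL_MSG_SPLICE_PAGES;
	if (flags & MODEL_MSG_SENDPAGE_NOTLAST)
		msg->flags |= MODEL_MSG_MORE;

	msg->bvec.page = page;
	msg->bvec.offset = offset;
	msg->bvec.len = size;
	msg->count = size;
	return 0;
}
```

The point of the design is that the wrapper no longer has a data path of its
own: it only builds a message and hands it to the sendmsg implementation.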

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_device.c | 79 ++++++++++----------------------------------
1 file changed, 18 insertions(+), 61 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f5c3b56ac1ce..6cfd1577a212 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -424,16 +424,10 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
return 0;
}

-union tls_iter_offset {
- struct iov_iter *msg_iter;
- int offset;
-};
-
static int tls_push_data(struct sock *sk,
- union tls_iter_offset iter_offset,
+ struct iov_iter *iter,
size_t size, int flags,
- unsigned char record_type,
- struct page *zc_page)
+ unsigned char record_type)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_prot_info *prot = &tls_ctx->prot_info;
@@ -501,19 +495,12 @@ static int tls_push_data(struct sock *sk,
record = ctx->open_record;

copy = min_t(size_t, size, max_open_record_len - record->len);
- if (copy && zc_page) {
- struct page_frag zc_pfrag;
-
- zc_pfrag.page = zc_page;
- zc_pfrag.offset = iter_offset.offset;
- zc_pfrag.size = copy;
- tls_append_frag(record, &zc_pfrag, copy);
- } else if (copy && (flags & MSG_SPLICE_PAGES)) {
+ if (copy && (flags & MSG_SPLICE_PAGES)) {
struct page_frag zc_pfrag;
struct page **pages = &zc_pfrag.page;
size_t off;

- rc = iov_iter_extract_pages(iter_offset.msg_iter, &pages,
+ rc = iov_iter_extract_pages(iter, &pages,
copy, 1, 0, &off);
if (rc <= 0) {
if (rc == 0)
@@ -523,7 +510,7 @@ static int tls_push_data(struct sock *sk,
copy = rc;

if (!sendpage_ok(zc_pfrag.page)) {
- iov_iter_revert(iter_offset.msg_iter, copy);
+ iov_iter_revert(iter, copy);
goto no_zcopy_this_page;
}

@@ -536,7 +523,7 @@ static int tls_push_data(struct sock *sk,

rc = tls_device_copy_data(page_address(pfrag->page) +
pfrag->offset, copy,
- iter_offset.msg_iter);
+ iter);
if (rc)
goto handle_error;
tls_append_frag(record, pfrag, copy);
@@ -591,7 +578,6 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
unsigned char record_type = TLS_RECORD_TYPE_DATA;
struct tls_context *tls_ctx = tls_get_ctx(sk);
- union tls_iter_offset iter;
int rc;

if (!tls_ctx->zerocopy_sendfile)
@@ -606,8 +592,7 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
goto out;
}

- iter.msg_iter = &msg->msg_iter;
- rc = tls_push_data(sk, iter, size, msg->msg_flags, record_type, NULL);
+ rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags, record_type);

out:
release_sock(sk);
@@ -618,44 +603,18 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
int tls_device_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
- struct tls_context *tls_ctx = tls_get_ctx(sk);
- union tls_iter_offset iter_offset;
- struct iov_iter msg_iter;
- char *kaddr;
- struct kvec iov;
- int rc;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- mutex_lock(&tls_ctx->tx_lock);
- lock_sock(sk);
+ msg.msg_flags |= MSG_MORE;

- if (flags & MSG_OOB) {
- rc = -EOPNOTSUPP;
- goto out;
- }
-
- if (tls_ctx->zerocopy_sendfile) {
- iter_offset.offset = offset;
- rc = tls_push_data(sk, iter_offset, size,
- flags, TLS_RECORD_TYPE_DATA, page);
- goto out;
- }
-
- kaddr = kmap(page);
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- iov_iter_kvec(&msg_iter, ITER_SOURCE, &iov, 1, size);
- iter_offset.msg_iter = &msg_iter;
- rc = tls_push_data(sk, iter_offset, size, flags, TLS_RECORD_TYPE_DATA,
- NULL);
- kunmap(page);
+ if (flags & MSG_OOB)
+ return -EOPNOTSUPP;

-out:
- release_sock(sk);
- mutex_unlock(&tls_ctx->tx_lock);
- return rc;
+ bvec_set_page(&bvec, page, offset, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_device_sendmsg(sk, &msg, size);
}

struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
@@ -720,12 +679,10 @@ EXPORT_SYMBOL(tls_get_record);

static int tls_device_push_pending_record(struct sock *sk, int flags)
{
- union tls_iter_offset iter;
- struct iov_iter msg_iter;
+ struct iov_iter iter;

- iov_iter_kvec(&msg_iter, ITER_SOURCE, NULL, 0, 0);
- iter.msg_iter = &msg_iter;
- return tls_push_data(sk, iter, 0, flags, TLS_RECORD_TYPE_DATA, NULL);
+ iov_iter_kvec(&iter, ITER_SOURCE, NULL, 0, 0);
+ return tls_push_data(sk, &iter, 0, flags, TLS_RECORD_TYPE_DATA);
}

void tls_device_write_space(struct sock *sk, struct tls_context *ctx)

2023-03-31 16:17:22

by David Howells

Subject: [PATCH v3 29/55] tls/sw: Support MSG_SPLICE_PAGES

Make TLS's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible and the data to be copied if
not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_sw.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 782d3701b86f..ce0c289e68ca 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -929,6 +929,49 @@ static int tls_sw_push_pending_record(struct sock *sk, int flags)
&copied, flags);
}

+static int rls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,
+ struct sk_msg *msg_pl, size_t try_to_copy,
+ ssize_t *copied)
+{
+ struct page *page, **pages = &page;
+
+ do {
+ ssize_t part;
+ size_t off;
+ bool put = false;
+
+ part = iov_iter_extract_pages(&msg->msg_iter, &pages,
+ try_to_copy, 1, 0, &off);
+ if (part <= 0)
+ return part ?: -EIO;
+
+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;
+
+ q = page_frag_memdup(NULL, p + off, part,
+ sk->sk_allocation, ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(&msg->msg_iter, part);
+ return -ENOMEM;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ }
+
+ sk_msg_page_add(msg_pl, page, part, off);
+ sk_mem_charge(sk, part);
+ if (put)
+ put_page(page);
+ *copied += part;
+ try_to_copy -= part;
+ } while (try_to_copy && !sk_msg_full(msg_pl));
+
+ return 0;
+}
+
int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
@@ -1016,6 +1059,17 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
full_record = true;
}

+ if (try_to_copy && (msg->msg_flags & MSG_SPLICE_PAGES)) {
+ ret = rls_sw_sendmsg_splice(sk, msg, msg_pl,
+ try_to_copy, &copied);
+ if (ret < 0)
+ goto send_end;
+ tls_ctx->pending_open_record_frags = true;
+ if (full_record || eor || sk_msg_full(msg_pl))
+ goto copied;
+ continue;
+ }
+
if (!is_kvec && (full_record || eor) && !async_capable) {
u32 first = msg_pl->sg.end;

@@ -1078,8 +1132,9 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
/* Open records defined only if successfully copied, otherwise
* we would trim the sg but not reset the open record frags.
*/
- tls_ctx->pending_open_record_frags = true;
copied += try_to_copy;
+copied:
+ tls_ctx->pending_open_record_frags = true;
if (full_record || eor) {
ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
record_type, &copied,

2023-03-31 16:17:25

by David Howells

Subject: [PATCH v3 30/55] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES

Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg()
with MSG_SPLICE_PAGES rather than directly splicing in the pages themselves.

[!] Note that tls_sw_sendpage_locked() appears to have the wrong locking
upstream. I think the caller will only hold the socket lock, but it
should hold tls_ctx->tx_lock too.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
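
The split into a locked core and a lock-taking wrapper can be sketched in
userspace like this. The two booleans are hypothetical stand-ins for
tls_ctx->tx_lock and the socket lock, and the sketch only models who holds
what; it does no real locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two locks involved; no real locking is done. */
struct model_sock {
	bool tx_lock_held;   /* models tls_ctx->tx_lock */
	bool sock_lock_held; /* models the socket lock */
	size_t sent;
};

/* Core send path: the caller must already hold the socket lock. */
static int model_sendmsg_locked(struct model_sock *sk, size_t size)
{
	assert(sk && sk->sock_lock_held);
	sk->sent += size;
	return (int)size;
}

/* Unlocked entry point: takes both locks around the core path. */
static int model_sendmsg(struct model_sock *sk, size_t size)
{
	int ret;

	sk->tx_lock_held = true;
	sk->sock_lock_held = true;
	ret = model_sendmsg_locked(sk, size);
	sk->sock_lock_held = false;
	sk->tx_lock_held = false;
	return ret;
}
```

A caller that enters through the locked variant with only the socket lock held
never sets tx_lock_held in this model, which mirrors the locking concern noted
above.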

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_sw.c | 158 +++++++++--------------------------------------
1 file changed, 28 insertions(+), 130 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ce0c289e68ca..256824fca651 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -972,7 +972,7 @@ static int rls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,
return 0;
}

-int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -995,13 +995,6 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
int ret = 0;
int pending;

- if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
- MSG_CMSG_COMPAT))
- return -EOPNOTSUPP;
-
- mutex_lock(&tls_ctx->tx_lock);
- lock_sock(sk);
-
if (unlikely(msg->msg_controllen)) {
ret = tls_process_cmsg(sk, msg, &record_type);
if (ret) {
@@ -1202,155 +1195,60 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)

send_end:
ret = sk_stream_error(sk, msg->msg_flags, ret);
-
- release_sock(sk);
- mutex_unlock(&tls_ctx->tx_lock);
return copied > 0 ? copied : ret;
}

-static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
- int offset, size_t size, int flags)
+int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
struct tls_context *tls_ctx = tls_get_ctx(sk);
- struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
- struct tls_prot_info *prot = &tls_ctx->prot_info;
- unsigned char record_type = TLS_RECORD_TYPE_DATA;
- struct sk_msg *msg_pl;
- struct tls_rec *rec;
- int num_async = 0;
- ssize_t copied = 0;
- bool full_record;
- int record_room;
- int ret = 0;
- bool eor;
-
- eor = !(flags & MSG_SENDPAGE_NOTLAST);
- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
- /* Call the sk_stream functions to manage the sndbuf mem. */
- while (size > 0) {
- size_t copy, required_size;
-
- if (sk->sk_err) {
- ret = -sk->sk_err;
- goto sendpage_end;
- }
-
- if (ctx->open_rec)
- rec = ctx->open_rec;
- else
- rec = ctx->open_rec = tls_get_rec(sk);
- if (!rec) {
- ret = -ENOMEM;
- goto sendpage_end;
- }
-
- msg_pl = &rec->msg_plaintext;
-
- full_record = false;
- record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size;
- copy = size;
- if (copy >= record_room) {
- copy = record_room;
- full_record = true;
- }
-
- required_size = msg_pl->sg.size + copy + prot->overhead_size;
-
- if (!sk_stream_memory_free(sk))
- goto wait_for_sndbuf;
-alloc_payload:
- ret = tls_alloc_encrypted_msg(sk, required_size);
- if (ret) {
- if (ret != -ENOSPC)
- goto wait_for_memory;
-
- /* Adjust copy according to the amount that was
- * actually allocated. The difference is due
- * to max sg elements limit
- */
- copy -= required_size - msg_pl->sg.size;
- full_record = true;
- }
-
- sk_msg_page_add(msg_pl, page, copy, offset);
- sk_mem_charge(sk, copy);
-
- offset += copy;
- size -= copy;
- copied += copy;
-
- tls_ctx->pending_open_record_frags = true;
- if (full_record || eor || sk_msg_full(msg_pl)) {
- ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
- record_type, &copied, flags);
- if (ret) {
- if (ret == -EINPROGRESS)
- num_async++;
- else if (ret == -ENOMEM)
- goto wait_for_memory;
- else if (ret != -EAGAIN) {
- if (ret == -ENOSPC)
- ret = 0;
- goto sendpage_end;
- }
- }
- }
- continue;
-wait_for_sndbuf:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
- ret = sk_stream_wait_memory(sk, &timeo);
- if (ret) {
- if (ctx->open_rec)
- tls_trim_both_msgs(sk, msg_pl->sg.size);
- goto sendpage_end;
- }
+ int ret;

- if (ctx->open_rec)
- goto alloc_payload;
- }
+ if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
+ MSG_CMSG_COMPAT | MSG_SPLICE_PAGES |
+ MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
+ return -EOPNOTSUPP;

- if (num_async) {
- /* Transmit if any encryptions have completed */
- if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
- cancel_delayed_work(&ctx->tx_work.work);
- tls_tx_records(sk, flags);
- }
- }
-sendpage_end:
- ret = sk_stream_error(sk, flags, ret);
- return copied > 0 ? copied : ret;
+ mutex_lock(&tls_ctx->tx_lock);
+ lock_sock(sk);
+ ret = tls_sw_sendmsg_locked(sk, msg, size);
+ release_sock(sk);
+ mutex_unlock(&tls_ctx->tx_lock);
+ return ret;
}

int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+
if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
MSG_NO_SHARED_FRAGS))
return -EOPNOTSUPP;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- return tls_sw_do_sendpage(sk, page, offset, size, flags);
+ bvec_set_page(&bvec, page, offset, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_sw_sendmsg_locked(sk, &msg, size);
}

int tls_sw_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
- struct tls_context *tls_ctx = tls_get_ctx(sk);
- int ret;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
return -EOPNOTSUPP;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- mutex_lock(&tls_ctx->tx_lock);
- lock_sock(sk);
- ret = tls_sw_do_sendpage(sk, page, offset, size, flags);
- release_sock(sk);
- mutex_unlock(&tls_ctx->tx_lock);
- return ret;
+ bvec_set_page(&bvec, page, offset, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_sw_sendmsg(sk, &msg, size);
}

static int

2023-03-31 16:17:59

by David Howells

Subject: [PATCH v3 33/55] kcm: Support MSG_SPLICE_PAGES

Make AF_KCM sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible and copied otherwise.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
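
The splice path first checks whether the new data can be merged into the
previous fragment before adding a new one. A userspace sketch of that
coalescing rule, assuming a fragment can be extended only when the new data is
on the same page and starts exactly where the last fragment ends (struct
model_skb, MODEL_MAX_FRAGS and model_add_frag() are hypothetical, and the
-EMSGSIZE result for a full fragment array is an assumption of the sketch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 4u /* assumed small limit for this sketch */

struct model_frag {
	const void *page;
	size_t off;
	size_t size;
};

struct model_skb {
	struct model_frag frags[MODEL_MAX_FRAGS];
	unsigned int nr_frags;
	size_t len;
};

/*
 * Add [off, off + size) of @page to @skb.
 * If it directly follows the last fragment on the same page, that fragment
 * is extended; otherwise a new fragment slot is used.
 */
static int model_add_frag(struct model_skb *skb, const void *page,
			  size_t off, size_t size)
{
	struct model_frag *last;

	assert(skb && page);
	if (skb->nr_frags > 0) {
		last = &skb->frags[skb->nr_frags - 1];
		if (last->page == page && last->off + last->size == off) {
			last->size += size; /* coalesce */
			skb->len += size;
			return 0;
		}
	}

	if (skb->nr_frags == MODEL_MAX_FRAGS)
		return -EMSGSIZE;

	skb->frags[skb->nr_frags].page = page;
	skb->frags[skb->nr_frags].off = off;
	skb->frags[skb->nr_frags].size = size;
	skb->nr_frags++;
	skb->len += size;
	return 0;
}
```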

Signed-off-by: David Howells <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/kcm/kcmsock.c | 89 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 72 insertions(+), 17 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index cfe828bd7fc6..0a3f79d81595 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -989,29 +989,84 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
merge = false;
}

- copy = min_t(int, msg_data_left(msg),
- pfrag->size - pfrag->offset);
+ if (msg->msg_flags & MSG_SPLICE_PAGES) {
+ struct page *page = NULL, **pages = &page;
+ size_t off;
+ bool put = false;
+
+ err = iov_iter_extract_pages(&msg->msg_iter, &pages,
+ INT_MAX, 1, 0, &off);
+ if (err <= 0) {
+ err = err ?: -EIO;
+ goto out_error;
+ }
+ copy = err;

- if (!sk_wmem_schedule(sk, copy))
- goto wait_for_memory;
+ if (skb_can_coalesce(skb, i, page, off)) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ goto coalesced;
+ }

- err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
- pfrag->page,
- pfrag->offset,
- copy);
- if (err)
- goto out_error;
+ if (!sk_wmem_schedule(sk, copy)) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ goto wait_for_memory;
+ }
+
+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;

- /* Update the skb. */
- if (merge) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ q = page_frag_memdup(NULL, p + off, copy,
+ sk->sk_allocation, ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ err = -ENOMEM;
+ goto out_error;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ }
+
+ skb_fill_page_desc_noacc(skb, i, page, off, copy);
+ if (!put)
+ get_page(page);
+coalesced:
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk->sk_wmem_queued += copy;
+ sk_mem_charge(sk, copy);
+
+ if (head != skb)
+ head->truesize += copy;
} else {
- skb_fill_page_desc(skb, i, pfrag->page,
- pfrag->offset, copy);
- get_page(pfrag->page);
+ copy = min_t(int, msg_data_left(msg),
+ pfrag->size - pfrag->offset);
+ if (!sk_wmem_schedule(sk, copy))
+ goto wait_for_memory;
+
+ err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
+ pfrag->page,
+ pfrag->offset,
+ copy);
+ if (err)
+ goto out_error;
+
+ /* Update the skb. */
+ if (merge) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+ } else {
+ skb_fill_page_desc(skb, i, pfrag->page,
+ pfrag->offset, copy);
+ get_page(pfrag->page);
+ }
+
+ pfrag->offset += copy;
}

- pfrag->offset += copy;
copied += copy;
if (head != skb) {
head->len += copy;

2023-03-31 16:18:03

by David Howells

Subject: [PATCH v3 32/55] chelsio: Convert chtls_sendpage() to use MSG_SPLICE_PAGES

Convert chtls_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
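
The conversion pattern used here (wrap the single page in a one-element
bio_vec, translate the flags, hand off to sendmsg) can be modelled in
userspace roughly as follows. The struct definitions are simplified stand-ins
for the kernel types, and the DEMO_* flag values, demo_msghdr, demo_sendmsg()
and demo_sendpage() are hypothetical. The one detail taken from the kernel is
the argument order of bvec_set_page(): length first, then offset.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types; illustrative only. */
struct page { unsigned char data[4096]; };

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Same argument order as the kernel helper: length first, then offset. */
static void bvec_set_page(struct bio_vec *bv, struct page *page,
			  unsigned int len, unsigned int offset)
{
	bv->bv_page = page;
	bv->bv_len = len;
	bv->bv_offset = offset;
}

/* Hypothetical flag values, chosen only for this sketch. */
enum {
	DEMO_MSG_MORE             = 0x1,
	DEMO_MSG_SENDPAGE_NOTLAST = 0x2,
	DEMO_MSG_SPLICE_PAGES     = 0x4,
};

/* Hypothetical, much-reduced message header carrying a bvec "iterator". */
struct demo_msghdr {
	unsigned int msg_flags;
	const struct bio_vec *bvec;
	size_t nr_segs;
	size_t count;
};

static unsigned int demo_last_flags;	/* records what the shim passed down */

/* Hypothetical sendmsg(): reports how many bytes the segments describe. */
static long demo_sendmsg(const struct demo_msghdr *msg)
{
	size_t i, total = 0;

	demo_last_flags = msg->msg_flags;
	for (i = 0; i < msg->nr_segs; i++)
		total += msg->bvec[i].bv_len;
	return (long)(total < msg->count ? total : msg->count);
}

/* The sendpage-style entry point reduced to a thin wrapper. */
static long demo_sendpage(struct page *page, int offset, size_t size, int flags)
{
	struct bio_vec bvec;
	struct demo_msghdr msg = {
		.msg_flags = (unsigned int)flags | DEMO_MSG_SPLICE_PAGES,
	};

	if (flags & DEMO_MSG_SENDPAGE_NOTLAST)
		msg.msg_flags |= DEMO_MSG_MORE;

	bvec_set_page(&bvec, page, (unsigned int)size, (unsigned int)offset);
	msg.bvec = &bvec;
	msg.nr_segs = 1;
	msg.count = size;
	return demo_sendmsg(&msg);
}
```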

Signed-off-by: David Howells <[email protected]>
cc: Ayush Sawal <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
.../chelsio/inline_crypto/chtls/chtls_io.c | 109 ++----------------
1 file changed, 7 insertions(+), 102 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
index ca3daf5df95c..5c397cb57300 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
@@ -1288,110 +1288,15 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
int chtls_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
- struct chtls_sock *csk;
- struct chtls_dev *cdev;
- int mss, err, copied;
- struct tcp_sock *tp;
- long timeo;
-
- tp = tcp_sk(sk);
- copied = 0;
- csk = rcu_dereference_sk_user_data(sk);
- cdev = csk->cdev;
- lock_sock(sk);
- timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

- err = sk_stream_wait_connect(sk, &timeo);
- if (!sk_in_state(sk, TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
- err != 0)
- goto out_err;
-
- mss = csk->mss;
- csk_set_flag(csk, CSK_TX_MORE_DATA);
-
- while (size > 0) {
- struct sk_buff *skb = skb_peek_tail(&csk->txq);
- int copy, i;
-
- if (!skb || (ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND) ||
- (copy = mss - skb->len) <= 0) {
-new_buf:
- if (!csk_mem_free(cdev, sk))
- goto wait_for_sndbuf;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- if (is_tls_tx(csk)) {
- skb = get_record_skb(sk,
- select_size(sk, size,
- flags,
- TX_TLSHDR_LEN),
- true);
- } else {
- skb = get_tx_skb(sk, 0);
- }
- if (!skb)
- goto wait_for_memory;
- copy = mss;
- }
- if (copy > size)
- copy = size;
-
- i = skb_shinfo(skb)->nr_frags;
- if (skb_can_coalesce(skb, i, page, offset)) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
- } else if (i < MAX_SKB_FRAGS) {
- get_page(page);
- skb_fill_page_desc(skb, i, page, offset, copy);
- } else {
- tx_skb_finalize(skb);
- push_frames_if_head(sk);
- goto new_buf;
- }
-
- skb->len += copy;
- if (skb->len == mss)
- tx_skb_finalize(skb);
- skb->data_len += copy;
- skb->truesize += copy;
- sk->sk_wmem_queued += copy;
- tp->write_seq += copy;
- copied += copy;
- offset += copy;
- size -= copy;
-
- if (corked(tp, flags) &&
- (sk_stream_wspace(sk) < sk_stream_min_wspace(sk)))
- ULP_SKB_CB(skb)->flags |= ULPCB_FLAG_NO_APPEND;
-
- if (!size)
- break;
-
- if (unlikely(ULP_SKB_CB(skb)->flags & ULPCB_FLAG_NO_APPEND))
- push_frames_if_head(sk);
- continue;
-wait_for_sndbuf:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
- err = csk_wait_memory(cdev, sk, &timeo);
- if (err)
- goto do_error;
- }
-out:
- csk_reset_flag(csk, CSK_TX_MORE_DATA);
- if (copied)
- chtls_tcp_push(sk, flags);
-done:
- release_sock(sk);
- return copied;
-
-do_error:
- if (copied)
- goto out;
-
-out_err:
- if (csk_conn_inline(csk))
- csk_reset_flag(csk, CSK_TX_MORE_DATA);
- copied = sk_stream_error(sk, flags, err);
- goto done;
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return chtls_sendmsg(sk, &msg, size);
}

static void chtls_select_window(struct sock *sk)

2023-03-31 16:18:09

by David Howells

Subject: [PATCH v3 31/55] chelsio: Support MSG_SPLICE_PAGES

Make Chelsio's TLS offload sendmsg() support MSG_SPLICE_PAGES, splicing in
pages from the source iterator if possible and copying the data in
otherwise.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
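
The per-page bookkeeping in the splice loop (extend the last fragment when the
new piece is contiguous with it, otherwise take a new slot, otherwise ask for a
new buffer) can be sketched in userspace as below. All names here (demo_buf,
demo_frag, demo_can_coalesce(), demo_add(), DEMO_MAX_FRAGS) are hypothetical
and only loosely model skb_can_coalesce() and the MAX_SKB_FRAGS limit.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_FRAGS 4	/* arbitrary small limit for the sketch */

struct demo_frag {
	const void *page;	/* identity of the backing page */
	size_t off;
	size_t len;
};

struct demo_buf {
	struct demo_frag frags[DEMO_MAX_FRAGS];
	int nr_frags;
};

/* True if (page, off) starts exactly where the last fragment ends. */
static bool demo_can_coalesce(const struct demo_buf *b, const void *page,
			      size_t off)
{
	const struct demo_frag *last;

	if (b->nr_frags == 0)
		return false;
	last = &b->frags[b->nr_frags - 1];
	return last->page == page && last->off + last->len == off;
}

/*
 * Add 'len' bytes at (page, off).  Returns 0 on success, or -1 when every
 * slot is taken and the caller would have to start a new buffer.
 */
static int demo_add(struct demo_buf *b, const void *page, size_t off,
		    size_t len)
{
	if (demo_can_coalesce(b, page, off)) {
		b->frags[b->nr_frags - 1].len += len;	/* grow in place */
		return 0;
	}
	if (b->nr_frags >= DEMO_MAX_FRAGS)
		return -1;
	b->frags[b->nr_frags++] = (struct demo_frag){ page, off, len };
	return 0;
}
```

Note that a new fragment is recorded with the length of the piece being added,
not with the running total of bytes spliced so far.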

Signed-off-by: David Howells <[email protected]>
cc: Ayush Sawal <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
.../chelsio/inline_crypto/chtls/chtls_io.c | 60 ++++++++++++++++++-
1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
index ae6b17b96bf1..ca3daf5df95c 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
@@ -1092,7 +1092,65 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > size)
copy = size;

- if (skb_tailroom(skb) > 0) {
+ if (msg->msg_flags & MSG_SPLICE_PAGES) {
+ struct page *page, **pages = &page;
+ ssize_t part;
+ size_t off, spliced = 0;
+ bool put = false;
+ int i;
+
+ do {
+ i = skb_shinfo(skb)->nr_frags;
+ part = iov_iter_extract_pages(&msg->msg_iter, &pages,
+ copy - spliced, 1, 0, &off);
+ if (part <= 0) {
+ err = part ?: -EIO;
+ goto do_fault;
+ }
+
+ if (!sendpage_ok(page)) {
+ const void *p = kmap_local_page(page);
+ void *q;
+
+ q = page_frag_memdup(NULL, p + off, part,
+ sk->sk_allocation, ULONG_MAX);
+ kunmap_local(p);
+ if (!q) {
+ iov_iter_revert(&msg->msg_iter, part);
+ return -ENOMEM;
+ }
+ page = virt_to_page(q);
+ off = offset_in_page(q);
+ put = true;
+ }
+
+ if (skb_can_coalesce(skb, i, page, off)) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], part);
+ spliced += part;
+ if (put)
+ put_page(page);
+ } else if (i < MAX_SKB_FRAGS) {
+ if (!put)
+ get_page(page);
+ skb_fill_page_desc(skb, i, page, off, part);
+ spliced += part;
+ put = false;
+ } else {
+ if (put)
+ put_page(page);
+ if (!spliced)
+ goto new_buf;
+ break;
+ }
+ } while (spliced < copy);
+
+ copy = spliced;
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk->sk_wmem_queued += copy;
+
+ } else if (skb_tailroom(skb) > 0) {
copy = min(copy, skb_tailroom(skb));
if (is_tls_tx(csk))
copy = min_t(int, copy, csk->tlshws.txleft);

2023-03-31 16:18:25

by David Howells

Subject: [PATCH v3 34/55] kcm: Convert kcm_sendpage() to use MSG_SPLICE_PAGES

Convert kcm_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/kcm/kcmsock.c | 161 ++++++----------------------------------------
1 file changed, 18 insertions(+), 143 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 0a3f79d81595..d77d28fbf389 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -761,149 +761,6 @@ static void kcm_push(struct kcm_sock *kcm)
kcm_write_msgs(kcm);
}

-static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-
-{
- struct sock *sk = sock->sk;
- struct kcm_sock *kcm = kcm_sk(sk);
- struct sk_buff *skb = NULL, *head = NULL;
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
- bool eor;
- int err = 0;
- int i;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- /* No MSG_EOR from splice, only look at MSG_MORE */
- eor = !(flags & MSG_MORE);
-
- lock_sock(sk);
-
- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
- err = -EPIPE;
- if (sk->sk_err)
- goto out_error;
-
- if (kcm->seq_skb) {
- /* Previously opened message */
- head = kcm->seq_skb;
- skb = kcm_tx_msg(head)->last_skb;
- i = skb_shinfo(skb)->nr_frags;
-
- if (skb_can_coalesce(skb, i, page, offset)) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
- goto coalesced;
- }
-
- if (i >= MAX_SKB_FRAGS) {
- struct sk_buff *tskb;
-
- tskb = alloc_skb(0, sk->sk_allocation);
- while (!tskb) {
- kcm_push(kcm);
- err = sk_stream_wait_memory(sk, &timeo);
- if (err)
- goto out_error;
- }
-
- if (head == skb)
- skb_shinfo(head)->frag_list = tskb;
- else
- skb->next = tskb;
-
- skb = tskb;
- skb->ip_summed = CHECKSUM_UNNECESSARY;
- i = 0;
- }
- } else {
- /* Call the sk_stream functions to manage the sndbuf mem. */
- if (!sk_stream_memory_free(sk)) {
- kcm_push(kcm);
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- err = sk_stream_wait_memory(sk, &timeo);
- if (err)
- goto out_error;
- }
-
- head = alloc_skb(0, sk->sk_allocation);
- while (!head) {
- kcm_push(kcm);
- err = sk_stream_wait_memory(sk, &timeo);
- if (err)
- goto out_error;
- }
-
- skb = head;
- i = 0;
- }
-
- get_page(page);
- skb_fill_page_desc_noacc(skb, i, page, offset, size);
- skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
-coalesced:
- skb->len += size;
- skb->data_len += size;
- skb->truesize += size;
- sk->sk_wmem_queued += size;
- sk_mem_charge(sk, size);
-
- if (head != skb) {
- head->len += size;
- head->data_len += size;
- head->truesize += size;
- }
-
- if (eor) {
- bool not_busy = skb_queue_empty(&sk->sk_write_queue);
-
- /* Message complete, queue it on send buffer */
- __skb_queue_tail(&sk->sk_write_queue, head);
- kcm->seq_skb = NULL;
- KCM_STATS_INCR(kcm->stats.tx_msgs);
-
- if (flags & MSG_BATCH) {
- kcm->tx_wait_more = true;
- } else if (kcm->tx_wait_more || not_busy) {
- err = kcm_write_msgs(kcm);
- if (err < 0) {
- /* We got a hard error in write_msgs but have
- * already queued this message. Report an error
- * in the socket, but don't affect return value
- * from sendmsg
- */
- pr_warn("KCM: Hard failure on kcm_write_msgs\n");
- report_csk_error(&kcm->sk, -err);
- }
- }
- } else {
- /* Message not complete, save state */
- kcm->seq_skb = head;
- kcm_tx_msg(head)->last_skb = skb;
- }
-
- KCM_STATS_ADD(kcm->stats.tx_bytes, size);
-
- release_sock(sk);
- return size;
-
-out_error:
- kcm_push(kcm);
-
- err = sk_stream_error(sk, flags, err);
-
- /* make sure we wake any epoll edge trigger waiter */
- if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
- sk->sk_write_space(sk);
-
- release_sock(sk);
- return err;
-}
-
static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
{
struct sock *sk = sock->sk;
@@ -1143,6 +1000,24 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return err;
}

+static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
+ int offset, size_t size, int flags)
+
+{
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;
+
+ if (flags & MSG_OOB)
+ return -EOPNOTSUPP;
+
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return kcm_sendmsg(sock, &msg, size);
+}
+
static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
size_t len, int flags)
{

2023-03-31 16:18:46

by David Howells

Subject: [PATCH v3 35/55] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()

Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() to splice data from
a pipe to a socket. This paves the way for passing in multiple pages at
once from a pipe and the handling of multipage folios.
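
The flag translation the new actor performs can be summarised by the sketch
below: MSG_SPLICE_PAGES is always set, and MSG_MORE is added either when the
caller passed SPLICE_F_MORE or when more pipe data is still to follow. The
DEMO_* constants and demo_splice_msg_flags() are hypothetical; the values are
not the kernel's.

```c
#include <stddef.h>

/* Hypothetical flag values, chosen only for this sketch. */
enum {
	DEMO_SPLICE_F_MORE    = 0x01,
	DEMO_MSG_MORE         = 0x10,
	DEMO_MSG_SPLICE_PAGES = 0x20,
};

/*
 * Compute the message flags for one pipe buffer.
 *   splice_flags   - flags given to splice()
 *   len            - bytes being sent from this buffer
 *   total_len      - bytes remaining in the whole splice request
 *   pipe_occupancy - number of buffers currently queued in the pipe
 */
static unsigned int demo_splice_msg_flags(unsigned int splice_flags,
					  size_t len, size_t total_len,
					  unsigned int pipe_occupancy)
{
	unsigned int flags = DEMO_MSG_SPLICE_PAGES;

	if (splice_flags & DEMO_SPLICE_F_MORE)
		flags |= DEMO_MSG_MORE;

	/* More of this request is still sitting in later pipe buffers. */
	if (len < total_len && pipe_occupancy > 1)
		flags |= DEMO_MSG_MORE;

	return flags;
}
```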

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
fs/splice.c | 42 +++++++++++++++++++++++-------------------
include/linux/fs.h | 2 --
include/linux/splice.h | 2 ++
net/socket.c | 26 ++------------------------
4 files changed, 27 insertions(+), 45 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f46dd1fb367b..23ead122d631 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,7 @@
#include <linux/uio.h>
#include <linux/security.h>
#include <linux/gfp.h>
+#include <linux/net.h>
#include <linux/socket.h>
#include <linux/sched/signal.h>

@@ -410,29 +411,32 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
};
EXPORT_SYMBOL(nosteal_pipe_buf_ops);

+#ifdef CONFIG_NET
/*
* Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
* using sendpage(). Return the number of bytes sent.
*/
-static int pipe_to_sendpage(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf, struct splice_desc *sd)
+static int pipe_to_sendmsg(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf, struct splice_desc *sd)
{
- struct file *file = sd->u.file;
- loff_t pos = sd->pos;
- int more;
-
- if (!likely(file->f_op->sendpage))
- return -EINVAL;
+ struct socket *sock = sock_from_file(sd->u.file);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES,
+ };

- more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
+ if (sd->flags & SPLICE_F_MORE)
+ msg.msg_flags |= MSG_MORE;

if (sd->len < sd->total_len &&
pipe_occupancy(pipe->head, pipe->tail) > 1)
- more |= MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE;

- return file->f_op->sendpage(file, buf->page, buf->offset,
- sd->len, &pos, more);
+ bvec_set_page(&bvec, buf->page, sd->len, buf->offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, sd->len);
+ return sock_sendmsg(sock, &msg);
}
+#endif

static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
@@ -614,7 +618,7 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
* Description:
* This function does little more than loop over the pipe and call
* @actor to do the actual moving of a single struct pipe_buffer to
- * the desired destination. See pipe_to_file, pipe_to_sendpage, or
+ * the desired destination. See pipe_to_file, pipe_to_sendmsg, or
* pipe_to_user.
*
*/
@@ -795,8 +799,9 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,

EXPORT_SYMBOL(iter_file_splice_write);

+#ifdef CONFIG_NET
/**
- * generic_splice_sendpage - splice data from a pipe to a socket
+ * splice_to_socket - splice data from a pipe to a socket
* @pipe: pipe to splice from
* @out: socket to write to
* @ppos: position in @out
@@ -808,13 +813,12 @@ EXPORT_SYMBOL(iter_file_splice_write);
* is involved.
*
*/
-ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
- loff_t *ppos, size_t len, unsigned int flags)
+ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags)
{
- return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendpage);
+ return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendmsg);
}
-
-EXPORT_SYMBOL(generic_splice_sendpage);
+#endif

static int warn_unsupported(struct file *file, const char *op)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..f3ccc243851e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2740,8 +2740,6 @@ extern ssize_t generic_file_splice_read(struct file *, loff_t *,
struct pipe_inode_info *, size_t, unsigned int);
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
struct file *, loff_t *, size_t, unsigned int);
-extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
- struct file *out, loff_t *, size_t len, unsigned int flags);
extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
loff_t *opos, size_t len, unsigned int flags);

diff --git a/include/linux/splice.h b/include/linux/splice.h
index 8f052c3dae95..e6153feda86c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -87,6 +87,8 @@ extern long do_splice(struct file *in, loff_t *off_in,

extern long do_tee(struct file *in, struct file *out, size_t len,
unsigned int flags);
+extern ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags);

/*
* for dynamic pipe sizing
diff --git a/net/socket.c b/net/socket.c
index 0c39ce57d603..3e9bd8261357 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -57,6 +57,7 @@
#include <linux/mm.h>
#include <linux/socket.h>
#include <linux/file.h>
+#include <linux/splice.h>
#include <linux/net.h>
#include <linux/interrupt.h>
#include <linux/thread_info.h>
@@ -126,8 +127,6 @@ static long compat_sock_ioctl(struct file *file,
unsigned int cmd, unsigned long arg);
#endif
static int sock_fasync(int fd, struct file *filp, int on);
-static ssize_t sock_sendpage(struct file *file, struct page *page,
- int offset, size_t size, loff_t *ppos, int more);
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
@@ -162,8 +161,7 @@ static const struct file_operations socket_file_ops = {
.mmap = sock_mmap,
.release = sock_close,
.fasync = sock_fasync,
- .sendpage = sock_sendpage,
- .splice_write = generic_splice_sendpage,
+ .splice_write = splice_to_socket,
.splice_read = sock_splice_read,
.show_fdinfo = sock_show_fdinfo,
};
@@ -1062,26 +1060,6 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(kernel_recvmsg);

-static ssize_t sock_sendpage(struct file *file, struct page *page,
- int offset, size_t size, loff_t *ppos, int more)
-{
- struct socket *sock;
- int flags;
- int ret;
-
- sock = file->private_data;
-
- flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
- /* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
- flags |= more;
-
- ret = kernel_sendpage(sock, page, offset, size, flags);
-
- if (trace_sock_send_length_enabled())
- call_trace_sock_send_length(sock->sk, ret, 0);
- return ret;
-}
-
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)

2023-03-31 16:18:47

by David Howells

Subject: [PATCH v3 38/55] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator. The header and trailer (if present) are copied into
page fragments that can be freed with put_page().
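
As a loose userspace analogy, the same "gather header, data and trailer into
one vector, then make a single call" shape looks like this with writev(). It
says nothing about MSG_SPLICE_PAGES itself; demo_send_hdt() is a hypothetical
helper, and struct iovec plays the role the bio_vec array plays in the patch.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send header, data and trailer with one gathered write instead of three
 * separate calls.  Empty parts are skipped.  Returns what writev() returns:
 * the number of bytes written, or -1 with errno set.
 */
static ssize_t demo_send_hdt(int fd, const void *hdr, size_t hdr_len,
			     const void *data, size_t data_len,
			     const void *trl, size_t trl_len)
{
	struct iovec iov[3];
	int seg = 0;

	if (hdr_len) {
		iov[seg].iov_base = (void *)hdr;	/* writev() only reads it */
		iov[seg].iov_len = hdr_len;
		seg++;
	}
	if (data_len) {
		iov[seg].iov_base = (void *)data;
		iov[seg].iov_len = data_len;
		seg++;
	}
	if (trl_len) {
		iov[seg].iov_base = (void *)trl;
		iov[seg].iov_len = trl_len;
		seg++;
	}
	return writev(fd, iov, seg);
}
```

As with the return-value handling in siw_tx_hdt(), a caller has to cope with a
short write that stops partway through any of the three parts.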

Signed-off-by: David Howells <[email protected]>
cc: Bernard Metzler <[email protected]>
cc: Tom Talpey <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
drivers/infiniband/sw/siw/siw_qp_tx.c | 234 ++++++--------------------
1 file changed, 48 insertions(+), 186 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index fa5de40d85d5..fbe80c06d0ca 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -312,114 +312,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
return rv;
}

-/*
- * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
- *
- * Using sendpage to push page by page appears to be less efficient
- * than using sendmsg, even if data are copied.
- *
- * A general performance limitation might be the extra four bytes
- * trailer checksum segment to be pushed after user data.
- */
-static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
- size_t size)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
- MSG_SPLICE_PAGES),
- };
- struct sock *sk = s->sk;
- int i = 0, rv = 0, sent = 0;
-
- while (size) {
- size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
-
- if (size + offset <= PAGE_SIZE)
- msg.msg_flags = MSG_MORE | MSG_DONTWAIT;
-
- tcp_rate_check_app_limited(sk);
- bvec_set_page(&bvec, page[i], bytes, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
-try_page_again:
- lock_sock(sk);
- rv = tcp_sendmsg_locked(sk, &msg, size);
- release_sock(sk);
-
- if (rv > 0) {
- size -= rv;
- sent += rv;
- if (rv != bytes) {
- offset += rv;
- bytes -= rv;
- goto try_page_again;
- }
- offset = 0;
- } else {
- if (rv == -EAGAIN || rv == 0)
- break;
- return rv;
- }
- i++;
- }
- return sent;
-}
-
-/*
- * siw_0copy_tx()
- *
- * Pushes list of pages to TCP socket. If pages from multiple
- * SGE's, all referenced pages of each SGE are pushed in one
- * shot.
- */
-static int siw_0copy_tx(struct socket *s, struct page **page,
- struct siw_sge *sge, unsigned int offset,
- unsigned int size)
-{
- int i = 0, sent = 0, rv;
- int sge_bytes = min(sge->length - offset, size);
-
- offset = (sge->laddr + offset) & ~PAGE_MASK;
-
- while (sent != size) {
- rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
- if (rv >= 0) {
- sent += rv;
- if (size == sent || sge_bytes > rv)
- break;
-
- i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
- sge++;
- sge_bytes = min(sge->length, size - sent);
- offset = sge->laddr & ~PAGE_MASK;
- } else {
- sent = rv;
- break;
- }
- }
- return sent;
-}
-
#define MAX_TRAILER (MPA_CRC_SIZE + 4)

-static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
-{
- int i;
-
- /*
- * Work backwards through the array to honor the kmap_local_page()
- * ordering requirements.
- */
- for (i = (len-1); i >= 0; i--) {
- if (kmap_mask & BIT(i)) {
- unsigned long addr = (unsigned long)iov[i].iov_base;
-
- kunmap_local((void *)(addr & PAGE_MASK));
- }
- }
-}
-
/*
* siw_tx_hdt() tries to push a complete packet to TCP where all
* packet fragments are referenced by the elements of one iovec.
@@ -439,15 +333,14 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
{
struct siw_wqe *wqe = &c_tx->wqe_active;
struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
- struct kvec iov[MAX_ARRAY];
- struct page *page_array[MAX_ARRAY];
+ struct bio_vec bvec[MAX_ARRAY];
struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+ void *trl, *t;

int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
pbl_idx = c_tx->pbl_idx;
- unsigned long kmap_mask = 0L;

if (c_tx->state == SIW_SEND_HDR) {
if (c_tx->use_sendpage) {
@@ -457,10 +350,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)

c_tx->state = SIW_SEND_DATA;
} else {
- iov[0].iov_base =
- (char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
- iov[0].iov_len = hdr_len =
- c_tx->ctrl_len - c_tx->ctrl_sent;
+ const void *hdr = (const char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
+ void *h;
+
+ rv = -ENOMEM;
+ hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+ h = page_frag_memdup(NULL, hdr, hdr_len, GFP_NOFS, ULONG_MAX);
+ if (!h)
+ goto done;
+ bvec_set_virt(&bvec[0], h, hdr_len);
seg = 1;
}
}
@@ -478,28 +376,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
} else {
is_kva = 1;
}
- if (is_kva && !c_tx->use_sendpage) {
- /*
- * tx from kernel virtual address: either inline data
- * or memory region with assigned kernel buffer
- */
- iov[seg].iov_base =
- (void *)(uintptr_t)(sge->laddr + sge_off);
- iov[seg].iov_len = sge_len;
-
- if (do_crc)
- crypto_shash_update(c_tx->mpa_crc_hd,
- iov[seg].iov_base,
- sge_len);
- sge_off += sge_len;
- data_len -= sge_len;
- seg++;
- goto sge_done;
- }

while (sge_len) {
size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
- void *kaddr;

if (!is_kva) {
struct page *p;
@@ -512,33 +391,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
p = siw_get_upage(mem->umem,
sge->laddr + sge_off);
if (unlikely(!p)) {
- siw_unmap_pages(iov, kmap_mask, seg);
wqe->processed -= c_tx->bytes_unsent;
rv = -EFAULT;
goto done_crc;
}
- page_array[seg] = p;
-
- if (!c_tx->use_sendpage) {
- void *kaddr = kmap_local_page(p);
-
- /* Remember for later kunmap() */
- kmap_mask |= BIT(seg);
- iov[seg].iov_base = kaddr + fp_off;
- iov[seg].iov_len = plen;
-
- if (do_crc)
- crypto_shash_update(
- c_tx->mpa_crc_hd,
- iov[seg].iov_base,
- plen);
- } else if (do_crc) {
- kaddr = kmap_local_page(p);
- crypto_shash_update(c_tx->mpa_crc_hd,
- kaddr + fp_off,
- plen);
- kunmap_local(kaddr);
- }
+
+ bvec_set_page(&bvec[seg], p, plen, fp_off);
} else {
/*
* Cast to an uintptr_t to preserve all 64 bits
@@ -552,12 +410,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
* bits on a 64 bit platform and 32 bits on a
* 32 bit platform.
*/
- page_array[seg] = virt_to_page((void *)(va & PAGE_MASK));
- if (do_crc)
- crypto_shash_update(
- c_tx->mpa_crc_hd,
- (void *)va,
- plen);
+ bvec_set_virt(&bvec[seg], (void *)va, plen);
+ }
+
+ if (do_crc) {
+ void *kaddr = kmap_local_page(bvec[seg].bv_page);
+ crypto_shash_update(c_tx->mpa_crc_hd,
+ kaddr + bvec[seg].bv_offset,
+ bvec[seg].bv_len);
+ kunmap_local(kaddr);
}

sge_len -= plen;
@@ -567,13 +428,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)

if (++seg > (int)MAX_ARRAY) {
siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
- siw_unmap_pages(iov, kmap_mask, seg-1);
wqe->processed -= c_tx->bytes_unsent;
rv = -EMSGSIZE;
goto done_crc;
}
}
-sge_done:
+
/* Update SGE variables at end of SGE */
if (sge_off == sge->length &&
(data_len != 0 || wqe->processed < wqe->bytes)) {
@@ -582,15 +442,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
sge_off = 0;
}
}
- /* trailer */
- if (likely(c_tx->state != SIW_SEND_TRAILER)) {
- iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
- iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
- } else {
- iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
- iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
- }

+ /* Set the CRC in the trailer */
if (c_tx->pad) {
*(u32 *)c_tx->trailer.pad = 0;
if (do_crc)
@@ -603,23 +456,29 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
else if (do_crc)
crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);

- data_len = c_tx->bytes_unsent;
-
- if (c_tx->use_sendpage) {
- rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
- c_tx->sge_off, data_len);
- if (rv == data_len) {
- rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
- if (rv > 0)
- rv += data_len;
- else
- rv = data_len;
- }
+ /* Copy the trailer and add it to the output list */
+ if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+ trl = &c_tx->trailer.pad[4 - c_tx->pad];
+ trl_len = MAX_TRAILER - (4 - c_tx->pad);
} else {
- rv = kernel_sendmsg(s, &msg, iov, seg + 1,
- hdr_len + data_len + trl_len);
- siw_unmap_pages(iov, kmap_mask, seg);
+ trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
+ trl_len = MAX_TRAILER - c_tx->ctrl_sent;
}
+
+ rv = -ENOMEM;
+ t = page_frag_memdup(NULL, trl, trl_len, GFP_NOFS, ULONG_MAX);
+ if (!t)
+ goto done_crc;
+ bvec_set_virt(&bvec[seg], t, trl_len);
+
+ data_len = c_tx->bytes_unsent;
+
+ if (c_tx->use_sendpage)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
+ hdr_len + data_len + trl_len);
+ rv = sock_sendmsg(s, &msg);
+
if (rv < (int)hdr_len) {
/* Not even complete hdr pushed or negative rv */
wqe->processed -= data_len;
@@ -680,6 +539,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
}
done_crc:
c_tx->do_crc = 0;
+ if (c_tx->state == SIW_SEND_HDR)
+ folio_put(page_folio(bvec[0].bv_page));
+ folio_put(page_folio(bvec[seg].bv_page));
done:
return rv;
}

2023-03-31 16:19:03

by David Howells

Subject: [PATCH v3 43/55] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock(). This causes pages to be spliced from the source iterator
if possible (the iterator must be ITER_BVEC and the pages must be
spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front also.

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/core/skbuff.c | 49 ++++++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0506e4cf1ed9..3693b3526d33 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2919,32 +2919,32 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
}
EXPORT_SYMBOL_GPL(skb_splice_bits);

-static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size)
+static int sendmsg_locked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;
+ size_t size = msg_data_left(msg);

if (!sock)
return -EINVAL;
- return kernel_sendmsg(sock, msg, vec, num, size);
+
+ if (!sock->ops->sendmsg_locked)
+ return sock_no_sendmsg_locked(sk, msg, size);
+
+ return sock->ops->sendmsg_locked(sk, msg, size);
}

-static int sendpage_unlocked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
+static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg)
{
struct socket *sock = sk->sk_socket;

if (!sock)
return -EINVAL;
- return kernel_sendpage(sock, page, offset, size, flags);
+ return sock_sendmsg(sock, msg);
}

-typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg,
- struct kvec *vec, size_t num, size_t size);
-typedef int (*sendpage_func)(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
+typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg);
static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
- int len, sendmsg_func sendmsg, sendpage_func sendpage)
+ int len, sendmsg_func sendmsg)
{
unsigned int orig_len = len;
struct sk_buff *head = skb;
@@ -2964,8 +2964,9 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
memset(&msg, 0, sizeof(msg));
msg.msg_flags = MSG_DONTWAIT;

- ret = INDIRECT_CALL_2(sendmsg, kernel_sendmsg_locked,
- sendmsg_unlocked, sk, &msg, &kv, 1, slen);
+ iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &kv, 1, slen);
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;

@@ -2996,11 +2997,17 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
slen = min_t(size_t, len, skb_frag_size(frag) - offset);

while (slen) {
- ret = INDIRECT_CALL_2(sendpage, kernel_sendpage_locked,
- sendpage_unlocked, sk,
- skb_frag_page(frag),
- skb_frag_off(frag) + offset,
- slen, MSG_DONTWAIT);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
+ };
+
+ bvec_set_page(&bvec, skb_frag_page(frag), slen,
+ skb_frag_off(frag) + offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, slen);
+
+ ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+ sendmsg_unlocked, sk, &msg);
if (ret <= 0)
goto error;

@@ -3037,16 +3044,14 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
int len)
{
- return __skb_send_sock(sk, skb, offset, len, kernel_sendmsg_locked,
- kernel_sendpage_locked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_locked);
}
EXPORT_SYMBOL_GPL(skb_send_sock_locked);

/* Send skb data on a socket. Socket must be unlocked. */
int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
{
- return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked,
- sendpage_unlocked);
+ return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked);
}

/**

2023-03-31 16:19:29

by David Howells

Subject: [PATCH v3 36/55] splice, net: Reimplement splice_to_socket() to pass multiple bufs to sendmsg()

Reimplement splice_to_socket() so that it can pass multiple pipe buffer
pages to sendmsg() in a single go.
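
To make the bookkeeping after a send concrete, the userspace sketch below
models how a possibly partial send can be accounted against a power-of-two
ring of buffers: bytes are consumed from the tail, and the tail only moves
past buffers that have been fully drained. `struct ring_buf` and
`ring_consume()` are hypothetical names for illustration only; the real code
works on `struct pipe_buffer` under the pipe lock and also releases drained
buffers with `pipe_buf_release()`. The `head` bound is an extra safety check
added for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pipe_buffer. */
struct ring_buf {
	size_t offset;	/* start of unsent data within the page */
	size_t len;	/* bytes still to be sent */
};

/*
 * Account 'sent' bytes against the ring, starting at 'tail'.
 * 'mask' is ring_size - 1, so ring_size must be a power of two.
 * Returns the new tail index.
 */
static unsigned int ring_consume(struct ring_buf *bufs, unsigned int mask,
				 unsigned int head, unsigned int tail,
				 size_t sent)
{
	/* mask + 1 must be a power of two for 'tail & mask' to wrap. */
	assert((mask & (mask + 1)) == 0);

	while (sent > 0 && tail != head) {
		struct ring_buf *buf = &bufs[tail & mask];
		size_t seg = sent < buf->len ? sent : buf->len;

		buf->offset += seg;
		buf->len -= seg;
		sent -= seg;

		/* Only step past a buffer once it is completely drained. */
		if (!buf->len)
			tail++;
	}
	return tail;
}
```

A buffer that was only partly sent keeps its slot, with `offset` and `len`
adjusted, so the next pass resumes from the first unsent byte.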

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
fs/splice.c | 148 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 120 insertions(+), 28 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 23ead122d631..d5bc28b59720 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -411,33 +411,6 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
};
EXPORT_SYMBOL(nosteal_pipe_buf_ops);

-#ifdef CONFIG_NET
-/*
- * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
- * using sendpage(). Return the number of bytes sent.
- */
-static int pipe_to_sendmsg(struct pipe_inode_info *pipe,
- struct pipe_buffer *buf, struct splice_desc *sd)
-{
- struct socket *sock = sock_from_file(sd->u.file);
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = MSG_SPLICE_PAGES,
- };
-
- if (sd->flags & SPLICE_F_MORE)
- msg.msg_flags |= MSG_MORE;
-
- if (sd->len < sd->total_len &&
- pipe_occupancy(pipe->head, pipe->tail) > 1)
- msg.msg_flags |= MSG_MORE;
-
- bvec_set_page(&bvec, buf->page, sd->len, buf->offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, sd->len);
- return sock_sendmsg(sock, &msg);
-}
-#endif
-
static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
smp_mb();
@@ -816,7 +789,126 @@ EXPORT_SYMBOL(iter_file_splice_write);
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
loff_t *ppos, size_t len, unsigned int flags)
{
- return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendmsg);
+ struct socket *sock = sock_from_file(out);
+ struct bio_vec bvec[16];
+ struct msghdr msg = {};
+ ssize_t ret;
+ size_t spliced = 0;
+ bool need_wakeup = false;
+
+ pipe_lock(pipe);
+
+ while (len > 0) {
+ unsigned int head, tail, mask, bc = 0;
+ size_t remain = len;
+
+ /*
+ * Check for signal early to make process killable when there
+ * are always buffers available
+ */
+ ret = -ERESTARTSYS;
+ if (signal_pending(current))
+ break;
+
+ while (pipe_empty(pipe->head, pipe->tail)) {
+ ret = 0;
+ if (!pipe->writers)
+ goto out;
+
+ if (spliced)
+ goto out;
+
+ ret = -EAGAIN;
+ if (flags & SPLICE_F_NONBLOCK)
+ goto out;
+
+ ret = -ERESTARTSYS;
+ if (signal_pending(current))
+ goto out;
+
+ if (need_wakeup) {
+ wakeup_pipe_writers(pipe);
+ need_wakeup = false;
+ }
+
+ pipe_wait_readable(pipe);
+ }
+
+ head = pipe->head;
+ tail = pipe->tail;
+ mask = pipe->ring_size - 1;
+
+ while (!pipe_empty(head, tail)) {
+ struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+ size_t seg;
+
+ if (!buf->len) {
+ tail++;
+ continue;
+ }
+
+ seg = min_t(size_t, remain, buf->len);
+ seg = min_t(size_t, seg, PAGE_SIZE);
+
+ ret = pipe_buf_confirm(pipe, buf);
+ if (unlikely(ret)) {
+ if (ret == -ENODATA)
+ ret = 0;
+ break;
+ }
+
+ bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
+ remain -= seg;
+ if (seg >= buf->len)
+ tail++;
+ if (bc >= ARRAY_SIZE(bvec))
+ break;
+ }
+
+ if (!bc)
+ break;
+
+ msg.msg_flags = 0;
+ if (flags & SPLICE_F_MORE)
+ msg.msg_flags = MSG_MORE;
+ if (remain && pipe_occupancy(pipe->head, tail) > 0)
+ msg.msg_flags = MSG_MORE;
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain);
+ ret = sock_sendmsg(sock, &msg);
+ if (ret <= 0)
+ break;
+
+ spliced += ret;
+ len -= ret;
+ tail = pipe->tail;
+ while (ret > 0) {
+ struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+ size_t seg = min_t(size_t, ret, buf->len);
+
+ buf->offset += seg;
+ buf->len -= seg;
+ ret -= seg;
+
+ if (!buf->len) {
+ pipe_buf_release(pipe, buf);
+ tail++;
+ }
+ }
+
+ if (tail != pipe->tail) {
+ pipe->tail = tail;
+ if (pipe->files)
+ need_wakeup = true;
+ }
+ }
+
+out:
+ pipe_unlock(pipe);
+ if (need_wakeup)
+ wakeup_pipe_writers(pipe);
+ return spliced ?: ret;
}
#endif


2023-03-31 16:19:47

by David Howells

Subject: [PATCH v3 39/55] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.

Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v1.c | 58 ++++++++++++++---------------------------
1 file changed, 19 insertions(+), 39 deletions(-)

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index d664cb1593a7..b2d801a49122 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -74,37 +74,6 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
return r;
}

-/*
- * @more: either or both of MSG_MORE and MSG_SENDPAGE_NOTLAST
- */
-static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int more)
-{
- ssize_t (*sendpage)(struct socket *sock, struct page *page,
- int offset, size_t size, int flags);
- int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
- int ret;
-
- /*
- * sendpage cannot properly handle pages with page_count == 0,
- * we need to fall back to sendmsg if that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag which
- * triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(page))
- sendpage = sock->ops->sendpage;
- else
- sendpage = sock_no_sendpage;
-
- ret = sendpage(sock, page, offset, size, flags);
- if (ret == -EAGAIN)
- ret = 0;
-
- return ret;
-}
-
static void con_out_kvec_reset(struct ceph_connection *con)
{
BUG_ON(con->v1.out_skip);
@@ -464,7 +433,6 @@ static int write_partial_message_data(struct ceph_connection *con)
struct ceph_msg *msg = con->out_msg;
struct ceph_msg_data_cursor *cursor = &msg->cursor;
bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
u32 crc;

dout("%s %p msg %p\n", __func__, con, msg);
@@ -482,6 +450,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
while (cursor->total_resid) {
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST,
+ };
struct page *page;
size_t page_offset;
size_t length;
@@ -494,9 +466,12 @@ static int write_partial_message_data(struct ceph_connection *con)

page = ceph_msg_data_next(cursor, &page_offset, &length);
if (length == cursor->total_resid)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
- more);
+ msghdr.msg_flags |= MSG_MORE;
+
+ bvec_set_page(&bvec, page, length, page_offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, length);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0) {
if (do_datacrc)
msg->footer.data_crc = cpu_to_le32(crc);
@@ -526,7 +501,10 @@ static int write_partial_message_data(struct ceph_connection *con)
*/
static int write_partial_skip(struct ceph_connection *con)
{
- int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ struct bio_vec bvec;
+ struct msghdr msghdr = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST | MSG_MORE,
+ };
int ret;

dout("%s %p %d left\n", __func__, con, con->v1.out_skip);
@@ -534,9 +512,11 @@ static int write_partial_skip(struct ceph_connection *con)
size_t size = min(con->v1.out_skip, (int)PAGE_SIZE);

if (size == con->v1.out_skip)
- more = MSG_MORE;
- ret = ceph_tcp_sendpage(con->sock, ceph_zero_page, 0, size,
- more);
+ msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
+ bvec_set_page(&bvec, ZERO_PAGE(0), size, 0);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+ ret = sock_sendmsg(con->sock, &msghdr);
if (ret <= 0)
goto out;
con->v1.out_skip -= ret;

2023-03-31 16:20:01

by David Howells

Subject: [PATCH v3 44/55] algif: Remove hash_sendpage*()

Remove hash_sendpage() and hash_sendpage_nokey(), and drop their .sendpage
entries from the algif_hash proto_ops tables.

Signed-off-by: David Howells <[email protected]>
cc: Herbert Xu <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
crypto/algif_hash.c | 66 ---------------------------------------------
1 file changed, 66 deletions(-)

diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index b89c2c50cecc..dc6c45637b2d 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -162,58 +162,6 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
goto unlock;
}

-static ssize_t hash_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- struct alg_sock *ask = alg_sk(sk);
- struct hash_ctx *ctx = ask->private;
- int err;
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- lock_sock(sk);
- sg_init_table(ctx->sgl.sgl, 1);
- sg_set_page(ctx->sgl.sgl, page, size, offset);
-
- if (!(flags & MSG_MORE)) {
- err = hash_alloc_result(sk, ctx);
- if (err)
- goto unlock;
- } else if (!ctx->more)
- hash_free_result(sk, ctx);
-
- ahash_request_set_crypt(&ctx->req, ctx->sgl.sgl, ctx->result, size);
-
- if (!(flags & MSG_MORE)) {
- if (ctx->more)
- err = crypto_ahash_finup(&ctx->req);
- else
- err = crypto_ahash_digest(&ctx->req);
- } else {
- if (!ctx->more) {
- err = crypto_ahash_init(&ctx->req);
- err = crypto_wait_req(err, &ctx->wait);
- if (err)
- goto unlock;
- }
-
- err = crypto_ahash_update(&ctx->req);
- }
-
- err = crypto_wait_req(err, &ctx->wait);
- if (err)
- goto unlock;
-
- ctx->more = flags & MSG_MORE;
-
-unlock:
- release_sock(sk);
-
- return err ?: size;
-}
-
static int hash_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int flags)
{
@@ -318,7 +266,6 @@ static struct proto_ops algif_hash_ops = {

.release = af_alg_release,
.sendmsg = hash_sendmsg,
- .sendpage = hash_sendpage,
.recvmsg = hash_recvmsg,
.accept = hash_accept,
};
@@ -370,18 +317,6 @@ static int hash_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return hash_sendmsg(sock, msg, size);
}

-static ssize_t hash_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = hash_check_key(sock);
- if (err)
- return err;
-
- return hash_sendpage(sock, page, offset, size, flags);
-}
-
static int hash_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -420,7 +355,6 @@ static struct proto_ops algif_hash_ops_nokey = {

.release = af_alg_release,
.sendmsg = hash_sendmsg_nokey,
- .sendpage = hash_sendpage_nokey,
.recvmsg = hash_recvmsg_nokey,
.accept = hash_accept_nokey,
};

2023-03-31 16:20:04

by David Howells

Subject: [PATCH v3 46/55] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header and data
pages.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator. The header is copied into memory acquired from
page_frag_memdup(), which just breaks a page up into small pieces that can
be freed with put_page().
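
The assembly step can be modelled in userspace roughly as follows. This is
only a sketch under stated assumptions: `struct seg` and `build_xmit_vec()`
are hypothetical names, the header is represented by a plain pointer rather
than a copied page fragment, and the caller is assumed to supply an output
array with room for one header entry plus every remaining data entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a scatterlist entry or bio_vec. */
struct seg {
	const void *base;
	size_t off;
	size_t len;
};

/*
 * Build one vector holding the unsent part of the header followed by the
 * data entries from index 'first' onwards.  'off' is the number of bytes
 * of sg[first] already sent.  'out' must have room for
 * 1 + (nents - first) entries.  Returns the entry count and stores the
 * total byte count in '*size'.
 */
static size_t build_xmit_vec(const void *hdr, size_t hdr_size, size_t hdr_off,
			     const struct seg *sg, size_t nents,
			     size_t first, size_t off,
			     struct seg *out, size_t *size)
{
	size_t ix = 0;

	assert(first <= nents);
	assert(first == nents || off <= sg[first].len);
	*size = 0;

	/* Header bytes not yet transmitted, if any, go first. */
	if (hdr_off < hdr_size) {
		out[ix].base = hdr;
		out[ix].off = hdr_off;
		out[ix].len = hdr_size - hdr_off;
		*size += out[ix].len;
		ix++;
	}

	for (size_t i = first; i < nents; i++) {
		out[ix].base = sg[i].base;
		out[ix].off = sg[i].off + off;
		out[ix].len = sg[i].len - off;
		off = 0;	/* only the first entry is partially sent */
		*size += out[ix].len;
		ix++;
	}
	return ix;
}
```

The resulting count and size correspond to the segment count and byte count
that would be passed to the iterator before the single send.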

Signed-off-by: David Howells <[email protected]>
cc: Santosh Shilimkar <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
net/rds/tcp_send.c | 86 +++++++++++++++++++++-------------------------
1 file changed, 40 insertions(+), 46 deletions(-)

diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 8c4d1d6e9249..660d9f203d99 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -52,29 +52,24 @@ void rds_tcp_xmit_path_complete(struct rds_conn_path *cp)
tcp_sock_set_cork(tc->t_sock->sk, false);
}

-/* the core send_sem serializes this with other xmit and shutdown */
-static int rds_tcp_sendmsg(struct socket *sock, void *data, unsigned int len)
-{
- struct kvec vec = {
- .iov_base = data,
- .iov_len = len,
- };
- struct msghdr msg = {
- .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
- };
-
- return kernel_sendmsg(sock, &msg, &vec, 1, vec.iov_len);
-}
-
/* the core send_sem serializes this with other xmit and shutdown */
int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off)
{
struct rds_conn_path *cp = rm->m_inc.i_conn_path;
struct rds_tcp_connection *tc = cp->cp_transport_data;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+ };
+ struct bio_vec *bvec;
+ unsigned int i, size = 0, ix = 0;
+ bool free_hdr = false;
int done = 0;
- int ret = 0;
- int more;
+ int ret = -ENOMEM;
+
+ bvec = kmalloc_array(1 + sg, sizeof(struct bio_vec), GFP_KERNEL);
+ if (!bvec)
+ goto out;

if (hdr_off == 0) {
/*
@@ -99,43 +94,37 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,

if (hdr_off < sizeof(struct rds_header)) {
/* see rds_tcp_write_space() */
+ void *p;
+
set_bit(SOCK_NOSPACE, &tc->t_sock->sk->sk_socket->flags);

- ret = rds_tcp_sendmsg(tc->t_sock,
- (void *)&rm->m_inc.i_hdr + hdr_off,
- sizeof(rm->m_inc.i_hdr) - hdr_off);
- if (ret < 0)
- goto out;
- done += ret;
- if (hdr_off + done != sizeof(struct rds_header))
+ ret = -ENOMEM;
+ p = page_frag_memdup(NULL,
+ (void *)&rm->m_inc.i_hdr + hdr_off,
+ sizeof(rm->m_inc.i_hdr) - hdr_off,
+ GFP_KERNEL, ULONG_MAX);
+ if (!p)
goto out;
+ bvec_set_virt(&bvec[ix], p, sizeof(rm->m_inc.i_hdr) - hdr_off);
+ free_hdr = true;
+ size += bvec[ix].bv_len;
+ ix++;
}

- more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0;
- while (sg < rm->data.op_nents) {
- int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
-
- ret = tc->t_sock->ops->sendpage(tc->t_sock,
- sg_page(&rm->data.op_sg[sg]),
- rm->data.op_sg[sg].offset + off,
- rm->data.op_sg[sg].length - off,
- flags);
- rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
- rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
- ret);
- if (ret <= 0)
- break;
-
- off += ret;
- done += ret;
- if (off == rm->data.op_sg[sg].length) {
- off = 0;
- sg++;
- }
- if (sg == rm->data.op_nents - 1)
- more = 0;
+ for (i = sg; i < rm->data.op_nents; i++) {
+ bvec_set_page(&bvec[ix],
+ sg_page(&rm->data.op_sg[i]),
+ rm->data.op_sg[i].length - off,
+ rm->data.op_sg[i].offset + off);
+ off = 0;
+ size += bvec[ix].bv_len;
+ ix++;
}

+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, ix, size);
+ ret = sock_sendmsg(tc->t_sock, &msg);
+ rdsdebug("tcp sendmsg-splice %u,%u ret %d\n", ix, size, ret);
+
out:
if (ret <= 0) {
/* write_space will hit after EAGAIN, all else fatal */
@@ -158,6 +147,11 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
}
if (done == 0)
done = ret;
+ if (bvec) {
+ if (free_hdr)
+ put_page(bvec[0].bv_page);
+ kfree(bvec);
+ }
return done;
}


2023-03-31 16:20:08

by David Howells

Subject: [PATCH v3 48/55] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down into TCP using sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing sendpage calls to transmit header, data pages and trailer.

Signed-off-by: David Howells <[email protected]>
Acked-by: Chuck Lever <[email protected]>
cc: Trond Myklebust <[email protected]>
cc: Anna Schumaker <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
include/linux/sunrpc/svc.h | 11 +++++------
net/sunrpc/svcsock.c | 38 ++++++++++++--------------------------
2 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 877891536c2f..456ae554aa11 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -161,16 +161,15 @@ static inline bool svc_put_not_last(struct svc_serv *serv)
extern u32 svc_max_payload(const struct svc_rqst *rqstp);

/*
- * RPC Requsts and replies are stored in one or more pages.
+ * RPC Requests and replies are stored in one or more pages.
* We maintain an array of pages for each server thread.
* Requests are copied into these pages as they arrive. Remaining
* pages are available to write the reply into.
*
- * Pages are sent using ->sendpage so each server thread needs to
- * allocate more to replace those used in sending. To help keep track
- * of these pages we have a receive list where all pages initialy live,
- * and a send list where pages are moved to when there are to be part
- * of a reply.
+ * Pages are sent using ->sendmsg with MSG_SPLICE_PAGES so each server thread
+ * needs to allocate more to replace those used in sending. To help keep track
+ * of these pages we have a receive list where all pages initialy live, and a
+ * send list where pages are moved to when there are to be part of a reply.
*
* We use xdr_buf for holding responses as it fits well with NFS
* read responses (that have a header, and some data pages, and possibly
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 03a4f5615086..3a015abac5bd 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1063,13 +1063,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
int flags)
{
- return kernel_sendpage(sock, virt_to_page(vec->iov_base),
- offset_in_page(vec->iov_base),
- vec->iov_len, flags);
+ struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | flags, };
+
+ iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, vec, 1, vec->iov_len);
+ return sock_sendmsg(sock, &msg);
}

/*
- * kernel_sendpage() is used exclusively to reduce the number of
+ * MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
* that the pages backing @xdr are unchanging.
*
@@ -1109,28 +1110,13 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
if (ret != head->iov_len)
goto out;

- if (xdr->page_len) {
- unsigned int offset, len, remaining;
- struct bio_vec *bvec;
-
- bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
- offset = offset_in_page(xdr->page_base);
- remaining = xdr->page_len;
- while (remaining > 0) {
- len = min(remaining, bvec->bv_len - offset);
- ret = kernel_sendpage(sock, bvec->bv_page,
- bvec->bv_offset + offset,
- len, 0);
- if (ret < 0)
- return ret;
- *sentp += ret;
- if (ret != len)
- goto out;
- remaining -= len;
- offset = 0;
- bvec++;
- }
- }
+ msg.msg_flags = MSG_SPLICE_PAGES;
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec,
+ xdr_buf_pagecount(xdr), xdr->page_len);
+ ret = sock_sendmsg(sock, &msg);
+ if (ret < 0)
+ return ret;
+ *sentp += ret;

if (tail->iov_len) {
ret = svc_tcp_send_kvec(sock, tail, 0);

2023-03-31 16:21:03

by David Howells

Subject: [PATCH v3 47/55] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage. This allows ->sendpage() to be replaced by something that
can handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Christine Caulfield <[email protected]>
cc: David Teigland <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
fs/dlm/lowcomms.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index a9b14f81d655..9c0c691b6106 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1394,8 +1394,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
/* Send a message */
static int send_to_sock(struct connection *con)
{
- const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
struct writequeue_entry *e;
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+ };
int len, offset, ret;

spin_lock_bh(&con->writequeue_lock);
@@ -1411,8 +1414,9 @@ static int send_to_sock(struct connection *con)
WARN_ON_ONCE(len == 0 && e->users == 0);
spin_unlock_bh(&con->writequeue_lock);

- ret = kernel_sendpage(con->sock, e->page, offset, len,
- msg_flags);
+ bvec_set_page(&bvec, e->page, len, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(con->sock, &msg);
trace_dlm_send(con->nodeid, ret);
if (ret == -EAGAIN || ret == 0) {
lock_sock(con->sock->sk);

2023-03-31 16:21:36

by David Howells

Subject: [PATCH v3 50/55] kcm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

Signed-off-by: David Howells <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Tom Herbert <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/kcm/kcmsock.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d77d28fbf389..9c9d379aafb1 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -641,6 +641,10 @@ static int kcm_write_msgs(struct kcm_sock *kcm)

for (fragidx = 0; fragidx < skb_shinfo(skb)->nr_frags;
fragidx++) {
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
+ };
skb_frag_t *frag;

frag_offset = 0;
@@ -651,11 +655,12 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
goto out;
}

- ret = kernel_sendpage(psock->sk->sk_socket,
- skb_frag_page(frag),
- skb_frag_off(frag) + frag_offset,
- skb_frag_size(frag) - frag_offset,
- MSG_DONTWAIT);
+ bvec_set_page(&bvec,
+ skb_frag_page(frag),
+ skb_frag_size(frag) - frag_offset,
+ skb_frag_off(frag) + frag_offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, bvec.bv_len);
+ ret = sock_sendmsg(psock->sk->sk_socket, &msg);
if (ret <= 0) {
if (ret == -EAGAIN) {
/* Save state to try again when there's

2023-03-31 16:29:35

by David Howells

Subject: [PATCH v3 52/55] ocfs2: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Fix ocfs2 to use the page fragment allocator rather than kzalloc in order
to allocate the buffers for the handshake message and keepalive request and
reply messages. Slab pages should not be given to sendpage, but fragments
can be.

Switch from using sendpage() to using sendmsg() + MSG_SPLICE_PAGES so that
sendpage can be phased out.
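
The rule about slab pages can be pictured with a small userspace model. The
kernel's `sendpage_ok()` helper applies a test of this shape (not a slab
page, and a page count of at least one); `struct fake_page` and
`can_splice_page()` below are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page properties that matter here. */
struct fake_page {
	bool slab;	/* page belongs to the slab allocator */
	int refcount;	/* models page_count() */
};

/*
 * A page can be attached to a socket buffer by reference only if it is
 * not slab memory and is reference counted; otherwise the data would
 * have to be copied instead.
 */
static bool can_splice_page(const struct fake_page *page)
{
	assert(page);
	return !page->slab && page->refcount >= 1;
}
```

Memory from the page fragment allocator passes this test, which is why the
buffers are moved away from kzalloc().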

Signed-off-by: David Howells <[email protected]>

cc: Mark Fasheh <[email protected]>
cc: Joel Becker <[email protected]>
cc: Joseph Qi <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
---
fs/ocfs2/cluster/tcp.c | 107 ++++++++++++++++++++++-------------------
1 file changed, 58 insertions(+), 49 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index aecbd712a00c..e568ad2f34bf 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -110,9 +110,6 @@ static struct work_struct o2net_listen_work;
static struct o2hb_callback_func o2net_hb_up, o2net_hb_down;
#define O2NET_HB_PRI 0x1

-static struct o2net_handshake *o2net_hand;
-static struct o2net_msg *o2net_keep_req, *o2net_keep_resp;
-
static int o2net_sys_err_translations[O2NET_ERR_MAX] =
{[O2NET_ERR_NONE] = 0,
[O2NET_ERR_NO_HNDLR] = -ENOPROTOOPT,
@@ -930,19 +927,22 @@ static int o2net_send_tcp_msg(struct socket *sock, struct kvec *vec,
}

static void o2net_sendpage(struct o2net_sock_container *sc,
- void *kmalloced_virt,
- size_t size)
+ void *virt, size_t size)
{
struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+ struct msghdr msg = {};
+ struct bio_vec bv;
ssize_t ret;

+ bvec_set_virt(&bv, virt, size);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, size);
+
while (1) {
+ msg.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES;
mutex_lock(&sc->sc_send_lock);
- ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
- virt_to_page(kmalloced_virt),
- offset_in_page(kmalloced_virt),
- size, MSG_DONTWAIT);
+ ret = sock_sendmsg(sc->sc_sock, &msg);
mutex_unlock(&sc->sc_send_lock);
+
if (ret == size)
break;
if (ret == (ssize_t)-EAGAIN) {
@@ -1168,6 +1168,7 @@ static int o2net_process_message(struct o2net_sock_container *sc,
struct o2net_msg *hdr)
{
struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+ struct o2net_msg *keep_resp;
int ret = 0, handler_status;
enum o2net_system_error syserr;
struct o2net_msg_handler *nmh = NULL;
@@ -1186,8 +1187,16 @@ static int o2net_process_message(struct o2net_sock_container *sc,
be32_to_cpu(hdr->status));
goto out;
case O2NET_MSG_KEEP_REQ_MAGIC:
- o2net_sendpage(sc, o2net_keep_resp,
- sizeof(*o2net_keep_resp));
+ keep_resp = page_frag_alloc(NULL, sizeof(*keep_resp),
+ GFP_KERNEL);
+ if (!keep_resp) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memset(keep_resp, 0, sizeof(*keep_resp));
+ keep_resp->magic = cpu_to_be16(O2NET_MSG_KEEP_RESP_MAGIC);
+ o2net_sendpage(sc, keep_resp, sizeof(*keep_resp));
+ folio_put(virt_to_folio(keep_resp));
goto out;
case O2NET_MSG_KEEP_RESP_MAGIC:
goto out;
@@ -1439,15 +1448,22 @@ static void o2net_rx_until_empty(struct work_struct *work)
sc_put(sc);
}

-static void o2net_initialize_handshake(void)
+static struct o2net_handshake *o2net_initialize_handshake(void)
{
- o2net_hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(
- O2HB_MAX_WRITE_TIMEOUT_MS);
- o2net_hand->o2net_idle_timeout_ms = cpu_to_be32(o2net_idle_timeout());
- o2net_hand->o2net_keepalive_delay_ms = cpu_to_be32(
- o2net_keepalive_delay());
- o2net_hand->o2net_reconnect_delay_ms = cpu_to_be32(
- o2net_reconnect_delay());
+ struct o2net_handshake *hand;
+
+ hand = page_frag_alloc(NULL, sizeof(*hand), GFP_KERNEL);
+ if (!hand)
+ return NULL;
+
+ memset(hand, 0, sizeof(*hand));
+ hand->protocol_version = cpu_to_be64(O2NET_PROTOCOL_VERSION);
+ hand->connector_id = cpu_to_be64(1);
+ hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(O2HB_MAX_WRITE_TIMEOUT_MS);
+ hand->o2net_idle_timeout_ms = cpu_to_be32(o2net_idle_timeout());
+ hand->o2net_keepalive_delay_ms = cpu_to_be32(o2net_keepalive_delay());
+ hand->o2net_reconnect_delay_ms = cpu_to_be32(o2net_reconnect_delay());
+ return hand;
}

/* ------------------------------------------------------------ */
@@ -1456,16 +1472,22 @@ static void o2net_initialize_handshake(void)
* rx path will see the response and mark the sc valid */
static void o2net_sc_connect_completed(struct work_struct *work)
{
+ struct o2net_handshake *hand;
struct o2net_sock_container *sc =
container_of(work, struct o2net_sock_container,
sc_connect_work);

+ hand = o2net_initialize_handshake();
+ if (!hand)
+ goto out;
+
mlog(ML_MSG, "sc sending handshake with ver %llu id %llx\n",
(unsigned long long)O2NET_PROTOCOL_VERSION,
- (unsigned long long)be64_to_cpu(o2net_hand->connector_id));
+ (unsigned long long)be64_to_cpu(hand->connector_id));

- o2net_initialize_handshake();
- o2net_sendpage(sc, o2net_hand, sizeof(*o2net_hand));
+ o2net_sendpage(sc, hand, sizeof(*hand));
+ folio_put(virt_to_folio(hand));
+out:
sc_put(sc);
}

@@ -1475,8 +1497,15 @@ static void o2net_sc_send_keep_req(struct work_struct *work)
struct o2net_sock_container *sc =
container_of(work, struct o2net_sock_container,
sc_keepalive_work.work);
+ struct o2net_msg *keep_req;

- o2net_sendpage(sc, o2net_keep_req, sizeof(*o2net_keep_req));
+ keep_req = page_frag_alloc(NULL, sizeof(*keep_req), GFP_KERNEL);
+ if (keep_req) {
+ memset(keep_req, 0, sizeof(*keep_req));
+ keep_req->magic = cpu_to_be16(O2NET_MSG_KEEP_REQ_MAGIC);
+ o2net_sendpage(sc, keep_req, sizeof(*keep_req));
+ folio_put(virt_to_folio(keep_req));
+ }
sc_put(sc);
}

@@ -1780,6 +1809,7 @@ static int o2net_accept_one(struct socket *sock, int *more)
struct socket *new_sock = NULL;
struct o2nm_node *node = NULL;
struct o2nm_node *local_node = NULL;
+ struct o2net_handshake *hand;
struct o2net_sock_container *sc = NULL;
struct o2net_node *nn;
unsigned int nofs_flag;
@@ -1882,8 +1912,11 @@ static int o2net_accept_one(struct socket *sock, int *more)
o2net_register_callbacks(sc->sc_sock->sk, sc);
o2net_sc_queue_work(sc, &sc->sc_rx_work);

- o2net_initialize_handshake();
- o2net_sendpage(sc, o2net_hand, sizeof(*o2net_hand));
+ hand = o2net_initialize_handshake();
+ if (hand) {
+ o2net_sendpage(sc, hand, sizeof(*hand));
+ folio_put(virt_to_folio(hand));
+ }

out:
if (new_sock)
@@ -2090,21 +2123,8 @@ int o2net_init(void)
unsigned long i;

o2quo_init();
-
o2net_debugfs_init();

- o2net_hand = kzalloc(sizeof(struct o2net_handshake), GFP_KERNEL);
- o2net_keep_req = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL);
- o2net_keep_resp = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL);
- if (!o2net_hand || !o2net_keep_req || !o2net_keep_resp)
- goto out;
-
- o2net_hand->protocol_version = cpu_to_be64(O2NET_PROTOCOL_VERSION);
- o2net_hand->connector_id = cpu_to_be64(1);
-
- o2net_keep_req->magic = cpu_to_be16(O2NET_MSG_KEEP_REQ_MAGIC);
- o2net_keep_resp->magic = cpu_to_be16(O2NET_MSG_KEEP_RESP_MAGIC);
-
for (i = 0; i < ARRAY_SIZE(o2net_nodes); i++) {
struct o2net_node *nn = o2net_nn_from_num(i);

@@ -2122,21 +2142,10 @@ int o2net_init(void)
}

return 0;
-
-out:
- kfree(o2net_hand);
- kfree(o2net_keep_req);
- kfree(o2net_keep_resp);
- o2net_debugfs_exit();
- o2quo_exit();
- return -ENOMEM;
}

void o2net_exit(void)
{
o2quo_exit();
- kfree(o2net_hand);
- kfree(o2net_keep_req);
- kfree(o2net_keep_resp);
o2net_debugfs_exit();
}
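
The change above stops sending from long-lived global buffers and instead
builds each handshake in a freshly allocated buffer that is released once
the send has been issued.  That pattern can be illustrated in userspace.
What follows is only a minimal sketch: the demo_* names, the 32-byte layout
(two 64-bit and four 32-bit big-endian fields) and the use of
malloc()/free() in place of page_frag_alloc()/folio_put() are assumptions
made for illustration, not the real o2net wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wire size: 2 x be64 + 4 x be32 (not the real o2net layout). */
#define DEMO_HANDSHAKE_LEN 32

/* Store v at p in big-endian byte order. */
static void demo_put_be64(unsigned char *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

static void demo_put_be32(unsigned char *p, uint32_t v)
{
	for (int i = 3; i >= 0; i--) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/*
 * Allocate and fill a handshake for one send.  The caller owns the
 * buffer and frees it once the send has been issued.  Returns NULL on
 * allocation failure.
 */
static unsigned char *demo_build_handshake(uint64_t version, uint32_t idle_ms,
					   uint32_t keepalive_ms,
					   uint32_t reconnect_ms)
{
	unsigned char *hand = malloc(DEMO_HANDSHAKE_LEN);

	if (!hand)
		return NULL;
	memset(hand, 0, DEMO_HANDSHAKE_LEN);
	demo_put_be64(hand + 0, version);	/* protocol_version */
	demo_put_be64(hand + 8, 1);		/* connector_id */
	demo_put_be32(hand + 16, 0);		/* heartbeat timeout, zero here */
	demo_put_be32(hand + 20, idle_ms);
	demo_put_be32(hand + 24, keepalive_ms);
	demo_put_be32(hand + 28, reconnect_ms);
	return hand;
}
```

A caller would build the buffer, pass it to its send routine and then
free() it, treating a NULL return as "skip this send", much as the patched
callers do when o2net_initialize_handshake() fails.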

2023-03-31 16:34:29

by David Howells

Subject: [PATCH v3 54/55] drbd: Send an entire bio in a single sendmsg

Since _drbd_send_page() is now using sendmsg to send the pages rather than
sendpage, pass the entire bio in one go using a bvec iterator instead of
doing it piecemeal.

Signed-off-by: David Howells <[email protected]>
cc: Philipp Reisner <[email protected]>
cc: Lars Ellenberg <[email protected]>
cc: "Christoph Böhmwalder" <[email protected]>
cc: Jens Axboe <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/block/drbd/drbd_main.c | 77 +++++++++++-----------------------
1 file changed, 25 insertions(+), 52 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index e5f90abd29b6..ab63d6138407 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1512,28 +1512,15 @@ static void drbd_update_congested(struct drbd_connection *connection)
* As a workaround, we disable sendpage on pages
* with page_count == 0 or PageSlab.
*/
-static int _drbd_no_send_page(struct drbd_peer_device *peer_device, struct page *page,
- int offset, size_t size, unsigned msg_flags)
-{
- struct socket *socket;
- void *addr;
- int err;
-
- socket = peer_device->connection->data.socket;
- addr = kmap(page) + offset;
- err = drbd_send_all(peer_device->connection, socket, addr, size, msg_flags);
- kunmap(page);
- if (!err)
- peer_device->device->send_cnt += size >> 9;
- return err;
-}
-
-static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *page,
- int offset, size_t size, unsigned msg_flags)
+static int _drbd_send_pages(struct drbd_peer_device *peer_device,
+ struct iov_iter *iter, unsigned msg_flags)
{
struct socket *socket = peer_device->connection->data.socket;
- struct bio_vec bvec;
- struct msghdr msg = { .msg_flags = msg_flags, };
+ struct msghdr msg = {
+ .msg_flags = msg_flags | MSG_NOSIGNAL,
+ .msg_iter = *iter,
+ };
+ size_t size = iov_iter_count(iter);
int err = -EIO;

/* e.g. XFS meta- & log-data is in slab pages, which have a
@@ -1542,11 +1529,8 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (!drbd_disable_sendpage && sendpage_ok(page))
- msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ if (drbd_disable_sendpage)
+ msg.msg_flags &= ~(MSG_NOSIGNAL | MSG_SPLICE_PAGES);

drbd_update_congested(peer_device->connection);
do {
@@ -1577,39 +1561,22 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa

static int _drbd_send_bio(struct drbd_peer_device *peer_device, struct bio *bio)
{
- struct bio_vec bvec;
- struct bvec_iter iter;
+ struct iov_iter iter;

- /* hint all but last page with MSG_MORE */
- bio_for_each_segment(bvec, bio, iter) {
- int err;
+ iov_iter_bvec(&iter, ITER_SOURCE, bio->bi_io_vec, bio->bi_vcnt,
+ bio->bi_iter.bi_size);

- err = _drbd_no_send_page(peer_device, bvec.bv_page,
- bvec.bv_offset, bvec.bv_len,
- bio_iter_last(bvec, iter)
- ? 0 : MSG_MORE);
- if (err)
- return err;
- }
- return 0;
+ return _drbd_send_pages(peer_device, &iter, 0);
}

static int _drbd_send_zc_bio(struct drbd_peer_device *peer_device, struct bio *bio)
{
- struct bio_vec bvec;
- struct bvec_iter iter;
+ struct iov_iter iter;

- /* hint all but last page with MSG_MORE */
- bio_for_each_segment(bvec, bio, iter) {
- int err;
+ iov_iter_bvec(&iter, ITER_SOURCE, bio->bi_io_vec, bio->bi_vcnt,
+ bio->bi_iter.bi_size);

- err = _drbd_send_page(peer_device, bvec.bv_page,
- bvec.bv_offset, bvec.bv_len,
- bio_iter_last(bvec, iter) ? 0 : MSG_MORE);
- if (err)
- return err;
- }
- return 0;
+ return _drbd_send_pages(peer_device, &iter, MSG_SPLICE_PAGES);
}

static int _drbd_send_zc_ee(struct drbd_peer_device *peer_device,
@@ -1621,10 +1588,16 @@ static int _drbd_send_zc_ee(struct drbd_peer_device *peer_device,

/* hint all but last page with MSG_MORE */
page_chain_for_each(page) {
+ struct iov_iter iter;
+ struct bio_vec bvec;
unsigned l = min_t(unsigned, len, PAGE_SIZE);

- err = _drbd_send_page(peer_device, page, 0, l,
- page_chain_next(page) ? MSG_MORE : 0);
+ bvec_set_page(&bvec, page, l, 0);
+ iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, l);
+
+ err = _drbd_send_pages(peer_device, &iter,
+ MSG_SPLICE_PAGES |
+ (page_chain_next(page) ? MSG_MORE : 0));
if (err)
return err;
len -= l;
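
The idea of handing a whole list of segments to one sendmsg call, rather
than issuing one call per page, has a direct userspace analogue in the
iovec array.  The following is a minimal sketch under that assumption; it
uses struct iovec instead of the kernel's bio_vec iterator, and
demo_send_all_iov() is a hypothetical helper, not part of DRBD.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send every segment in iov[0..cnt) over fd.  All remaining segments are
 * handed to each sendmsg() call; after a short send the vector is advanced
 * and the call is repeated.  The iov array is modified in place.
 * Returns 0 on success or -1 with errno set.
 */
static int demo_send_all_iov(int fd, struct iovec *iov, int cnt)
{
	while (cnt > 0) {
		struct msghdr msg = { .msg_iov = iov, .msg_iovlen = cnt };
		ssize_t sent = sendmsg(fd, &msg, MSG_NOSIGNAL);

		if (sent < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		/* Skip the segments that were sent completely. */
		while (cnt > 0 && (size_t)sent >= iov->iov_len) {
			sent -= iov->iov_len;
			iov++;
			cnt--;
		}
		/* Trim a partially sent segment. */
		if (cnt > 0) {
			iov->iov_base = (char *)iov->iov_base + sent;
			iov->iov_len -= sent;
		}
	}
	return 0;
}
```

The point of the design is that the loop exists only to resume after a
short send; in the common case the whole vector goes out in one call.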

2023-03-31 16:35:21

by David Howells

Subject: [PATCH v3 41/55] iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()

As iscsi is now using sendmsg() with MSG_SPLICE_PAGES rather than sendpage,
assume that sendpage_ok() will return true in iscsi_tcp_segment_map() and
leave it to TCP to copy the data if not.

Signed-off-by: David Howells <[email protected]>
cc: "Martin K. Petersen" <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/scsi/libiscsi_tcp.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/libiscsi_tcp.c b/drivers/scsi/libiscsi_tcp.c
index c182aa83f2c9..07ba0d864820 100644
--- a/drivers/scsi/libiscsi_tcp.c
+++ b/drivers/scsi/libiscsi_tcp.c
@@ -128,18 +128,11 @@ static void iscsi_tcp_segment_map(struct iscsi_segment *segment, int recv)
* coalescing neighboring slab objects into a single frag which
* triggers one of hardened usercopy checks.
*/
- if (!recv && sendpage_ok(sg_page(sg)))
+ if (!recv)
return;

- if (recv) {
- segment->atomic_mapped = true;
- segment->sg_mapped = kmap_atomic(sg_page(sg));
- } else {
- segment->atomic_mapped = false;
- /* the xmit path can sleep with the page mapped so use kmap */
- segment->sg_mapped = kmap(sg_page(sg));
- }
-
+ segment->atomic_mapped = true;
+ segment->sg_mapped = kmap_atomic(sg_page(sg));
segment->data = segment->sg_mapped + sg->offset + segment->sg_offset;
}


2023-03-31 16:37:26

by David Howells

Subject: [PATCH v3 49/55] nvme: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

Signed-off-by: David Howells <[email protected]>
cc: Keith Busch <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Sagi Grimberg <[email protected]>
cc: Chaitanya Kulkarni <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
drivers/nvme/host/tcp.c | 44 ++++++++++++++++++------------------
drivers/nvme/target/tcp.c | 47 +++++++++++++++++++++++++--------------
2 files changed, 52 insertions(+), 39 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index fa32969b532f..cc617692702d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -979,25 +979,23 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
u32 h2cdata_left = req->h2cdata_left;

while (true) {
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
struct page *page = nvme_tcp_req_cur_page(req);
size_t offset = nvme_tcp_req_cur_offset(req);
size_t len = nvme_tcp_req_cur_length(req);
bool last = nvme_tcp_pdu_last_send(req, len);
int req_data_sent = req->data_sent;
- int ret, flags = MSG_DONTWAIT;
+ int ret;

if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;
else
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

- if (sendpage_ok(page)) {
- ret = kernel_sendpage(queue->sock, page, offset, len,
- flags);
- } else {
- ret = sock_no_sendpage(queue->sock, page, offset, len,
- flags);
- }
+ bvec_set_page(&bvec, page, len, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (ret <= 0)
return ret;

@@ -1036,22 +1034,24 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_cmd_pdu *pdu = req->pdu;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
bool inline_data = nvme_tcp_has_inline_data(req);
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) + hdgst - req->offset;
- int flags = MSG_DONTWAIT;
int ret;

if (inline_data || nvme_tcp_queue_more(queue))
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;

if (queue->hdr_digest && !req->offset)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

- ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len, flags);
+ bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;

@@ -1075,6 +1075,8 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
struct nvme_tcp_data_pdu *pdu = req->pdu;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE, };
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) - req->offset + hdgst;
int ret;
@@ -1083,13 +1085,11 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));

if (!req->h2cdata_left)
- ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len,
- MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
- else
- ret = sock_no_sendpage(queue->sock, virt_to_page(pdu),
- offset_in_page(pdu) + req->offset, len,
- MSG_DONTWAIT | MSG_MORE);
+ msg.msg_flags |= MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST;
+
+ bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index d6cc557cc539..00b491abf50f 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -548,13 +548,18 @@ static void nvmet_tcp_execute_request(struct nvmet_tcp_cmd *cmd)

static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
{
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = (MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST |
+ MSG_SPLICE_PAGES),
+ };
u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
int left = sizeof(*cmd->data_pdu) - cmd->offset + hdgst;
int ret;

- ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->data_pdu),
- offset_in_page(cmd->data_pdu) + cmd->offset,
- left, MSG_DONTWAIT | MSG_MORE | MSG_SENDPAGE_NOTLAST);
+ bvec_set_virt(&bvec, (void *)cmd->data_pdu + cmd->offset, left);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, left);
+ ret = sock_sendmsg(cmd->queue->sock, &msg);
if (ret <= 0)
return ret;

@@ -575,17 +580,21 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
int ret;

while (cmd->cur_sg) {
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
+ };
struct page *page = sg_page(cmd->cur_sg);
u32 left = cmd->cur_sg->length - cmd->offset;
- int flags = MSG_DONTWAIT;

if ((!last_in_batch && cmd->queue->send_list_len) ||
cmd->wbytes_done + left < cmd->req.transfer_len ||
queue->data_digest || !queue->nvme_sq.sqhd_disabled)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

- ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
- left, flags);
+ bvec_set_page(&bvec, page, left, cmd->offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, left);
+ ret = sock_sendmsg(cmd->queue->sock, &msg);
if (ret <= 0)
return ret;

@@ -621,18 +630,20 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd,
bool last_in_batch)
{
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
int left = sizeof(*cmd->rsp_pdu) - cmd->offset + hdgst;
- int flags = MSG_DONTWAIT;
int ret;

if (!last_in_batch && cmd->queue->send_list_len)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;

- ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->rsp_pdu),
- offset_in_page(cmd->rsp_pdu) + cmd->offset, left, flags);
+ bvec_set_virt(&bvec, (void *)cmd->rsp_pdu + cmd->offset, left);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, left);
+ ret = sock_sendmsg(cmd->queue->sock, &msg);
if (ret <= 0)
return ret;
cmd->offset += ret;
@@ -649,18 +660,20 @@ static int nvmet_try_send_response(struct nvmet_tcp_cmd *cmd,

static int nvmet_try_send_r2t(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
{
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES, };
u8 hdgst = nvmet_tcp_hdgst_len(cmd->queue);
int left = sizeof(*cmd->r2t_pdu) - cmd->offset + hdgst;
- int flags = MSG_DONTWAIT;
int ret;

if (!last_in_batch && cmd->queue->send_list_len)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
- flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;

- ret = kernel_sendpage(cmd->queue->sock, virt_to_page(cmd->r2t_pdu),
- offset_in_page(cmd->r2t_pdu) + cmd->offset, left, flags);
+ bvec_set_virt(&bvec, (void *)cmd->r2t_pdu + cmd->offset, left);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, left);
+ ret = sock_sendmsg(cmd->queue->sock, &msg);
if (ret <= 0)
return ret;
cmd->offset += ret;
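
Each of the send paths above makes the same decision per chunk: mark the
end of the record when nothing else follows, otherwise hint that more data
is coming.  A minimal userspace sketch of that decision is shown below.
demo_chunk_flags() and its parameter names are hypothetical, and because
MSG_SPLICE_PAGES and MSG_SENDPAGE_NOTLAST are kernel-internal flags, only
MSG_MORE and MSG_EOR appear here.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Pick sendmsg() flags for one chunk of a larger message.  MSG_MORE tells
 * the stack that more data follows, so it may hold back and coalesce;
 * MSG_EOR marks the end of the record.
 */
static int demo_chunk_flags(bool last_chunk, bool digest_follows,
			    bool more_queued)
{
	int flags = MSG_DONTWAIT;

	if (last_chunk && !digest_follows && !more_queued)
		flags |= MSG_EOR;
	else
		flags |= MSG_MORE;
	return flags;
}
```

This mirrors the condition in nvme_tcp_try_send_data(), where a trailing
data digest or further queued requests both count as "more to come".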

2023-03-31 16:38:44

by David Howells

Subject: [PATCH v3 42/55] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES)

Translate tcp_bpf_sendpage() calls to tcp_bpf_sendmsg(MSG_SPLICE_PAGES).

Signed-off-by: David Howells <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Sitnicki <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ipv4/tcp_bpf.c | 49 +++++++++-------------------------------------
1 file changed, 9 insertions(+), 40 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 7f17134637eb..de37a4372437 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -485,49 +485,18 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
- struct sk_msg tmp, *msg = NULL;
- int err = 0, copied = 0;
- struct sk_psock *psock;
- bool enospc = false;
-
- psock = sk_psock_get(sk);
- if (unlikely(!psock))
- return tcp_sendpage(sk, page, offset, size, flags);
+ struct bio_vec bvec;
+ struct msghdr msg = {
+ .msg_flags = flags | MSG_SPLICE_PAGES,
+ };

- lock_sock(sk);
- if (psock->cork) {
- msg = psock->cork;
- } else {
- msg = &tmp;
- sk_msg_init(msg);
- }
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

- /* Catch case where ring is full and sendpage is stalled. */
- if (unlikely(sk_msg_full(msg)))
- goto out_err;
-
- sk_msg_page_add(msg, page, size, offset);
- sk_mem_charge(sk, size);
- copied = size;
- if (sk_msg_full(msg))
- enospc = true;
- if (psock->cork_bytes) {
- if (size > psock->cork_bytes)
- psock->cork_bytes = 0;
- else
- psock->cork_bytes -= size;
- if (psock->cork_bytes && !enospc)
- goto out_err;
- /* All cork bytes are accounted, rerun the prog. */
- psock->eval = __SK_NONE;
- psock->cork_bytes = 0;
- }
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
-out_err:
- release_sock(sk);
- sk_psock_put(sk, psock);
- return copied ? copied : err;
+ return tcp_bpf_sendmsg(sk, &msg, size);
}

enum {
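
The shape of this change, a single-page entry point reduced to a thin
wrapper around the vectored sendmsg path, can be sketched in userspace as
follows.  This is an illustration only: demo_sendv() and demo_send_range()
are hypothetical names, struct iovec stands in for bio_vec, and a boolean
stands in for MSG_SENDPAGE_NOTLAST.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Vectored send path: the one place that actually talks to the socket. */
static ssize_t demo_sendv(int fd, struct iovec *iov, int cnt, int flags)
{
	struct msghdr msg = { .msg_iov = iov, .msg_iovlen = cnt };

	return sendmsg(fd, &msg, flags | MSG_NOSIGNAL);
}

/*
 * Legacy-style single-range entry point, kept only as a thin wrapper:
 * describe the range with a one-element vector, translate the caller's
 * "not last" hint into MSG_MORE, and forward to the vectored path.
 */
static ssize_t demo_send_range(int fd, const void *base, size_t offset,
			       size_t size, bool not_last)
{
	struct iovec iov = {
		.iov_base = (char *)base + offset,
		.iov_len = size,
	};

	return demo_sendv(fd, &iov, 1, not_last ? MSG_MORE : 0);
}
```

Keeping all socket interaction in the vectored function means any policy
applied there covers both entry points, which is the effect the patch gets
by routing tcp_bpf_sendpage() through tcp_bpf_sendmsg().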

2023-03-31 16:38:50

by David Howells

Subject: [PATCH v3 37/55] Remove file->f_op->sendpage

Remove file->f_op->sendpage as splicing to a socket now calls sendmsg
rather than sendpage.

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/linux/fs.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3ccc243851e..a9f1b2543d2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1773,7 +1773,6 @@ struct file_operations {
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);

2023-03-31 16:39:09

by David Howells

Subject: Trivial TLS server

Here's a trivial TLS server that can be used to test this.

David
---
/*
* TLS-over-TCP sink server
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)

static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));

static void set_tls(int sock)
{
struct tls12_crypto_info_aes_gcm_128 crypto_info;

crypto_info.info.version = TLS_1_2_VERSION;
crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
memset(crypto_info.iv, 0, TLS_CIPHER_AES_GCM_128_IV_SIZE);
memset(crypto_info.rec_seq, 0, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
memset(crypto_info.key, 0, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
memset(crypto_info.salt, 0, TLS_CIPHER_AES_GCM_128_SALT_SIZE);

OSERROR(setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")),
"TCP_ULP");
OSERROR(setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)),
"TLS_TX");
OSERROR(setsockopt(sock, SOL_TLS, TLS_RX, &crypto_info, sizeof(crypto_info)),
"TLS_RX");
}

int main(int argc, char *argv[])
{
struct sockaddr_in sin = { .sin_family = AF_INET, .sin_port = htons(5556) };
int sfd, afd;

sfd = socket(AF_INET, SOCK_STREAM, 0);
OSERROR(sfd, "socket");
OSERROR(bind(sfd, (struct sockaddr *)&sin, sizeof(sin)), "bind");
OSERROR(listen(sfd, 1), "listen");

for (;;) {
afd = accept(sfd, NULL, NULL);
if (afd != -1) {
set_tls(afd);
while (read(afd, buffer, sizeof(buffer)) > 0) {}
close(afd);
}
}
}

2023-03-31 16:39:35

by David Howells

Subject: [PATCH v3 53/55] drbd: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Use sendmsg() conditionally with MSG_SPLICE_PAGES in _drbd_send_page()
rather than calling sendpage() or _drbd_no_send_page().

Signed-off-by: David Howells <[email protected]>
cc: Philipp Reisner <[email protected]>
cc: Lars Ellenberg <[email protected]>
cc: "Christoph Böhmwalder" <[email protected]>
cc: Jens Axboe <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/block/drbd/drbd_main.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 2c764f7ee4a7..e5f90abd29b6 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1532,7 +1532,8 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
int offset, size_t size, unsigned msg_flags)
{
struct socket *socket = peer_device->connection->data.socket;
- int len = size;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = msg_flags, };
int err = -EIO;

/* e.g. XFS meta- & log-data is in slab pages, which have a
@@ -1541,33 +1542,33 @@ static int _drbd_send_page(struct drbd_peer_device *peer_device, struct page *pa
* put_page(); and would cause either a VM_BUG directly, or
* __page_cache_release a page that would actually still be referenced
* by someone, leading to some obscure delayed Oops somewhere else. */
- if (drbd_disable_sendpage || !sendpage_ok(page))
- return _drbd_no_send_page(peer_device, page, offset, size, msg_flags);
+ if (!drbd_disable_sendpage && sendpage_ok(page))
+ msg.msg_flags |= MSG_NOSIGNAL | MSG_SPLICE_PAGES;
+
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

- msg_flags |= MSG_NOSIGNAL;
drbd_update_congested(peer_device->connection);
do {
int sent;

- sent = socket->ops->sendpage(socket, page, offset, len, msg_flags);
+ sent = sock_sendmsg(socket, &msg);
if (sent <= 0) {
if (sent == -EAGAIN) {
if (we_should_drop_the_connection(peer_device->connection, socket))
break;
continue;
}
- drbd_warn(peer_device->device, "%s: size=%d len=%d sent=%d\n",
- __func__, (int)size, len, sent);
+ drbd_warn(peer_device->device, "%s: size=%d len=%zu sent=%d\n",
+ __func__, (int)size, msg_data_left(&msg), sent);
if (sent < 0)
err = sent;
break;
}
- len -= sent;
- offset += sent;
- } while (len > 0 /* THINK && device->cstate >= C_CONNECTED*/);
+ } while (msg_data_left(&msg) /* THINK && device->cstate >= C_CONNECTED*/);
clear_bit(NET_CONGESTED, &peer_device->connection->flags);

- if (len == 0) {
+ if (!msg_data_left(&msg)) {
err = 0;
peer_device->device->send_cnt += size >> 9;
}

2023-03-31 16:39:54

by David Howells

Subject: [PATCH v3 45/55] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data. For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() can be given a bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.

Signed-off-by: David Howells <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Xiubo Li <[email protected]>
cc: Jeff Layton <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/ceph/messenger_v2.c | 89 +++++++++--------------------------------
1 file changed, 18 insertions(+), 71 deletions(-)

diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index 301a991dc6a6..1637a0c21126 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -117,91 +117,38 @@ static int ceph_tcp_recv(struct ceph_connection *con)
return ret;
}

-static int do_sendmsg(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- int ret;
-
- msg.msg_iter = *it;
- while (iov_iter_count(it)) {
- ret = sock_sendmsg(sock, &msg);
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- WARN_ON(msg_data_left(&msg));
- return 1;
-}
-
-static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
-{
- struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
- struct bio_vec bv;
- int ret;
-
- if (WARN_ON(!iov_iter_is_bvec(it)))
- return -EINVAL;
-
- while (iov_iter_count(it)) {
- /* iov_iter_iovec() for ITER_BVEC */
- bvec_set_page(&bv, it->bvec->bv_page,
- min(iov_iter_count(it),
- it->bvec->bv_len - it->iov_offset),
- it->bvec->bv_offset + it->iov_offset);
-
- /*
- * sendpage cannot properly handle pages with
- * page_count == 0, we need to fall back to sendmsg if
- * that's the case.
- *
- * Same goes for slab pages: skb_can_coalesce() allows
- * coalescing neighboring slab objects into a single frag
- * which triggers one of hardened usercopy checks.
- */
- if (sendpage_ok(bv.bv_page)) {
- ret = sock->ops->sendpage(sock, bv.bv_page,
- bv.bv_offset, bv.bv_len,
- CEPH_MSG_FLAGS);
- } else {
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, bv.bv_len);
- ret = sock_sendmsg(sock, &msg);
- }
- if (ret <= 0) {
- if (ret == -EAGAIN)
- ret = 0;
- return ret;
- }
-
- iov_iter_advance(it, ret);
- }
-
- return 1;
-}
-
/*
* Write as much as possible. The socket is expected to be corked,
* so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
*
* Return:
- * 1 - done, nothing (else) to write
+ * >0 - done, nothing (else) to write
* 0 - socket is full, need to wait
* <0 - error
*/
static int ceph_tcp_send(struct ceph_connection *con)
{
+ struct msghdr msg = {
+ .msg_iter = con->v2.out_iter,
+ .msg_flags = CEPH_MSG_FLAGS,
+ };
int ret;

+ if (WARN_ON(!iov_iter_is_bvec(&con->v2.out_iter)))
+ return -EINVAL;
+
+ if (con->v2.out_iter_sendpage)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
+
dout("%s con %p have %zu try_sendpage %d\n", __func__, con,
iov_iter_count(&con->v2.out_iter), con->v2.out_iter_sendpage);
- if (con->v2.out_iter_sendpage)
- ret = do_try_sendpage(con->sock, &con->v2.out_iter);
- else
- ret = do_sendmsg(con->sock, &con->v2.out_iter);
+
+ ret = sock_sendmsg(con->sock, &msg);
+ if (ret > 0)
+ iov_iter_advance(&con->v2.out_iter, ret);
+ else if (ret == -EAGAIN)
+ ret = 0;
+
dout("%s con %p ret %d left %zu\n", __func__, con, ret,
iov_iter_count(&con->v2.out_iter));
return ret;
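
The return convention documented for ceph_tcp_send() (positive on
progress, 0 when the socket is full, negative on error) can be tried out
in userspace.  The following is a minimal sketch assuming a non-blocking
send on an ordinary socket; demo_try_send() is a hypothetical helper and
takes a flat buffer plus offset rather than an iov_iter.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try once to send buf[*off..len) without blocking.  The caller must
 * ensure *off < len.
 * Return:
 *   >0 - bytes sent, *off advanced by that amount
 *    0 - socket is full, wait for writability and call again
 *   <0 - negative errno
 */
static int demo_try_send(int fd, const char *buf, size_t len, size_t *off)
{
	ssize_t ret = send(fd, buf + *off, len - *off,
			   MSG_DONTWAIT | MSG_NOSIGNAL);

	if (ret > 0) {
		*off += (size_t)ret;
		return (int)ret;
	}
	if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return ret < 0 ? -errno : 0;
}
```

Folding EAGAIN into 0 lets the caller tell "try again later" apart from a
real failure without inspecting errno, which is what the patched function
does with the result of sock_sendmsg().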

2023-03-31 16:40:46

by David Howells

Subject: Trivial TLS client

Here's a trivial TLS client program for testing this.

David
---
/*
* TLS-over-TCP send client
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/stat.h>
#include <sys/sendfile.h>
#include <linux/tls.h>

#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)

static unsigned char buffer[4096] __attribute__((aligned(4096)));

static void set_tls(int sock)
{
struct tls12_crypto_info_aes_gcm_128 crypto_info;

crypto_info.info.version = TLS_1_2_VERSION;
crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
memset(crypto_info.iv, 0, TLS_CIPHER_AES_GCM_128_IV_SIZE);
memset(crypto_info.rec_seq, 0, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
memset(crypto_info.key, 0, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
memset(crypto_info.salt, 0, TLS_CIPHER_AES_GCM_128_SALT_SIZE);

OSERROR(setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")),
"TCP_ULP");
OSERROR(setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)),
"TLS_TX");
OSERROR(setsockopt(sock, SOL_TLS, TLS_RX, &crypto_info, sizeof(crypto_info)),
"TLS_RX");
}

int main(int argc, char *argv[])
{
struct sockaddr_in sin = { .sin_family = AF_INET, .sin_port = htons(5556) };
struct hostent *h;
struct stat st;
ssize_t r, o;
int sf = 0;
int cfd, fd;

if (argc > 1 && strcmp(argv[1], "-s") == 0) {
sf = 1;
argc--;
argv++;
}

if (argc != 3) {
fprintf(stderr, "tcp-send [-s] <server> <file>\n");
exit(2);
}

h = gethostbyname(argv[1]);
if (!h) {
fprintf(stderr, "%s: %s\n", argv[1], hstrerror(h_errno));
exit(3);
}

if (!h->h_addr_list[0]) {
fprintf(stderr, "%s: No addresses\n", argv[1]);
exit(3);
}

memcpy(&sin.sin_addr, h->h_addr_list[0], h->h_length);

cfd = socket(AF_INET, SOCK_STREAM, 0);
OSERROR(cfd, "socket");
OSERROR(connect(cfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");
set_tls(cfd);

fd = open(argv[2], O_RDONLY);
OSERROR(fd, argv[2]);
OSERROR(fstat(fd, &st), argv[2]);

if (!sf) {
for (;;) {
r = read(fd, buffer, sizeof(buffer));
OSERROR(r, argv[2]);
if (r == 0)
break;

o = 0;
do {
ssize_t w = write(cfd, buffer + o, r - o);
OSERROR(w, "write");
o += w;
} while (o < r);
}
} else {
off_t off = 0;
r = sendfile(cfd, fd, &off, st.st_size);
OSERROR(r, "sendfile");
if (r != st.st_size) {
fprintf(stderr, "Short sendfile\n");
exit(1);
}
}

OSERROR(close(cfd), "close/c");
OSERROR(close(fd), "close/f");
return 0;
}

2023-03-31 16:42:43

by David Howells

Subject: [PATCH v3 40/55] iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage. This allows
multiple pages and multipage folios to be passed through.

TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array covering
the entire set of pages it's going to transfer, plus two extra entries for
the header and trailer (with page fragments allocated to hold them), and
then call sendmsg once for the entire message.

Signed-off-by: David Howells <[email protected]>
cc: "Martin K. Petersen" <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
drivers/scsi/iscsi_tcp.c | 31 ++++++++++++------------
drivers/scsi/iscsi_tcp.h | 2 +-
drivers/target/iscsi/iscsi_target_util.c | 14 ++++++-----
3 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index c76f82fb8b63..cf3eb55d2a76 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -301,35 +301,37 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,

while (!iscsi_tcp_segment_done(tcp_conn, segment, 0, r)) {
struct scatterlist *sg;
+ struct msghdr msg = {};
+ union {
+ struct kvec kv;
+ struct bio_vec bv;
+ } vec;
unsigned int offset, copy;
- int flags = 0;

r = 0;
offset = segment->copied;
copy = segment->size - offset;

if (segment->total_copied + segment->size < segment->total_size)
- flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
+ msg.msg_flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

if (tcp_sw_conn->queue_recv)
- flags |= MSG_DONTWAIT;
+ msg.msg_flags |= MSG_DONTWAIT;

- /* Use sendpage if we can; else fall back to sendmsg */
if (!segment->data) {
+ if (!tcp_conn->iscsi_conn->datadgst_en)
+ msg.msg_flags |= MSG_SPLICE_PAGES;
sg = segment->sg;
offset += segment->sg_offset + sg->offset;
- r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
- copy, flags);
+ bvec_set_page(&vec.bv, sg_page(sg), copy, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &vec.bv, 1, copy);
} else {
- struct msghdr msg = { .msg_flags = flags };
- struct kvec iov = {
- .iov_base = segment->data + offset,
- .iov_len = copy
- };
-
- r = kernel_sendmsg(sk, &msg, &iov, 1, copy);
+ vec.kv.iov_base = segment->data + offset;
+ vec.kv.iov_len = copy;
+ iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &vec.kv, 1, copy);
}

+ r = sock_sendmsg(sk, &msg);
if (r < 0) {
iscsi_tcp_segment_unmap(segment);
return r;
@@ -746,7 +748,6 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
sock_no_linger(sk);

iscsi_sw_tcp_conn_set_callbacks(conn);
- tcp_sw_conn->sendpage = tcp_sw_conn->sock->ops->sendpage;
/*
* set receive state machine into initial state
*/
@@ -778,8 +779,6 @@ static int iscsi_sw_tcp_conn_set_param(struct iscsi_cls_conn *cls_conn,
mutex_unlock(&tcp_sw_conn->sock_lock);
return -ENOTCONN;
}
- tcp_sw_conn->sendpage = conn->datadgst_en ?
- sock_no_sendpage : tcp_sw_conn->sock->ops->sendpage;
mutex_unlock(&tcp_sw_conn->sock_lock);
break;
case ISCSI_PARAM_MAX_R2T:
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 68e14a344904..d6ec08d7eb63 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -48,7 +48,7 @@ struct iscsi_sw_tcp_conn {
uint32_t sendpage_failures_cnt;
uint32_t discontiguous_hdr_cnt;

- ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+ bool can_splice_to_tcp;
};

struct iscsi_sw_tcp_host {
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 26dc8ed3045b..c7d58e41ac3b 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1078,6 +1078,8 @@ int iscsit_fe_sendpage_sg(
struct iscsit_conn *conn)
{
struct scatterlist *sg = cmd->first_data_sg;
+ struct bio_vec bvec;
+ struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
struct kvec iov;
u32 tx_hdr_size, data_len;
u32 offset = cmd->first_data_sg_off;
@@ -1121,17 +1123,17 @@ int iscsit_fe_sendpage_sg(
u32 space = (sg->length - offset);
u32 sub_len = min_t(u32, data_len, space);
send_pg:
- tx_sent = conn->sock->ops->sendpage(conn->sock,
- sg_page(sg), sg->offset + offset, sub_len, 0);
+ bvec_set_page(&bvec, sg_page(sg), sub_len, sg->offset + offset);
+ iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, sub_len);
+
+ tx_sent = conn->sock->ops->sendmsg(conn->sock, &msghdr, sub_len);
if (tx_sent != sub_len) {
if (tx_sent == -EAGAIN) {
- pr_err("tcp_sendpage() returned"
- " -EAGAIN\n");
+ pr_err("sendmsg/splice returned -EAGAIN\n");
goto send_pg;
}

- pr_err("tcp_sendpage() failure: %d\n",
- tx_sent);
+ pr_err("sendmsg/splice failure: %d\n", tx_sent);
return -1;
}


2023-03-31 16:43:18

by David Howells

Subject: [PATCH v3 55/55] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

Remove ->sendpage() and ->sendpage_locked(). sendmsg() with
MSG_SPLICE_PAGES should be used instead. This allows multiple pages and
multipage folios to be passed through.

Signed-off-by: David Howells <[email protected]>
Acked-by: Marc Kleine-Budde <[email protected]> # for net/can
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
Documentation/networking/scaling.rst | 4 +-
crypto/af_alg.c | 29 ----
crypto/algif_aead.c | 22 +--
crypto/algif_rng.c | 2 -
crypto/algif_skcipher.c | 14 --
.../chelsio/inline_crypto/chtls/chtls.h | 2 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 14 --
.../chelsio/inline_crypto/chtls/chtls_main.c | 1 -
include/linux/net.h | 8 -
include/net/inet_common.h | 2 -
include/net/sock.h | 6 -
net/appletalk/ddp.c | 1 -
net/atm/pvc.c | 1 -
net/atm/svc.c | 1 -
net/ax25/af_ax25.c | 1 -
net/caif/caif_socket.c | 2 -
net/can/bcm.c | 1 -
net/can/isotp.c | 1 -
net/can/j1939/socket.c | 1 -
net/can/raw.c | 1 -
net/core/sock.c | 35 +----
net/dccp/ipv4.c | 1 -
net/dccp/ipv6.c | 1 -
net/ieee802154/socket.c | 2 -
net/ipv4/af_inet.c | 21 ---
net/ipv4/tcp.c | 34 -----
net/ipv4/tcp_bpf.c | 21 +--
net/ipv4/tcp_ipv4.c | 1 -
net/ipv4/udp.c | 22 ---
net/ipv4/udp_impl.h | 2 -
net/ipv4/udplite.c | 1 -
net/ipv6/af_inet6.c | 3 -
net/ipv6/raw.c | 1 -
net/ipv6/tcp_ipv6.c | 1 -
net/kcm/kcmsock.c | 20 ---
net/key/af_key.c | 1 -
net/l2tp/l2tp_ip.c | 1 -
net/l2tp/l2tp_ip6.c | 1 -
net/llc/af_llc.c | 1 -
net/mctp/af_mctp.c | 1 -
net/mptcp/protocol.c | 2 -
net/netlink/af_netlink.c | 1 -
net/netrom/af_netrom.c | 1 -
net/packet/af_packet.c | 2 -
net/phonet/socket.c | 2 -
net/qrtr/af_qrtr.c | 1 -
net/rds/af_rds.c | 1 -
net/rose/af_rose.c | 1 -
net/rxrpc/af_rxrpc.c | 1 -
net/sctp/protocol.c | 1 -
net/socket.c | 48 ------
net/tipc/socket.c | 3 -
net/tls/tls_main.c | 7 -
net/unix/af_unix.c | 139 ------------------
net/vmw_vsock/af_vsock.c | 3 -
net/x25/af_x25.c | 1 -
net/xdp/xsk.c | 1 -
57 files changed, 9 insertions(+), 491 deletions(-)

diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..92c9fb46d6a2 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes.
rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 686610a4986f..9f84816dcabf 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -483,7 +483,6 @@ static const struct proto_ops alg_proto_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,

@@ -1110,34 +1109,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
}
EXPORT_SYMBOL_GPL(af_alg_sendmsg);

-/**
- * af_alg_sendpage - sendpage system call handler
- * @sock: socket of connection to user space to write to
- * @page: data to send
- * @offset: offset into page to begin sending
- * @size: length of data
- * @flags: message send/receive flags
- *
- * This is a generic implementation of sendpage to fill ctx->tsgl_list.
- */
-ssize_t af_alg_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES,
- };
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return sock_sendmsg(sock, &msg);
-}
-EXPORT_SYMBOL_GPL(af_alg_sendpage);
-
/**
* af_alg_free_resources - release resources required for crypto request
* @areq: Request holding the TX and RX SGL
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index b16111a3025a..37b08e5f9114 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -9,10 +9,10 @@
* The following concept of the memory management is used:
*
* The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage. Filling up
- * the TX SGL does not cause a crypto operation -- the data will only be
- * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
- * provide a buffer which is tracked with the RX SGL.
+ * filled by user space with the data submitted via sendmsg (maybe with
+ * MSG_SPLICE_PAGES). Filling up the TX SGL does not cause a crypto operation
+ * -- the data will only be tracked by the kernel. Upon receipt of one recvmsg
+ * call, the caller must provide a buffer which is tracked with the RX SGL.
*
* During the processing of the recvmsg operation, the cipher request is
* allocated and prepared. As part of the recvmsg operation, the processed
@@ -368,7 +368,6 @@ static struct proto_ops algif_aead_ops = {

.release = af_alg_release,
.sendmsg = aead_sendmsg,
- .sendpage = af_alg_sendpage,
.recvmsg = aead_recvmsg,
.poll = af_alg_poll,
};
@@ -420,18 +419,6 @@ static int aead_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return aead_sendmsg(sock, msg, size);
}

-static ssize_t aead_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = aead_check_key(sock);
- if (err)
- return err;
-
- return af_alg_sendpage(sock, page, offset, size, flags);
-}
-
static int aead_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -459,7 +446,6 @@ static struct proto_ops algif_aead_ops_nokey = {

.release = af_alg_release,
.sendmsg = aead_sendmsg_nokey,
- .sendpage = aead_sendpage_nokey,
.recvmsg = aead_recvmsg_nokey,
.poll = af_alg_poll,
};
diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
index 407408c43730..10c41adac3b1 100644
--- a/crypto/algif_rng.c
+++ b/crypto/algif_rng.c
@@ -174,7 +174,6 @@ static struct proto_ops algif_rng_ops = {
.bind = sock_no_bind,
.accept = sock_no_accept,
.sendmsg = sock_no_sendmsg,
- .sendpage = sock_no_sendpage,

.release = af_alg_release,
.recvmsg = rng_recvmsg,
@@ -192,7 +191,6 @@ static struct proto_ops __maybe_unused algif_rng_test_ops = {
.mmap = sock_no_mmap,
.bind = sock_no_bind,
.accept = sock_no_accept,
- .sendpage = sock_no_sendpage,

.release = af_alg_release,
.recvmsg = rng_test_recvmsg,
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index b1f321b9f846..9ada9b741af8 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -194,7 +194,6 @@ static struct proto_ops algif_skcipher_ops = {

.release = af_alg_release,
.sendmsg = skcipher_sendmsg,
- .sendpage = af_alg_sendpage,
.recvmsg = skcipher_recvmsg,
.poll = af_alg_poll,
};
@@ -246,18 +245,6 @@ static int skcipher_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
return skcipher_sendmsg(sock, msg, size);
}

-static ssize_t skcipher_sendpage_nokey(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
-
- err = skcipher_check_key(sock);
- if (err)
- return err;
-
- return af_alg_sendpage(sock, page, offset, size, flags);
-}
-
static int skcipher_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
size_t ignored, int flags)
{
@@ -285,7 +272,6 @@ static struct proto_ops algif_skcipher_ops_nokey = {

.release = af_alg_release,
.sendmsg = skcipher_sendmsg_nokey,
- .sendpage = skcipher_sendpage_nokey,
.recvmsg = skcipher_recvmsg_nokey,
.poll = af_alg_poll,
};
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
index 41714203ace8..94760a681566 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
@@ -568,8 +568,6 @@ void chtls_destroy_sock(struct sock *sk);
int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
int chtls_recvmsg(struct sock *sk, struct msghdr *msg,
size_t len, int flags, int *addr_len);
-int chtls_sendpage(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
int send_tx_flowc_wr(struct sock *sk, int compl,
u32 snd_nxt, u32 rcv_nxt);
void chtls_tcp_push(struct sock *sk, int flags);
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
index 5c397cb57300..fb44333efa3e 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
@@ -1285,20 +1285,6 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
goto done;
}

-int chtls_sendpage(struct sock *sk, struct page *page,
- int offset, size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- bvec_set_page(&bvec, page, offset, size);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- return chtls_sendmsg(sk, &msg, size);
-}
-
static void chtls_select_window(struct sock *sk)
{
struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
index 1e55b12fee51..1b8e6994e8fe 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
@@ -606,7 +606,6 @@ static void __init chtls_init_ulp_ops(void)
chtls_cpl_prot.destroy = chtls_destroy_sock;
chtls_cpl_prot.shutdown = chtls_shutdown;
chtls_cpl_prot.sendmsg = chtls_sendmsg;
- chtls_cpl_prot.sendpage = chtls_sendpage;
chtls_cpl_prot.recvmsg = chtls_recvmsg;
chtls_cpl_prot.setsockopt = chtls_setsockopt;
chtls_cpl_prot.getsockopt = chtls_getsockopt;
diff --git a/include/linux/net.h b/include/linux/net.h
index b73ad8e3c212..e5794968ac9f 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -206,8 +206,6 @@ struct proto_ops {
size_t total_len, int flags);
int (*mmap) (struct file *file, struct socket *sock,
struct vm_area_struct * vma);
- ssize_t (*sendpage) (struct socket *sock, struct page *page,
- int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
int (*set_peek_off)(struct sock *sk, int val);
@@ -220,8 +218,6 @@ struct proto_ops {
sk_read_actor_t recv_actor);
/* This is different from read_sock(), it reads an entire skb at a time. */
int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);
- int (*sendpage_locked)(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
size_t size);
int (*set_rcvlowat)(struct sock *sk, int val);
@@ -339,10 +335,6 @@ int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
int flags);
int kernel_getsockname(struct socket *sock, struct sockaddr *addr);
int kernel_getpeername(struct socket *sock, struct sockaddr *addr);
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
-int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags);
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);

/* Routine returns the IP overhead imposed by a (caller-protected) socket. */
diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index cec453c18f1d..054c3388fa51 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -33,8 +33,6 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags,
bool kern);
int inet_send_prepare(struct sock *sk);
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
int flags);
int inet_shutdown(struct socket *sock, int how);
diff --git a/include/net/sock.h b/include/net/sock.h
index 573f2bf7e0de..4618cd21e16b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1265,8 +1265,6 @@ struct proto {
size_t len);
int (*recvmsg)(struct sock *sk, struct msghdr *msg,
size_t len, int flags, int *addr_len);
- int (*sendpage)(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);
int (*bind)(struct sock *sk,
struct sockaddr *addr, int addr_len);
int (*bind_add)(struct sock *sk,
@@ -1906,10 +1904,6 @@ int sock_no_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int);
int sock_no_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma);
-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags);
-ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
- int offset, size_t size, int flags);

/*
* Functions to fill in entries in struct proto_ops when a protocol
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index a06f4d4a6f47..8978fb6212ff 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1929,7 +1929,6 @@ static const struct proto_ops atalk_dgram_ops = {
.sendmsg = atalk_sendmsg,
.recvmsg = atalk_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct notifier_block ddp_notifier = {
diff --git a/net/atm/pvc.c b/net/atm/pvc.c
index 53e7d3f39e26..66d9a9bd5896 100644
--- a/net/atm/pvc.c
+++ b/net/atm/pvc.c
@@ -126,7 +126,6 @@ static const struct proto_ops pvc_proto_ops = {
.sendmsg = vcc_sendmsg,
.recvmsg = vcc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};


diff --git a/net/atm/svc.c b/net/atm/svc.c
index 4a02bcaad279..289240fe234e 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -649,7 +649,6 @@ static const struct proto_ops svc_proto_ops = {
.sendmsg = vcc_sendmsg,
.recvmsg = vcc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};


diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index d8da400cb4de..5db805d5f74d 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -2022,7 +2022,6 @@ static const struct proto_ops ax25_proto_ops = {
.sendmsg = ax25_sendmsg,
.recvmsg = ax25_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

/*
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index 4eebcc66c19a..9c82698da4f5 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -976,7 +976,6 @@ static const struct proto_ops caif_seqpacket_ops = {
.sendmsg = caif_seqpkt_sendmsg,
.recvmsg = caif_seqpkt_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static const struct proto_ops caif_stream_ops = {
@@ -996,7 +995,6 @@ static const struct proto_ops caif_stream_ops = {
.sendmsg = caif_stream_sendmsg,
.recvmsg = caif_stream_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

/* This function is called when a socket is finally destroyed. */
diff --git a/net/can/bcm.c b/net/can/bcm.c
index 27706f6ace34..65a946a36d92 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1699,7 +1699,6 @@ static const struct proto_ops bcm_ops = {
.sendmsg = bcm_sendmsg,
.recvmsg = bcm_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto bcm_proto __read_mostly = {
diff --git a/net/can/isotp.c b/net/can/isotp.c
index 9bc344851704..0c3d11c29a2b 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -1633,7 +1633,6 @@ static const struct proto_ops isotp_ops = {
.sendmsg = isotp_sendmsg,
.recvmsg = isotp_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto isotp_proto __read_mostly = {
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 7e90f9e61d9b..2bfe4f79bb67 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1301,7 +1301,6 @@ static const struct proto_ops j1939_ops = {
.sendmsg = j1939_sk_sendmsg,
.recvmsg = j1939_sk_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto j1939_proto __read_mostly = {
diff --git a/net/can/raw.c b/net/can/raw.c
index f64469b98260..15c79b079184 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -962,7 +962,6 @@ static const struct proto_ops raw_ops = {
.sendmsg = raw_sendmsg,
.recvmsg = raw_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto raw_proto __read_mostly = {
diff --git a/net/core/sock.c b/net/core/sock.c
index 341c565dbc26..c2ae77bb2075 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3223,36 +3223,6 @@ void __receive_sock(struct file *file)
}
}

-ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags)
-{
- ssize_t res;
- struct msghdr msg = {.msg_flags = flags};
- struct kvec iov;
- char *kaddr = kmap(page);
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- res = kernel_sendmsg(sock, &msg, &iov, 1, size);
- kunmap(page);
- return res;
-}
-EXPORT_SYMBOL(sock_no_sendpage);
-
-ssize_t sock_no_sendpage_locked(struct sock *sk, struct page *page,
- int offset, size_t size, int flags)
-{
- ssize_t res;
- struct msghdr msg = {.msg_flags = flags};
- struct kvec iov;
- char *kaddr = kmap(page);
-
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- res = kernel_sendmsg_locked(sk, &msg, &iov, 1, size);
- kunmap(page);
- return res;
-}
-EXPORT_SYMBOL(sock_no_sendpage_locked);
-
/*
* Default Socket Callbacks
*/
@@ -4008,7 +3978,7 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{

seq_printf(seq, "%-9s %4u %6d %6ld %-3s %6u %-3s %-10s "
- "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
+ "%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
sock_prot_inuse_get(seq_file_net(seq), proto),
@@ -4029,7 +3999,6 @@ static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
proto_method_implemented(proto->getsockopt),
proto_method_implemented(proto->sendmsg),
proto_method_implemented(proto->recvmsg),
- proto_method_implemented(proto->sendpage),
proto_method_implemented(proto->bind),
proto_method_implemented(proto->backlog_rcv),
proto_method_implemented(proto->hash),
@@ -4050,7 +4019,7 @@ static int proto_seq_show(struct seq_file *seq, void *v)
"maxhdr",
"slab",
"module",
- "cl co di ac io in de sh ss gs se re sp bi br ha uh gp em\n");
+ "cl co di ac io in de sh ss gs se re bi br ha uh gp em\n");
else
proto_seq_printf(seq, list_entry(v, struct proto, node));
return 0;
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index b780827f5e0a..ea808de374ea 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -1008,7 +1008,6 @@ static const struct proto_ops inet_dccp_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct inet_protosw dccp_v4_protosw = {
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index b9d7c3dd1cb3..23eb8159e3cd 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -1085,7 +1085,6 @@ static const struct proto_ops inet6_dccp_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/ieee802154/socket.c b/net/ieee802154/socket.c
index 1fa2fe041ec0..1238f036117f 100644
--- a/net/ieee802154/socket.c
+++ b/net/ieee802154/socket.c
@@ -426,7 +426,6 @@ static const struct proto_ops ieee802154_raw_ops = {
.sendmsg = ieee802154_sock_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

/* DGRAM Sockets (802.15.4 dataframes) */
@@ -990,7 +989,6 @@ static const struct proto_ops ieee802154_dgram_ops = {
.sendmsg = ieee802154_sock_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static void ieee802154_sock_destruct(struct sock *sk)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8db6747f892f..869b49933f15 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -827,23 +827,6 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(inet_sendmsg);

-ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- const struct proto *prot;
-
- if (unlikely(inet_send_prepare(sk)))
- return -EAGAIN;
-
- /* IPV6_ADDRFORM can change sk->sk_prot under us. */
- prot = READ_ONCE(sk->sk_prot);
- if (prot->sendpage)
- return prot->sendpage(sk, page, offset, size, flags);
- return sock_no_sendpage(sock, page, offset, size, flags);
-}
-EXPORT_SYMBOL(inet_sendpage);
-
INDIRECT_CALLABLE_DECLARE(int udp_recvmsg(struct sock *, struct msghdr *,
size_t, int, int *));
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
@@ -1046,12 +1029,10 @@ const struct proto_ops inet_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
- .sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
.read_skb = tcp_read_skb,
.sendmsg_locked = tcp_sendmsg_locked,
- .sendpage_locked = tcp_sendpage_locked,
.peek_len = tcp_peek_len,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
@@ -1080,7 +1061,6 @@ const struct proto_ops inet_dgram_ops = {
.read_skb = udp_read_skb,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
@@ -1111,7 +1091,6 @@ static const struct proto_ops inet_sockraw_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
#endif
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a8f8ccaed10e..bd01a1b23c7b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,40 +971,6 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
return min(copy, sk->sk_forward_alloc);
}

-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
- if (!(sk->sk_route_caps & NETIF_F_SG))
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
- tcp_rate_check_app_limited(sk); /* is sending application-limited? */
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return tcp_sendmsg_locked(sk, &msg, size);
-}
-EXPORT_SYMBOL_GPL(tcp_sendpage_locked);
-
-int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- int ret;
-
- lock_sock(sk);
- ret = tcp_sendpage_locked(sk, page, offset, size, flags);
- release_sock(sk);
-
- return ret;
-}
-EXPORT_SYMBOL(tcp_sendpage);
-
void tcp_free_fastopen_req(struct tcp_sock *tp)
{
if (tp->fastopen_req) {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index de37a4372437..ab83cfb9de22 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -482,23 +482,6 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
return copied ? copied : err;
}

-static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES,
- };
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- return tcp_bpf_sendmsg(sk, &msg, size);
-}
-
enum {
TCP_BPF_IPV4,
TCP_BPF_IPV6,
@@ -528,7 +511,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],

prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
- prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage;

prot[TCP_BPF_RX] = prot[TCP_BPF_BASE];
prot[TCP_BPF_RX].recvmsg = tcp_bpf_recvmsg_parser;
@@ -563,8 +545,7 @@ static int tcp_bpf_assert_proto_ops(struct proto *ops)
* indeed valid assumptions.
*/
return ops->recvmsg == tcp_recvmsg &&
- ops->sendmsg == tcp_sendmsg &&
- ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
+ ops->sendmsg == tcp_sendmsg ? 0 : -ENOTSUPP;
}

int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..5c2e1c1ca329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3112,7 +3112,6 @@ struct proto tcp_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
- .sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
.hash = inet_hash,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 097feb92e215..85bd5960f7ef 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1329,27 +1329,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udp_sendmsg);

-int udp_sendpage(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct bio_vec bvec;
- struct msghdr msg = {
- .msg_flags = flags | MSG_SPLICE_PAGES | MSG_MORE
- };
- int ret;
-
- bvec_set_page(&bvec, page, size, offset);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- lock_sock(sk);
- ret = udp_sendmsg(sk, &msg, size);
- release_sock(sk);
- return ret;
-}
-
#define UDP_SKB_IS_STATELESS 0x80000000

/* all head states (dst, sk, nf conntrack) except skb extensions are
@@ -2926,7 +2905,6 @@ struct proto udp_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
- .sendpage = udp_sendpage,
.release_cb = ip4_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
diff --git a/net/ipv4/udp_impl.h b/net/ipv4/udp_impl.h
index 4ba7a88a1b1d..e1ff3a375996 100644
--- a/net/ipv4/udp_impl.h
+++ b/net/ipv4/udp_impl.h
@@ -19,8 +19,6 @@ int udp_getsockopt(struct sock *sk, int level, int optname,

int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
int *addr_len);
-int udp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
- int flags);
void udp_destroy_sock(struct sock *sk);

#ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/udplite.c b/net/ipv4/udplite.c
index e0c9cc39b81e..69870f0afc6c 100644
--- a/net/ipv4/udplite.c
+++ b/net/ipv4/udplite.c
@@ -54,7 +54,6 @@ struct proto udplite_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
- .sendpage = udp_sendpage,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
.rehash = udp_v4_rehash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 38689bedfce7..769c76d59053 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -695,9 +695,7 @@ const struct proto_ops inet6_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
- .sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
- .sendpage_locked = tcp_sendpage_locked,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
.read_skb = tcp_read_skb,
@@ -728,7 +726,6 @@ const struct proto_ops inet6_dgram_ops = {
.recvmsg = inet6_recvmsg, /* retpoline's sake */
.read_skb = udp_read_skb,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index bac9ba747bde..c6c062678c0e 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1298,7 +1298,6 @@ const struct proto_ops inet6_sockraw_ops = {
.sendmsg = inet_sendmsg, /* ok */
.recvmsg = sock_common_recvmsg, /* ok */
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1bf93b61aa06..03ba1e389901 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2151,7 +2151,6 @@ struct proto tcpv6_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
- .sendpage = tcp_sendpage,
.backlog_rcv = tcp_v6_do_rcv,
.release_cb = tcp_release_cb,
.hash = inet6_hash,
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 9c9d379aafb1..94442e359fe2 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1005,24 +1005,6 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return err;
}

-static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-
-{
- struct bio_vec bvec;
- struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
-
- if (flags & MSG_SENDPAGE_NOTLAST)
- msg.msg_flags |= MSG_MORE;
-
- if (flags & MSG_OOB)
- return -EOPNOTSUPP;
-
- bvec_set_page(&bvec, page, offset, size);
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
- return kcm_sendmsg(sock, &msg, size);
-}
-
static int kcm_recvmsg(struct socket *sock, struct msghdr *msg,
size_t len, int flags)
{
@@ -1810,7 +1792,6 @@ static const struct proto_ops kcm_dgram_ops = {
.sendmsg = kcm_sendmsg,
.recvmsg = kcm_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = kcm_sendpage,
};

static const struct proto_ops kcm_seqpacket_ops = {
@@ -1831,7 +1812,6 @@ static const struct proto_ops kcm_seqpacket_ops = {
.sendmsg = kcm_sendmsg,
.recvmsg = kcm_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = kcm_sendpage,
.splice_read = kcm_splice_read,
};

diff --git a/net/key/af_key.c b/net/key/af_key.c
index a815f5ab4c49..bf59d42dc697 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -3757,7 +3757,6 @@ static const struct proto_ops pfkey_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,

/* Now the operations that really occur. */
.release = pfkey_release,
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4db5a554bdbd..d0dcbe3a4cd7 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -625,7 +625,6 @@ static const struct proto_ops l2tp_ip_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct inet_protosw l2tp_ip_protosw = {
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 2478aa60145f..49296ce14a90 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -751,7 +751,6 @@ static const struct proto_ops l2tp_ip6_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index da7fe94bea2e..addd94da2a81 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -1230,7 +1230,6 @@ static const struct proto_ops llc_ui_ops = {
.sendmsg = llc_ui_sendmsg,
.recvmsg = llc_ui_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static const char llc_proc_err_msg[] __initconst =
diff --git a/net/mctp/af_mctp.c b/net/mctp/af_mctp.c
index 3150f3f0c872..c6fe2e6b85dd 100644
--- a/net/mctp/af_mctp.c
+++ b/net/mctp/af_mctp.c
@@ -485,7 +485,6 @@ static const struct proto_ops mctp_dgram_ops = {
.sendmsg = mctp_sendmsg,
.recvmsg = mctp_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = mctp_compat_ioctl,
#endif
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3ad9c46202fc..ade89b8d0082 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3816,7 +3816,6 @@ static const struct proto_ops mptcp_stream_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
};

static struct inet_protosw mptcp_protosw = {
@@ -3911,7 +3910,6 @@ static const struct proto_ops mptcp_v6_stream_ops = {
.sendmsg = inet6_sendmsg,
.recvmsg = inet6_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet6_compat_ioctl,
#endif
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c64277659753..f70073a3bb49 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2841,7 +2841,6 @@ static const struct proto_ops netlink_ops = {
.sendmsg = netlink_sendmsg,
.recvmsg = netlink_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static const struct net_proto_family netlink_family_ops = {
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 5a4cb796150f..eb8ccbd58df7 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -1364,7 +1364,6 @@ static const struct proto_ops nr_proto_ops = {
.sendmsg = nr_sendmsg,
.recvmsg = nr_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct notifier_block nr_dev_notifier = {
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d4e76e2ae153..385bd4982b80 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4604,7 +4604,6 @@ static const struct proto_ops packet_ops_spkt = {
.sendmsg = packet_sendmsg_spkt,
.recvmsg = packet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static const struct proto_ops packet_ops = {
@@ -4626,7 +4625,6 @@ static const struct proto_ops packet_ops = {
.sendmsg = packet_sendmsg,
.recvmsg = packet_recvmsg,
.mmap = packet_mmap,
- .sendpage = sock_no_sendpage,
};

static const struct net_proto_family packet_family_ops = {
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 71e2caf6ab85..a246f7d0a817 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -441,7 +441,6 @@ const struct proto_ops phonet_dgram_ops = {
.sendmsg = pn_socket_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

const struct proto_ops phonet_stream_ops = {
@@ -462,7 +461,6 @@ const struct proto_ops phonet_stream_ops = {
.sendmsg = pn_socket_sendmsg,
.recvmsg = sock_common_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};
EXPORT_SYMBOL(phonet_stream_ops);

diff --git a/net/qrtr/af_qrtr.c b/net/qrtr/af_qrtr.c
index 5c2fb992803b..5bb7d680bd5f 100644
--- a/net/qrtr/af_qrtr.c
+++ b/net/qrtr/af_qrtr.c
@@ -1240,7 +1240,6 @@ static const struct proto_ops qrtr_proto_ops = {
.shutdown = sock_no_shutdown,
.release = qrtr_release,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto qrtr_proto = {
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 3ff6995244e5..01c4cdfef45d 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -653,7 +653,6 @@ static const struct proto_ops rds_proto_ops = {
.sendmsg = rds_sendmsg,
.recvmsg = rds_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static void rds_sock_destruct(struct sock *sk)
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index ca2b17f32670..49dafe9ac72f 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -1496,7 +1496,6 @@ static const struct proto_ops rose_proto_ops = {
.sendmsg = rose_sendmsg,
.recvmsg = rose_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct notifier_block rose_dev_notifier = {
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 102f5cbff91a..182495804f8f 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -938,7 +938,6 @@ static const struct proto_ops rxrpc_rpc_ops = {
.sendmsg = rxrpc_sendmsg,
.recvmsg = rxrpc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct proto rxrpc_proto = {
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index c365df24ad33..acb2d2a69268 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1135,7 +1135,6 @@ static const struct proto_ops inet_seqpacket_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

/* Registration with AF_INET family. */
diff --git a/net/socket.c b/net/socket.c
index 3e9bd8261357..8c7437c983da 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3543,54 +3543,6 @@ int kernel_getpeername(struct socket *sock, struct sockaddr *addr)
}
EXPORT_SYMBOL(kernel_getpeername);

-/**
- * kernel_sendpage - send a &page through a socket (kernel space)
- * @sock: socket
- * @page: page
- * @offset: page offset
- * @size: total size in bytes
- * @flags: flags (MSG_DONTWAIT, ...)
- *
- * Returns the total amount sent in bytes or an error.
- */
-
-int kernel_sendpage(struct socket *sock, struct page *page, int offset,
- size_t size, int flags)
-{
- if (sock->ops->sendpage) {
- /* Warn in case the improper page to zero-copy send */
- WARN_ONCE(!sendpage_ok(page), "improper page for zero-copy send");
- return sock->ops->sendpage(sock, page, offset, size, flags);
- }
- return sock_no_sendpage(sock, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage);
-
-/**
- * kernel_sendpage_locked - send a &page through the locked sock (kernel space)
- * @sk: sock
- * @page: page
- * @offset: page offset
- * @size: total size in bytes
- * @flags: flags (MSG_DONTWAIT, ...)
- *
- * Returns the total amount sent in bytes or an error.
- * Caller must hold @sk.
- */
-
-int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
- size_t size, int flags)
-{
- struct socket *sock = sk->sk_socket;
-
- if (sock->ops->sendpage_locked)
- return sock->ops->sendpage_locked(sk, page, offset, size,
- flags);
-
- return sock_no_sendpage_locked(sk, page, offset, size, flags);
-}
-EXPORT_SYMBOL(kernel_sendpage_locked);
-
/**
* kernel_sock_shutdown - shut down part of a full-duplex connection (kernel space)
* @sock: socket
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 37edfe10f8c6..d2072fbf3272 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3375,7 +3375,6 @@ static const struct proto_ops msg_ops = {
.sendmsg = tipc_sendmsg,
.recvmsg = tipc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};

static const struct proto_ops packet_ops = {
@@ -3396,7 +3395,6 @@ static const struct proto_ops packet_ops = {
.sendmsg = tipc_send_packet,
.recvmsg = tipc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};

static const struct proto_ops stream_ops = {
@@ -3417,7 +3415,6 @@ static const struct proto_ops stream_ops = {
.sendmsg = tipc_sendstream,
.recvmsg = tipc_recvstream,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage
};

static const struct net_proto_family tipc_family_ops = {
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 35b2f7ee2fa3..ff02697f484b 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -936,7 +936,6 @@ static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG]
ops[TLS_BASE][TLS_BASE] = *base;

ops[TLS_SW ][TLS_BASE] = ops[TLS_BASE][TLS_BASE];
- ops[TLS_SW ][TLS_BASE].sendpage_locked = tls_sw_sendpage_locked;

ops[TLS_BASE][TLS_SW ] = ops[TLS_BASE][TLS_BASE];
ops[TLS_BASE][TLS_SW ].splice_read = tls_sw_splice_read;
@@ -946,17 +945,14 @@ static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG]

#ifdef CONFIG_TLS_DEVICE
ops[TLS_HW ][TLS_BASE] = ops[TLS_BASE][TLS_BASE];
- ops[TLS_HW ][TLS_BASE].sendpage_locked = NULL;

ops[TLS_HW ][TLS_SW ] = ops[TLS_BASE][TLS_SW ];
- ops[TLS_HW ][TLS_SW ].sendpage_locked = NULL;

ops[TLS_BASE][TLS_HW ] = ops[TLS_BASE][TLS_SW ];

ops[TLS_SW ][TLS_HW ] = ops[TLS_SW ][TLS_SW ];

ops[TLS_HW ][TLS_HW ] = ops[TLS_HW ][TLS_SW ];
- ops[TLS_HW ][TLS_HW ].sendpage_locked = NULL;
#endif
#ifdef CONFIG_TLS_TOE
ops[TLS_HW_RECORD][TLS_HW_RECORD] = *base;
@@ -1004,7 +1000,6 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],

prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
prot[TLS_SW][TLS_BASE].sendmsg = tls_sw_sendmsg;
- prot[TLS_SW][TLS_BASE].sendpage = tls_sw_sendpage;

prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE];
prot[TLS_BASE][TLS_SW].recvmsg = tls_sw_recvmsg;
@@ -1019,11 +1014,9 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
#ifdef CONFIG_TLS_DEVICE
prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
prot[TLS_HW][TLS_BASE].sendmsg = tls_device_sendmsg;
- prot[TLS_HW][TLS_BASE].sendpage = tls_device_sendpage;

prot[TLS_HW][TLS_SW] = prot[TLS_BASE][TLS_SW];
prot[TLS_HW][TLS_SW].sendmsg = tls_device_sendmsg;
- prot[TLS_HW][TLS_SW].sendpage = tls_device_sendpage;

prot[TLS_BASE][TLS_HW] = prot[TLS_BASE][TLS_SW];

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 88b91005567e..751715ade2ae 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -758,8 +758,6 @@ static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
static int unix_shutdown(struct socket *, int);
static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
-static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
- size_t size, int flags);
static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos,
struct pipe_inode_info *, size_t size,
unsigned int flags);
@@ -852,7 +850,6 @@ static const struct proto_ops unix_stream_ops = {
.recvmsg = unix_stream_recvmsg,
.read_skb = unix_stream_read_skb,
.mmap = sock_no_mmap,
- .sendpage = unix_stream_sendpage,
.splice_read = unix_stream_splice_read,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
@@ -878,7 +875,6 @@ static const struct proto_ops unix_dgram_ops = {
.read_skb = unix_read_skb,
.recvmsg = unix_dgram_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
};
@@ -902,7 +898,6 @@ static const struct proto_ops unix_seqpacket_ops = {
.sendmsg = unix_seqpacket_sendmsg,
.recvmsg = unix_seqpacket_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_peek_off = unix_set_peek_off,
.show_fdinfo = unix_show_fdinfo,
};
@@ -1839,24 +1834,6 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
}
}

-static int maybe_init_creds(struct scm_cookie *scm,
- struct socket *socket,
- const struct sock *other)
-{
- int err;
- struct msghdr msg = { .msg_controllen = 0 };
-
- err = scm_send(socket, &msg, scm, false);
- if (err)
- return err;
-
- if (unix_passcred_enabled(socket, other)) {
- scm->pid = get_pid(task_tgid(current));
- current_uid_gid(&scm->creds.uid, &scm->creds.gid);
- }
- return err;
-}
-
static bool unix_skb_scm_eq(struct sk_buff *skb,
struct scm_cookie *scm)
{
@@ -2349,122 +2326,6 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
return sent ? : err;
}

-static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
- int offset, size_t size, int flags)
-{
- int err;
- bool send_sigpipe = false;
- bool init_scm = true;
- struct scm_cookie scm;
- struct sock *other, *sk = socket->sk;
- struct sk_buff *skb, *newskb = NULL, *tail = NULL;
-
- if (flags & MSG_OOB)
- return -EOPNOTSUPP;
-
- other = unix_peer(sk);
- if (!other || sk->sk_state != TCP_ESTABLISHED)
- return -ENOTCONN;
-
- if (false) {
-alloc_skb:
- unix_state_unlock(other);
- mutex_unlock(&unix_sk(other)->iolock);
- newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
- &err, 0);
- if (!newskb)
- goto err;
- }
-
- /* we must acquire iolock as we modify already present
- * skbs in the sk_receive_queue and mess with skb->len
- */
- err = mutex_lock_interruptible(&unix_sk(other)->iolock);
- if (err) {
- err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
- goto err;
- }
-
- if (sk->sk_shutdown & SEND_SHUTDOWN) {
- err = -EPIPE;
- send_sigpipe = true;
- goto err_unlock;
- }
-
- unix_state_lock(other);
-
- if (sock_flag(other, SOCK_DEAD) ||
- other->sk_shutdown & RCV_SHUTDOWN) {
- err = -EPIPE;
- send_sigpipe = true;
- goto err_state_unlock;
- }
-
- if (init_scm) {
- err = maybe_init_creds(&scm, socket, other);
- if (err)
- goto err_state_unlock;
- init_scm = false;
- }
-
- skb = skb_peek_tail(&other->sk_receive_queue);
- if (tail && tail == skb) {
- skb = newskb;
- } else if (!skb || !unix_skb_scm_eq(skb, &scm)) {
- if (newskb) {
- skb = newskb;
- } else {
- tail = skb;
- goto alloc_skb;
- }
- } else if (newskb) {
- /* this is fast path, we don't necessarily need to
- * call to kfree_skb even though with newskb == NULL
- * this - does no harm
- */
- consume_skb(newskb);
- newskb = NULL;
- }
-
- if (skb_append_pagefrags(skb, page, offset, size)) {
- tail = skb;
- goto alloc_skb;
- }
-
- skb->len += size;
- skb->data_len += size;
- skb->truesize += size;
- refcount_add(size, &sk->sk_wmem_alloc);
-
- if (newskb) {
- err = unix_scm_to_skb(&scm, skb, false);
- if (err)
- goto err_state_unlock;
- spin_lock(&other->sk_receive_queue.lock);
- __skb_queue_tail(&other->sk_receive_queue, newskb);
- spin_unlock(&other->sk_receive_queue.lock);
- }
-
- unix_state_unlock(other);
- mutex_unlock(&unix_sk(other)->iolock);
-
- other->sk_data_ready(other);
- scm_destroy(&scm);
- return size;
-
-err_state_unlock:
- unix_state_unlock(other);
-err_unlock:
- mutex_unlock(&unix_sk(other)->iolock);
-err:
- kfree_skb(newskb);
- if (send_sigpipe && !(flags & MSG_NOSIGNAL))
- send_sig(SIGPIPE, current, 0);
- if (!init_scm)
- scm_destroy(&scm);
- return err;
-}
-
static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 19aea7cba26e..d0e476755cdc 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1271,7 +1271,6 @@ static const struct proto_ops vsock_dgram_ops = {
.sendmsg = vsock_dgram_sendmsg,
.recvmsg = vsock_dgram_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
@@ -2186,7 +2185,6 @@ static const struct proto_ops vsock_stream_ops = {
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
.set_rcvlowat = vsock_set_rcvlowat,
};

@@ -2208,7 +2206,6 @@ static const struct proto_ops vsock_seqpacket_ops = {
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static int vsock_create(struct net *net, struct socket *sock,
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 5c7ad301d742..0fb5143bec7a 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1757,7 +1757,6 @@ static const struct proto_ops x25_proto_ops = {
.sendmsg = x25_sendmsg,
.recvmsg = x25_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
};

static struct packet_type x25_packet_type __read_mostly = {
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2ac58b282b5e..eff1f0aaa4b5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1386,7 +1386,6 @@ static const struct proto_ops xsk_proto_ops = {
.sendmsg = xsk_sendmsg,
.recvmsg = xsk_recvmsg,
.mmap = xsk_mmap,
- .sendpage = sock_no_sendpage,
};

static void xsk_destruct(struct sock *sk)

2023-03-31 16:44:28

by David Howells

Subject: [PATCH v3 51/55] smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES

Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
its sendmsg implementation.

Signed-off-by: David Howells <[email protected]>
cc: Karsten Graul <[email protected]>
cc: Wenjia Zhang <[email protected]>
cc: Jan Karcher <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/smc/af_smc.c | 29 -----------------------------
net/smc/smc_stats.c | 2 +-
net/smc/smc_stats.h | 1 -
net/smc/smc_tx.c | 16 ----------------
net/smc/smc_tx.h | 2 --
5 files changed, 1 insertion(+), 49 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index a4cccdfdc00a..d4113c8a7cda 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -3125,34 +3125,6 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
return put_user(answ, (int __user *)arg);
}

-static ssize_t smc_sendpage(struct socket *sock, struct page *page,
- int offset, size_t size, int flags)
-{
- struct sock *sk = sock->sk;
- struct smc_sock *smc;
- int rc = -EPIPE;
-
- smc = smc_sk(sk);
- lock_sock(sk);
- if (sk->sk_state != SMC_ACTIVE) {
- release_sock(sk);
- goto out;
- }
- release_sock(sk);
- if (smc->use_fallback) {
- rc = kernel_sendpage(smc->clcsock, page, offset,
- size, flags);
- } else {
- lock_sock(sk);
- rc = smc_tx_sendpage(smc, page, offset, size, flags);
- release_sock(sk);
- SMC_STAT_INC(smc, sendpage_cnt);
- }
-
-out:
- return rc;
-}
-
/* Map the affected portions of the rmbe into an spd, note the number of bytes
* to splice in conn->splice_pending, and press 'go'. Delays consumer cursor
* updates till whenever a respective page has been fully processed.
@@ -3224,7 +3196,6 @@ static const struct proto_ops smc_sock_ops = {
.sendmsg = smc_sendmsg,
.recvmsg = smc_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = smc_sendpage,
.splice_read = smc_splice_read,
};

diff --git a/net/smc/smc_stats.c b/net/smc/smc_stats.c
index e80e34f7ac15..ca14c0f3a07d 100644
--- a/net/smc/smc_stats.c
+++ b/net/smc/smc_stats.c
@@ -227,7 +227,7 @@ static int smc_nl_fill_stats_tech_data(struct sk_buff *skb,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SENDPAGE_CNT,
- smc_tech->sendpage_cnt,
+ 0,
SMC_NLA_STATS_PAD))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CORK_CNT,
diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h
index 84b7ecd8c05c..b60fe1eb37ab 100644
--- a/net/smc/smc_stats.h
+++ b/net/smc/smc_stats.h
@@ -71,7 +71,6 @@ struct smc_stats_tech {
u64 clnt_v2_succ_cnt;
u64 srv_v1_succ_cnt;
u64 srv_v2_succ_cnt;
- u64 sendpage_cnt;
u64 urg_data_cnt;
u64 splice_cnt;
u64 cork_cnt;
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index f4b6a71ac488..d31ce8209fa2 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -298,22 +298,6 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
return rc;
}

-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags)
-{
- struct msghdr msg = {.msg_flags = flags};
- char *kaddr = kmap(page);
- struct kvec iov;
- int rc;
-
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
- rc = smc_tx_sendmsg(smc, &msg, size);
- kunmap(page);
- return rc;
-}
-
/***************************** sndbuf consumer *******************************/

/* sndbuf consumer: actual data transfer of one target chunk with ISM write */
diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
index 34b578498b1f..a59f370b8b43 100644
--- a/net/smc/smc_tx.h
+++ b/net/smc/smc_tx.h
@@ -31,8 +31,6 @@ void smc_tx_pending(struct smc_connection *conn);
void smc_tx_work(struct work_struct *work);
void smc_tx_init(struct smc_sock *smc);
int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
-int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
- size_t size, int flags);
int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
void smc_tx_sndbuf_nonfull(struct smc_sock *smc);
void smc_tx_consumer_update(struct smc_connection *conn, bool force);

2023-03-31 17:39:47

by David Howells

Subject: Test program for AF_KCM

Hi Tom,

I found a test program for AF_KCM:

https://gist.githubusercontent.com/peo3/fd0e266a3852d3422c08854aba96bff5/raw/98e02e120bd4b4bc5d499c4510e5879bb3a023d7/kcm-sample.c

I don't suppose you have a version that compiles? It seems that the userland
BPF API has changed.

Thanks,
David

2023-04-02 15:12:40

by Willem de Bruijn

Subject: RE: [PATCH v3 03/55] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag

David Howells wrote:
> Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
> network protocol that it should splice pages from the source iterator
> rather than copying the data if it can. This flag is added to a list that
> is cleared by sendmsg and recvmsg syscalls on entry.

nit: comment no longer matches implementation: recvmsg
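
For illustration, the "cleared on entry" behaviour described in the quoted
paragraph can be modelled in userspace C as below. This is only a sketch:
the EX_ names, the mask and the numeric values are assumptions made for the
example and are not taken from the patch or from the kernel headers.

```c
#include <assert.h>

/* Illustrative values only; not taken from the kernel headers. */
#define EX_MSG_DONTWAIT      0x40u
#define EX_MSG_MORE          0x8000u
#define EX_MSG_SPLICE_PAGES  0x8000000u

/* Hypothetical mask of flags reserved for in-kernel callers. */
#define EX_MSG_INTERNAL_SENDMSG_FLAGS (EX_MSG_SPLICE_PAGES)

/* Strip internal-only flags from a value supplied by userspace. */
static unsigned int ex_sanitize_sendmsg_flags(unsigned int user_flags)
{
	unsigned int flags = user_flags & ~EX_MSG_INTERNAL_SENDMSG_FLAGS;

	/* Post-condition: nothing internal survives the syscall boundary. */
	assert(!(flags & EX_MSG_INTERNAL_SENDMSG_FLAGS));
	return flags;
}
```

With such a mask applied on the sendmsg path, userspace could not request
splicing directly; only in-kernel callers that build their own msghdr could
set the flag.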

> This is intended as a replacement for the ->sendpage() op, allowing a way
> to splice in several multipage folios in one go.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Willem de Bruijn <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]

Aside from that

Reviewed-by: Willem de Bruijn <[email protected]>

2023-04-02 15:16:57

by Willem de Bruijn

Subject: RE: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

David Howells wrote:
> Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
> spliced from the source iterator.
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Willem de Bruijn <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> ---
> net/ipv4/ip_output.c | 102 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 99 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4e4e308c3230..e2eaba817c1f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -956,6 +956,79 @@ csum_page(struct page *page, int offset, int copy)
> return csum;
> }
>
> +/*
> + * Allocate a packet for MSG_SPLICE_PAGES.
> + */
> +static int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
> + unsigned int fragheaderlen, unsigned int maxfraglen,
> + unsigned int hh_len)
> +{
> + struct sk_buff *skb_prev = *pskb, *skb;
> + unsigned int fraggap = skb_prev->len - maxfraglen;
> + unsigned int alloclen = fragheaderlen + hh_len + fraggap + 15;
> +
> + skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
> + if (unlikely(!skb))
> + return -ENOBUFS;
> +
> + /* Fill in the control structures */
> + skb->ip_summed = CHECKSUM_NONE;
> + skb->csum = 0;
> + skb_reserve(skb, hh_len);
> +
> + /* Find where to start putting bytes. */
> + skb_put(skb, fragheaderlen + fraggap);
> + skb_reset_network_header(skb);
> + skb->transport_header = skb->network_header + fragheaderlen;
> + if (fraggap) {
> + skb->csum = skb_copy_and_csum_bits(skb_prev, maxfraglen,
> + skb_transport_header(skb),
> + fraggap);
> + skb_prev->csum = csum_sub(skb_prev->csum, skb->csum);
> + pskb_trim_unique(skb_prev, maxfraglen);
> + }
> +
> + /* Put the packet on the pending queue. */
> + __skb_queue_tail(&sk->sk_write_queue, skb);
> + *pskb = skb;
> + return 0;
> +}
> +
> +/*
> + * Add (or copy) data pages for MSG_SPLICE_PAGES.
> + */
> +static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
> + void *from, int *pcopy)
> +{
> + struct msghdr *msg = from;
> + struct page *page = NULL, **pages = &page;
> + ssize_t copy = *pcopy;
> + size_t off;
> + int err;
> +
> + copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
> + if (copy <= 0)
> + return copy ?: -EIO;
> +
> + err = skb_append_pagefrags(skb, page, off, copy);
> + if (err < 0) {
> + iov_iter_revert(&msg->msg_iter, copy);
> + return err;
> + }
> +
> + if (skb->ip_summed == CHECKSUM_NONE) {
> + __wsum csum;
> +
> + csum = csum_page(page, off, copy);
> + skb->csum = csum_block_add(skb->csum, csum, skb->len);
> + }
> +
> + skb_len_add(skb, copy);
> + refcount_add(copy, &sk->sk_wmem_alloc);
> + *pcopy = copy;
> + return 0;
> +}

These functions are derived from and replace ip_append_page.
That can be removed once udp_sendpage is converted?

>
> static int __ip_append_data(struct sock *sk,
> struct flowi4 *fl4,
> struct sk_buff_head *queue,
> @@ -977,7 +1050,7 @@ static int __ip_append_data(struct sock *sk,
> int err;
> int offset = 0;
> bool zc = false;
> - unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
> + unsigned int maxfraglen, fragheaderlen, maxnonfragsize, initial_length;
> int csummode = CHECKSUM_NONE;
> struct rtable *rt = (struct rtable *)cork->dst;
> unsigned int wmem_alloc_delta = 0;
> @@ -1017,6 +1090,7 @@ static int __ip_append_data(struct sock *sk,
> (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
> csummode = CHECKSUM_PARTIAL;
>
> + initial_length = length;
> if ((flags & MSG_ZEROCOPY) && length) {
> struct msghdr *msg = from;
>
> @@ -1047,6 +1121,14 @@ static int __ip_append_data(struct sock *sk,
> skb_zcopy_set(skb, uarg, &extra_uref);
> }
> }
> + } else if ((flags & MSG_SPLICE_PAGES) && length) {
> + if (inet->hdrincl)
> + return -EPERM;
> + if (rt->dst.dev->features & NETIF_F_SG)
> + /* We need an empty buffer to attach stuff to */
> + initial_length = transhdrlen;

I still don't entirely understand what initial_length means.

More importantly, transhdrlen can be zero: either when this is called
for RAW rather than UDP, or when this is a subsequent call for a packet
that is being held with MSG_MORE.

This works fine for existing use-cases, which go to alloc_new_skb.
I'm not sure how this case would be different, but the comment implies
that it is.
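
To make the question concrete, here is a userspace sketch of the arithmetic
in the quoted hunks further down that switch from length to initial_length
(datalen, then fraglen). The helper plan_fragment() and struct frag_plan are
hypothetical names for this example; only the three assignments mirror the
quoted code.

```c
#include <assert.h>

struct frag_plan {
	unsigned int datalen;	/* payload bytes accounted to this skb */
	unsigned int fraglen;	/* datalen plus the fragment header */
};

/* Model of the datalen/fraglen computation after alloc_new_skb. */
static struct frag_plan plan_fragment(unsigned int initial_length,
				      unsigned int fraggap,
				      unsigned int mtu,
				      unsigned int fragheaderlen,
				      unsigned int maxfraglen)
{
	struct frag_plan p;

	/* Sanity check so the unsigned subtractions cannot wrap. */
	assert(mtu > fragheaderlen && maxfraglen >= fragheaderlen);

	p.datalen = initial_length + fraggap;
	if (p.datalen > mtu - fragheaderlen)
		p.datalen = maxfraglen - fragheaderlen;
	p.fraglen = p.datalen + fragheaderlen;
	return p;
}
```

In this model, initial_length == 0 with no fraggap gives datalen == 0 and
fraglen == fragheaderlen, i.e. a buffer that accounts for headers only.
Whether that is the intended "empty buffer to attach stuff to" when
transhdrlen is zero is exactly what is unclear to me.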

> + else
> + flags &= ~MSG_SPLICE_PAGES;
> }
>
> cork->length += length;
> @@ -1074,6 +1156,16 @@ static int __ip_append_data(struct sock *sk,
> unsigned int alloclen, alloc_extra;
> unsigned int pagedlen;
> struct sk_buff *skb_prev;
> +
> + if (unlikely(flags & MSG_SPLICE_PAGES)) {
> + err = __ip_splice_alloc(sk, &skb, fragheaderlen,
> + maxfraglen, hh_len);
> + if (err < 0)
> + goto error;
> + continue;
> + }
> + initial_length = length;
> +
> alloc_new_skb:
> skb_prev = skb;
> if (skb_prev)
> @@ -1085,7 +1177,7 @@ static int __ip_append_data(struct sock *sk,
> * If remaining data exceeds the mtu,
> * we know we need more fragment(s).
> */
> - datalen = length + fraggap;
> + datalen = initial_length + fraggap;
> if (datalen > mtu - fragheaderlen)
> datalen = maxfraglen - fragheaderlen;
> fraglen = datalen + fragheaderlen;
> @@ -1099,7 +1191,7 @@ static int __ip_append_data(struct sock *sk,
> * because we have no idea what fragment will be
> * the last.
> */
> - if (datalen == length + fraggap)
> + if (datalen == initial_length + fraggap)
> alloc_extra += rt->dst.trailer_len;
>
> if ((flags & MSG_MORE) &&
> @@ -1206,6 +1298,10 @@ static int __ip_append_data(struct sock *sk,
> err = -EFAULT;
> goto error;
> }
> + } else if (flags & MSG_SPLICE_PAGES) {
> + err = __ip_splice_pages(sk, skb, from, &copy);
> + if (err < 0)
> + goto error;
> } else if (!zc) {
> int i = skb_shinfo(skb)->nr_frags;
>
>


2023-04-03 09:33:47

by Christoph Hellwig

Subject: Re: [PATCH v3 00/55] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)

On Fri, Mar 31, 2023 at 05:08:19PM +0100, David Howells wrote:
> Hi Willy, Dave, et al.,

Can we please finish the previous big API transitions before starting
yet another one? We still have 10 callers of iov_iter_get_pages2 and
of iov_iter_get_pages_alloc2 that need conversion to
iov_iter_extract_pages. Except for the legacy direct I/O code they
should be easy _and_ largely overlap what this series touches.

I'll gladly take on direct-io.c.

2023-04-03 09:44:20

by David Howells

Subject: Is AF_KCM functional?

Okay, I have a test program for AF_KCM that builds and works up to a point.
However, it doesn't seem to work for two reasons:

(1) When it clones a socket with SIOCKCMCLONE, it doesn't set the LSM context
on the new socket. This results in EACCES if, say, SELinux is enforcing.

ioctl(8, SIOCPROTOPRIVATE, 0x7ffe17cc3b24) = 0
ioctl(9, SIOCPROTOPRIVATE, 0x7ffe17cc3b24) = -1 EACCES (Permission denied)

from the SIOCKCMATTACH ioctl, so this won't work on a number of Linux
distributions, such as Fedora and RHEL.

(2) Assuming SELinux is set to non-enforcing mode, it then fails when trying
to attach the cloned KCM socket to the TCP socket:

ioctl(8, SIOCPROTOPRIVATE, 0x7ffddfb64f84) = 0
ioctl(9, SIOCPROTOPRIVATE, 0x7ffddfb64f84) = -1 EALREADY (Operation already in progress)

again from the SIOCKCMATTACH ioctl. This seems to be because the TCP
socket (csock in kcm_attach() in the kernel) has already got sk_user_data
set from the first ioctl on fd 8:

if (csk->sk_user_data) {
write_unlock_bh(&csk->sk_callback_lock);
kmem_cache_free(kcm_psockp, psock);
err = -EALREADY;
goto out;
}

Now, this could be a bug in the test program. Since both fds 8 and 9 should
correspond to the same multiplexor, presumably the TCP socket only needs
attaching once (note that the TCP socket is obtained from accept() in this
case).
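
The -EALREADY path can be modelled in userspace. This is only a sketch, assuming the sole relevant state is the TCP socket's sk_user_data pointer; struct fake_sock and fake_kcm_attach() are hypothetical stand-ins for illustration, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of struct sock that matters here. */
struct fake_sock {
	void *sk_user_data;	/* non-NULL once a psock has been attached */
};

/* Hypothetical model of the sk_user_data check quoted from kcm_attach(). */
static int fake_kcm_attach(struct fake_sock *csk, void *psock)
{
	assert(csk != NULL);

	if (csk->sk_user_data)
		return -EALREADY;	/* the TCP socket is already attached */
	csk->sk_user_data = psock;
	return 0;
}
```

In this model, a first call on a fresh fake_sock returns 0 and a second call on the same object returns -EALREADY, which matches the pair of ioctl results shown above. If that reading is right, attaching the accepted TCP socket once and sharing it through the multiplexor would be enough.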

David
---
/*
* A sample program of KCM.
*
* $ gcc kcm-sample.c -lbcc
* $ ./a.out 10000
*
* https://gist.github.com/dhowells/24b87fdf731884ed9ca19e9840c0c894
*/
#include <err.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <poll.h>

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <netinet/in.h>

/* libbcc */
#include <bcc/bcc_common.h>
#include <bcc/libbpf.h>
#include <bpf/bpf.h>

#include <linux/bpf.h>

/* From linux/kcm.h */
struct kcm_clone {
int fd;
};

struct kcm_attach {
int fd;
int bpf_fd;
};

#ifndef AF_KCM
/* From linux/socket.h */
#define AF_KCM 41 /* Kernel Connection Multiplexor*/
#endif

#ifndef KCMPROTO_CONNECTED
/* From linux/kcm.h */
#define KCMPROTO_CONNECTED 0
#endif

#ifndef SIOCKCMCLONE
/* From linux/sockios.h */
#define SIOCPROTOPRIVATE 0x89E0 /* to 89EF */
/* From linux/kcm.h */
#define SIOCKCMATTACH (SIOCPROTOPRIVATE + 0)
#define SIOCKCMCLONE (SIOCPROTOPRIVATE + 2)
#endif

struct my_proto {
struct _hdr {
uint32_t len;
} hdr;
char data[32];
};

const char *bpf_prog_string = " \
ssize_t bpf_prog1(struct __sk_buff *skb) \
{ \
return load_half(skb, 0) + 4; \
} \
";

int servsock_init(int port)
{
int s, error;
struct sockaddr_in addr;

s = socket(AF_INET, SOCK_STREAM, 0);

memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_port = htons(port);
addr.sin_addr.s_addr = INADDR_ANY;
error = bind(s, (struct sockaddr *)&addr, sizeof(addr));
if (error == -1)
err(EXIT_FAILURE, "bind");

error = listen(s, 10);
if (error == -1)
err(EXIT_FAILURE, "listen");

return s;
}

int bpf_init(void)
{
int fd, map_fd;
void *mod;
int key;
long long value = 0;

mod = bpf_module_create_c_from_string(bpf_prog_string, 0, NULL, 0, 0, NULL);
fd = bcc_prog_load(
BPF_PROG_TYPE_SOCKET_FILTER,
"bpf_prog1",
bpf_function_start(mod, "bpf_prog1"),
bpf_function_size(mod, "bpf_prog1"),
bpf_module_license(mod),
bpf_module_kern_version(mod),
0, NULL, 0);

if (fd == -1)
exit(1);
return fd;
}

void client(int port)
{
int s, error;
struct sockaddr_in addr;
struct hostent *host;
struct my_proto my_msg;
int len;

printf("client is starting\n");

s = socket(AF_INET, SOCK_STREAM, 0);
if (s == -1)
err(EXIT_FAILURE, "socket");

memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_port = htons(port);
host = gethostbyname("localhost");
if (host == NULL)
err(EXIT_FAILURE, "gethostbyname");
memcpy(&addr.sin_addr, host->h_addr, host->h_length);

error = connect(s, (struct sockaddr *)&addr, sizeof(addr));
if (error == -1)
err(EXIT_FAILURE, "connect");

len = sprintf(my_msg.data, "hello");
my_msg.data[len] = '\0';
my_msg.hdr.len = htons(len + 1);

len = write(s, &my_msg, sizeof(my_msg.hdr) + len + 1);
if (len == -1)
err(EXIT_FAILURE, "write");
printf("client sent data\n");

printf("client is waiting a reply\n");
len = read(s, &my_msg, sizeof(my_msg));
if (len == -1)
err(EXIT_FAILURE, "read");

printf("\"%s\" from server\n", my_msg.data);
printf("client received data\n");

close(s);
}

int kcm_init(void)
{
int kcmfd;

kcmfd = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
if (kcmfd == -1)
err(EXIT_FAILURE, "socket(AF_KCM)");

return kcmfd;
}

int kcm_clone(int kcmfd)
{
int error;
struct kcm_clone clone_info;

memset(&clone_info, 0, sizeof(clone_info));
error = ioctl(kcmfd, SIOCKCMCLONE, &clone_info);
if (error == -1)
err(EXIT_FAILURE, "ioctl(SIOCKCMCLONE)");

return clone_info.fd;
}

int kcm_attach(int kcmfd, int csock, int bpf_prog_fd)
{
int error;
struct kcm_attach attach_info;

memset(&attach_info, 0, sizeof(attach_info));
attach_info.fd = csock;
attach_info.bpf_fd = bpf_prog_fd;

error = ioctl(kcmfd, SIOCKCMATTACH, &attach_info);
if (error == -1)
err(EXIT_FAILURE, "ioctl(SIOCKCMATTACH)");

return 0;
}

void process(int kcmfd0, int kcmfd1)
{
struct my_proto my_msg;
int error, len;
struct pollfd fds[2];
struct msghdr msg;
struct iovec iov;
int fd;

fds[0].fd = kcmfd0;
fds[0].events = POLLIN;
fds[0].revents = 0;
fds[1].fd = kcmfd1;
fds[1].events = POLLIN;
fds[1].revents = 0;

printf("server is waiting data\n");
error = poll(fds, 2, -1);
if (error == -1)
err(EXIT_FAILURE, "poll");

if (fds[0].revents & POLLIN)
fd = fds[0].fd;
else if (fds[1].revents & POLLIN)
fd = fds[1].fd;
iov.iov_base = &my_msg;
iov.iov_len = sizeof(my_msg);

memset(&msg, 0, sizeof(msg));
msg.msg_iov = &iov;
msg.msg_iovlen = 1;

printf("server is receiving data\n");
len = recvmsg(fd, &msg, 0);
if (len == -1)
err(EXIT_FAILURE, "recvmsg");
printf("\"%s\" from client\n", my_msg.data);
printf("server received data\n");

len = sprintf(my_msg.data, "goodbye");
my_msg.data[len] = '\0';
my_msg.hdr.len = htons(len + 1);

len = sendmsg(fd, &msg, 0);
if (len == -1)
err(EXIT_FAILURE, "sendmsg");
}

void server(int tcpfd, int bpf_prog_fd)
{
int kcmfd0, error, kcmfd1;
struct sockaddr_in client;
socklen_t len;
int csock;

printf("server is starting\n");

kcmfd0 = kcm_init();
kcmfd1 = kcm_clone(kcmfd0);

len = sizeof(client);
csock = accept(tcpfd, (struct sockaddr *)&client, &len);
if (csock == -1)
err(EXIT_FAILURE, "accept");

kcm_attach(kcmfd0, csock, bpf_prog_fd);
kcm_attach(kcmfd1, csock, bpf_prog_fd);

process(kcmfd0, kcmfd1);

close(kcmfd0);
close(kcmfd1);
}

int main(int argc, char **argv)
{
int error, tcpfd, bpf_prog_fd;
pid_t pid;
int pipefd[2];
int dummy;

if (argc != 2) {
fprintf(stderr, "Format %s <port>\n", argv[0]);
exit(2);
}

error = pipe(pipefd);
if (error == -1)
err(EXIT_FAILURE, "pipe");

pid = fork();
if (pid == -1)
err(EXIT_FAILURE, "fork");

if (pid == 0) {
/* wait for server's ready */
read(pipefd[0], &dummy, sizeof(dummy));

client(atoi(argv[1]));

exit(0);
}

tcpfd = servsock_init(atoi(argv[1]));
bpf_prog_fd = bpf_init();

/* tell ready */
write(pipefd[1], &dummy, sizeof(dummy));

server(tcpfd, bpf_prog_fd);

waitpid(pid, NULL, 0);

close(bpf_prog_fd);
close(tcpfd);

return 0;
}

2023-04-03 09:56:16

by David Howells

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

Willem de Bruijn <[email protected]> wrote:

> > + } else if ((flags & MSG_SPLICE_PAGES) && length) {
> > + if (inet->hdrincl)
> > + return -EPERM;
> > + if (rt->dst.dev->features & NETIF_F_SG)
> > + /* We need an empty buffer to attach stuff to */
> > + initial_length = transhdrlen;
>
> I still don't entirely understand what initial_length means.
>
> More importantly, transhdrlen can be zero. If not called for UDP
> but for RAW. Or if this is a subsequent call to a packet that is
> being held with MSG_MORE.
>
> This works fine for existing use-cases, which go to alloc_new_skb.
> Not sure how this case would be different. But the comment alludes
> that it does.

The problem is that in the non-MSG_ZEROCOPY case, __ip_append_data() assumes
that it's going to copy the data it is given and will allocate sufficient
space in the skb in advance to hold it - but I don't want to do that because I
want to splice in the pages holding the data instead. However, I do need to
allocate space to hold the transport header.

Maybe I should change 'initial_length' to 'initial_alloc'? It represents the
amount I think we should allocate. Or maybe I should have a separate
allocation clause for MSG_SPLICE_PAGES?

I also wonder if __ip_append_data() really needs two places that call
getfrag().

David

2023-04-03 11:26:14

by David Howells

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

Willem de Bruijn <[email protected]> wrote:

> These functions are derived from and replace ip_append_page.
> That can be removed once udp_sendpage is converted?

Yeah, looks like it.

David

2023-04-03 13:50:37

by Willem de Bruijn

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > > + } else if ((flags & MSG_SPLICE_PAGES) && length) {
> > > + if (inet->hdrincl)
> > > + return -EPERM;
> > > + if (rt->dst.dev->features & NETIF_F_SG)
> > > + /* We need an empty buffer to attach stuff to */
> > > + initial_length = transhdrlen;
> >
> > I still don't entirely understand what initial_length means.
> >
> > More importantly, transhdrlen can be zero. If not called for UDP
> > but for RAW. Or if this is a subsequent call to a packet that is
> > being held with MSG_MORE.
> >
> > This works fine for existing use-cases, which go to alloc_new_skb.
> > Not sure how this case would be different. But the comment alludes
> > that it does.
>
> The problem is that in the non-MSG_ZEROCOPY case, __ip_append_data() assumes
> that it's going to copy the data it is given and will allocate sufficient
> space in the skb in advance to hold it - but I don't want to do that because I
> want to splice in the pages holding the data instead. However, I do need to
> allocate space to hold the transport header.
>
> Maybe I should change 'initial_length' to 'initial_alloc'? It represents the
> amount I think we should allocate. Or maybe I should have a separate
> allocation clause for MSG_SPLICE_PAGES?

The code already has to avoid allocation in the MSG_ZEROCOPY case. I
added alloc_len and paged_len for that purpose.

Only the transhdrlen will be copied with getfrag due to

copy = datalen - transhdrlen - fraggap - pagedlen

On next iteration in the loop, when remaining data fits in the skb,
there are three cases. The first is skipped due to !NETIF_F_SG. The
other two are either copy to page frags or zerocopy page frags.
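
The arithmetic above can be tried out in a rough userspace model. It assumes the paged case allocates only the headers in the linear area and sets pagedlen = datalen - transhdrlen, while the copying case allocates the whole fragment; struct skb_plan and plan_skb() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the lengths discussed above. */
struct skb_plan {
	unsigned int alloclen;	/* bytes allocated in the skb's linear area */
	unsigned int pagedlen;	/* bytes expected to arrive as page frags */
	int copy;		/* bytes getfrag() would copy into the linear area */
};

/*
 * Simplified model of the length bookkeeping (assumption, not kernel code):
 * datalen includes the transport header, as it does for UDP.
 */
static struct skb_plan plan_skb(unsigned int datalen,
				unsigned int fragheaderlen,
				unsigned int transhdrlen,
				unsigned int fraggap, bool paged)
{
	struct skb_plan p;

	assert(datalen >= transhdrlen + fraggap);

	if (paged) {
		/* Only the headers live in the linear area. */
		p.alloclen = fragheaderlen + transhdrlen;
		p.pagedlen = datalen - transhdrlen;
	} else {
		/* Everything is copied into the linear area. */
		p.alloclen = fragheaderlen + datalen;
		p.pagedlen = 0;
	}

	/* The formula quoted above; reduces to -fraggap when paged. */
	p.copy = (int)datalen - (int)transhdrlen - (int)fraggap -
		 (int)p.pagedlen;
	return p;
}
```

For example, plan_skb(1408, 20, 8, 0, true) gives alloclen 28, pagedlen 1400 and copy 0, whereas the same call with paged false gives alloclen 1428, pagedlen 0 and copy 1400.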

I think your code should be able to fit in. Maybe easier if it could
reuse the existing alloc_new_skb code to copy the transport header, as
MSG_ZEROCOPY does, rather than adding a new __ip_splice_alloc branch
that short-circuits that. Then __ip_splice_pages also does not need
code to copy the initial header. But this is trickier. It's fine to
leave as is.

Since your code currently does call continue before executing the rest
of that branch, no need to modify any code there? Notably replacing
length with initial_length, which itself is initialized to length in
all cases except for MSG_SPLICE_PAGES.

Just hardcode transhdrlen as the copy argument to __ip_splice_pages.

> I also wonder if __ip_append_data() really needs two places that call
> getfrag().
>
> David
>


2023-04-03 22:07:06

by David Howells

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

Willem de Bruijn <[email protected]> wrote:

> The code already has to avoid allocation in the MSG_ZEROCOPY case. I
> added alloc_len and paged_len for that purpose.
>
> Only the transhdrlen will be copied with getfrag due to
>
> copy = datalen - transhdrlen - fraggap - pagedlen
>
> On next iteration in the loop, when remaining data fits in the skb,
> there are three cases. The first is skipped due to !NETIF_F_SG. The
> other two are either copy to page frags or zerocopy page frags.
>
> I think your code should be able to fit in. Maybe easier if it could
> reuse the existing alloc_new_skb code to copy the transport header, as
> MSG_ZEROCOPY does, rather than adding a new __ip_splice_alloc branch
> that short-circuits that. Then __ip_splice_pages also does not need
> code to copy the initial header. But this is trickier. It's fine to
> leave as is.
>
> Since your code currently does call continue before executing the rest
> of that branch, no need to modify any code there? Notably replacing
> length with initial_length, which itself is initialized to length in
> all cases except for MSG_SPLICE_PAGES.

Okay. How about the attached? This seems to work. Just setting "paged" to
true seems to do the right thing in __ip_append_data() when allocating /
setting up the skbuff, and then __ip_splice_pages() is called to add the
pages.

David
---
commit 9ac72c83407c8aef4be0c84513ec27bac9cfbcaa
Author: David Howells <[email protected]>
Date: Thu Mar 9 14:27:29 2023 +0000

ip, udp: Support MSG_SPLICE_PAGES

Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6109a86a8a4b..fe2e48874191 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -956,6 +956,41 @@ csum_page(struct page *page, int offset, int copy)
return csum;
}

+/*
+ * Add (or copy) data pages for MSG_SPLICE_PAGES.
+ */
+static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
+ void *from, int *pcopy)
+{
+ struct msghdr *msg = from;
+ struct page *page = NULL, **pages = &page;
+ ssize_t copy = *pcopy;
+ size_t off;
+ int err;
+
+ copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
+ if (copy <= 0)
+ return copy ?: -EIO;
+
+ err = skb_append_pagefrags(skb, page, off, copy);
+ if (err < 0) {
+ iov_iter_revert(&msg->msg_iter, copy);
+ return err;
+ }
+
+ if (skb->ip_summed == CHECKSUM_NONE) {
+ __wsum csum;
+
+ csum = csum_page(page, off, copy);
+ skb->csum = csum_block_add(skb->csum, csum, skb->len);
+ }
+
+ skb_len_add(skb, copy);
+ refcount_add(copy, &sk->sk_wmem_alloc);
+ *pcopy = copy;
+ return 0;
+}
+
static int __ip_append_data(struct sock *sk,
struct flowi4 *fl4,
struct sk_buff_head *queue,
@@ -1047,6 +1082,15 @@ static int __ip_append_data(struct sock *sk,
skb_zcopy_set(skb, uarg, &extra_uref);
}
}
+ } else if ((flags & MSG_SPLICE_PAGES) && length) {
+ if (inet->hdrincl)
+ return -EPERM;
+ if (rt->dst.dev->features & NETIF_F_SG) {
+ /* We need an empty buffer to attach stuff to */
+ paged = true;
+ } else {
+ flags &= ~MSG_SPLICE_PAGES;
+ }
}

cork->length += length;
@@ -1206,6 +1250,10 @@ static int __ip_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
+ } else if (flags & MSG_SPLICE_PAGES) {
+ err = __ip_splice_pages(sk, skb, from, &copy);
+ if (err < 0)
+ goto error;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;


2023-04-04 10:55:53

by Bernard Metzler

Subject: RE: [PATCH v3 38/55] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit



> -----Original Message-----
> From: David Howells <[email protected]>
> Sent: Friday, 31 March 2023 18:09
> To: Matthew Wilcox <[email protected]>; David S. Miller
> <[email protected]>; Eric Dumazet <[email protected]>; Jakub Kicinski
> <[email protected]>; Paolo Abeni <[email protected]>
> Cc: David Howells <[email protected]>; Al Viro <[email protected]>;
> Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Jeff
> Layton <[email protected]>; Christian Brauner <[email protected]>; Chuck
> Lever III <[email protected]>; Linus Torvalds <torvalds@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected]; Bernard Metzler
> <[email protected]>; Tom Talpey <[email protected]>; linux-
> [email protected]
> Subject: [EXTERNAL] [PATCH v3 38/55] siw: Use sendmsg(MSG_SPLICE_PAGES)
> rather than sendpage to transmit
>
> When transmitting data, call down into TCP using a single sendmsg with
> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
> performing several sendmsg and sendpage calls to transmit header, data
> pages and trailer.
>
> To make this work, the data is assembled in a bio_vec array and attached to
> a BVEC-type iterator. The header and trailer (if present) are copied into
> page fragments that can be freed with put_page().
>
> Signed-off-by: David Howells <[email protected]>
> cc: Bernard Metzler <[email protected]>
> cc: Tom Talpey <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
> drivers/infiniband/sw/siw/siw_qp_tx.c | 234 ++++++--------------------
> 1 file changed, 48 insertions(+), 186 deletions(-)
>
> diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c
> b/drivers/infiniband/sw/siw/siw_qp_tx.c
> index fa5de40d85d5..fbe80c06d0ca 100644
> --- a/drivers/infiniband/sw/siw/siw_qp_tx.c
> +++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
> @@ -312,114 +312,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx,
> struct socket *s,
> return rv;
> }
>
> -/*
> - * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
> - *
> - * Using sendpage to push page by page appears to be less efficient
> - * than using sendmsg, even if data are copied.
> - *
> - * A general performance limitation might be the extra four bytes
> - * trailer checksum segment to be pushed after user data.
> - */
> -static int siw_tcp_sendpages(struct socket *s, struct page **page, int
> offset,
> - size_t size)
> -{
> - struct bio_vec bvec;
> - struct msghdr msg = {
> - .msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
> - MSG_SPLICE_PAGES),
> - };
> - struct sock *sk = s->sk;
> - int i = 0, rv = 0, sent = 0;
> -
> - while (size) {
> - size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
> -
> - if (size + offset <= PAGE_SIZE)
> - msg.msg_flags = MSG_MORE | MSG_DONTWAIT;
> -
> - tcp_rate_check_app_limited(sk);
> - bvec_set_page(&bvec, page[i], bytes, offset);
> - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
> -
> -try_page_again:
> - lock_sock(sk);
> - rv = tcp_sendmsg_locked(sk, &msg, size);
> - release_sock(sk);
> -
> - if (rv > 0) {
> - size -= rv;
> - sent += rv;
> - if (rv != bytes) {
> - offset += rv;
> - bytes -= rv;
> - goto try_page_again;
> - }
> - offset = 0;
> - } else {
> - if (rv == -EAGAIN || rv == 0)
> - break;
> - return rv;
> - }
> - i++;
> - }
> - return sent;
> -}
> -
> -/*
> - * siw_0copy_tx()
> - *
> - * Pushes list of pages to TCP socket. If pages from multiple
> - * SGE's, all referenced pages of each SGE are pushed in one
> - * shot.
> - */
> -static int siw_0copy_tx(struct socket *s, struct page **page,
> - struct siw_sge *sge, unsigned int offset,
> - unsigned int size)
> -{
> - int i = 0, sent = 0, rv;
> - int sge_bytes = min(sge->length - offset, size);
> -
> - offset = (sge->laddr + offset) & ~PAGE_MASK;
> -
> - while (sent != size) {
> - rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
> - if (rv >= 0) {
> - sent += rv;
> - if (size == sent || sge_bytes > rv)
> - break;
> -
> - i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
> - sge++;
> - sge_bytes = min(sge->length, size - sent);
> - offset = sge->laddr & ~PAGE_MASK;
> - } else {
> - sent = rv;
> - break;
> - }
> - }
> - return sent;
> -}
> -
> #define MAX_TRAILER (MPA_CRC_SIZE + 4)
>
> -static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int
> len)
> -{
> - int i;
> -
> - /*
> - * Work backwards through the array to honor the kmap_local_page()
> - * ordering requirements.
> - */
> - for (i = (len-1); i >= 0; i--) {
> - if (kmap_mask & BIT(i)) {
> - unsigned long addr = (unsigned long)iov[i].iov_base;
> -
> - kunmap_local((void *)(addr & PAGE_MASK));
> - }
> - }
> -}
> -
> /*
> * siw_tx_hdt() tries to push a complete packet to TCP where all
> * packet fragments are referenced by the elements of one iovec.
> @@ -439,15 +333,14 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> {
> struct siw_wqe *wqe = &c_tx->wqe_active;
> struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
> - struct kvec iov[MAX_ARRAY];
> - struct page *page_array[MAX_ARRAY];
> + struct bio_vec bvec[MAX_ARRAY];
> struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
> + void *trl, *t;
>
> int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
> unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
> sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
> pbl_idx = c_tx->pbl_idx;
> - unsigned long kmap_mask = 0L;
>
> if (c_tx->state == SIW_SEND_HDR) {
> if (c_tx->use_sendpage) {
> @@ -457,10 +350,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
>

Couldn't we now collapse the two header handling paths
into one, avoiding extra
'if (c_tx->use_sendpage) {} else {}' conditions?


> c_tx->state = SIW_SEND_DATA;
> } else {
> - iov[0].iov_base =
> - (char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
> - iov[0].iov_len = hdr_len =
> - c_tx->ctrl_len - c_tx->ctrl_sent;
> + const void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
> + void *h;
> +
> + rv = -ENOMEM;
> + hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
> + h = page_frag_memdup(NULL, hdr, hdr_len, GFP_NOFS,
> ULONG_MAX);

Let's stay with < 80 chars per line for the RDMA
subsystem code. Two more cases further down....thanks!

> + if (!h)
> + goto done;
> + bvec_set_virt(&bvec[0], h, hdr_len);
> seg = 1;
> }
> }
> @@ -478,28 +376,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> } else {
> is_kva = 1;
> }
> - if (is_kva && !c_tx->use_sendpage) {
> - /*
> - * tx from kernel virtual address: either inline data
> - * or memory region with assigned kernel buffer
> - */
> - iov[seg].iov_base =
> - (void *)(uintptr_t)(sge->laddr + sge_off);
> - iov[seg].iov_len = sge_len;
> -
> - if (do_crc)
> - crypto_shash_update(c_tx->mpa_crc_hd,
> - iov[seg].iov_base,
> - sge_len);
> - sge_off += sge_len;
> - data_len -= sge_len;
> - seg++;
> - goto sge_done;
> - }
>
> while (sge_len) {
> size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
> - void *kaddr;
>
> if (!is_kva) {
> struct page *p;
> @@ -512,33 +391,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> p = siw_get_upage(mem->umem,
> sge->laddr + sge_off);
> if (unlikely(!p)) {
> - siw_unmap_pages(iov, kmap_mask, seg);
> wqe->processed -= c_tx->bytes_unsent;
> rv = -EFAULT;
> goto done_crc;
> }
> - page_array[seg] = p;
> -
> - if (!c_tx->use_sendpage) {
> - void *kaddr = kmap_local_page(p);
> -
> - /* Remember for later kunmap() */
> - kmap_mask |= BIT(seg);
> - iov[seg].iov_base = kaddr + fp_off;
> - iov[seg].iov_len = plen;
> -
> - if (do_crc)
> - crypto_shash_update(
> - c_tx->mpa_crc_hd,
> - iov[seg].iov_base,
> - plen);
> - } else if (do_crc) {
> - kaddr = kmap_local_page(p);
> - crypto_shash_update(c_tx->mpa_crc_hd,
> - kaddr + fp_off,
> - plen);
> - kunmap_local(kaddr);
> - }
> +
> + bvec_set_page(&bvec[seg], p, plen, fp_off);
> } else {
> /*
> * Cast to an uintptr_t to preserve all 64 bits
> @@ -552,12 +410,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> * bits on a 64 bit platform and 32 bits on a
> * 32 bit platform.
> */
> - page_array[seg] = virt_to_page((void *)(va &
> PAGE_MASK));
> - if (do_crc)
> - crypto_shash_update(
> - c_tx->mpa_crc_hd,
> - (void *)va,
> - plen);
> + bvec_set_virt(&bvec[seg], (void *)va, plen);
> + }
> +
> + if (do_crc) {
> + void *kaddr = kmap_local_page(bvec[seg].bv_page);
> + crypto_shash_update(c_tx->mpa_crc_hd,
> + kaddr + bvec[seg].bv_offset,
> + bvec[seg].bv_len);
> + kunmap_local(kaddr);
> }
>
> sge_len -= plen;
> @@ -567,13 +428,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
>
> if (++seg > (int)MAX_ARRAY) {
> siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
> - siw_unmap_pages(iov, kmap_mask, seg-1);
> wqe->processed -= c_tx->bytes_unsent;
> rv = -EMSGSIZE;
> goto done_crc;
> }
> }
> -sge_done:
> +
> /* Update SGE variables at end of SGE */
> if (sge_off == sge->length &&
> (data_len != 0 || wqe->processed < wqe->bytes)) {
> @@ -582,15 +442,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> sge_off = 0;
> }
> }
> - /* trailer */
> - if (likely(c_tx->state != SIW_SEND_TRAILER)) {
> - iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
> - iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
> - } else {
> - iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
> - iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
> - }
>
> + /* Set the CRC in the trailer */
> if (c_tx->pad) {
> *(u32 *)c_tx->trailer.pad = 0;
> if (do_crc)
> @@ -603,23 +456,29 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> struct socket *s)
> else if (do_crc)
> crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);
>
> - data_len = c_tx->bytes_unsent;
> -
> - if (c_tx->use_sendpage) {
> - rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
> - c_tx->sge_off, data_len);
> - if (rv == data_len) {
> - rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
> - if (rv > 0)
> - rv += data_len;
> - else
> - rv = data_len;
> - }
> + /* Copy the trailer and add it to the output list */
> + if (likely(c_tx->state != SIW_SEND_TRAILER)) {
> + trl = &c_tx->trailer.pad[4 - c_tx->pad];
> + trl_len = MAX_TRAILER - (4 - c_tx->pad);
> } else {
> - rv = kernel_sendmsg(s, &msg, iov, seg + 1,
> - hdr_len + data_len + trl_len);
> - siw_unmap_pages(iov, kmap_mask, seg);
> + trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
> + trl_len = MAX_TRAILER - c_tx->ctrl_sent;
> }
> +
> + rv = -ENOMEM;
> + t = page_frag_memdup(NULL, trl, trl_len, GFP_NOFS, ULONG_MAX);
> + if (!t)
> + goto done_crc;
> + bvec_set_virt(&bvec[seg], t, trl_len);
> +
> + data_len = c_tx->bytes_unsent;
> +
> + if (c_tx->use_sendpage)
> + msg.msg_flags |= MSG_SPLICE_PAGES;
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
> + hdr_len + data_len + trl_len);
> + rv = sock_sendmsg(s, &msg);
> +
> if (rv < (int)hdr_len) {
> /* Not even complete hdr pushed or negative rv */
> wqe->processed -= data_len;
> @@ -680,6 +539,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct
> socket *s)
> }
> done_crc:
> c_tx->do_crc = 0;
> + if (c_tx->state == SIW_SEND_HDR)
> + folio_put(page_folio(bvec[0].bv_page));
> + folio_put(page_folio(bvec[seg].bv_page));
> done:
> return rv;
> }

2023-04-04 17:10:19

by Willem de Bruijn

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > The code already has to avoid allocation in the MSG_ZEROCOPY case. I
> > added alloc_len and paged_len for that purpose.
> >
> > Only the transhdrlen will be copied with getfrag due to
> >
> > copy = datalen - transhdrlen - fraggap - pagedlen
> >
> > On next iteration in the loop, when remaining data fits in the skb,
> > there are three cases. The first is skipped due to !NETIF_F_SG. The
> > other two are either copy to page frags or zerocopy page frags.
> >
> > I think your code should be able to fit in. Maybe easier if it could
> > reuse the existing alloc_new_skb code to copy the transport header, as
> > MSG_ZEROCOPY does, rather than adding a new __ip_splice_alloc branch
> > that short-circuits that. Then __ip_splice_pages also does not need
> > code to copy the initial header. But this is trickier. It's fine to
> > leave as is.
> >
> > Since your code currently does call continue before executing the rest
> > of that branch, no need to modify any code there? Notably replacing
> > length with initial_length, which itself is initialized to length in
> > all cases except for MSG_SPLICE_PAGES.
>
> Okay. How about the attached? This seems to work. Just setting "paged" to
> true seems to do the right thing in __ip_append_data() when allocating /
> setting up the skbuff, and then __ip_splice_pages() is called to add the
> pages.

If this works, much preferred. Looks great to me.

As said, then __ip_splice_pages() probably no longer needs the
preamble to copy initial header bytes.

> David
> ---
> commit 9ac72c83407c8aef4be0c84513ec27bac9cfbcaa
> Author: David Howells <[email protected]>
> Date: Thu Mar 9 14:27:29 2023 +0000
>
> ip, udp: Support MSG_SPLICE_PAGES
>
> Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
> spliced from the source iterator.
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Willem de Bruijn <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 6109a86a8a4b..fe2e48874191 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -956,6 +956,41 @@ csum_page(struct page *page, int offset, int copy)
> return csum;
> }
>
> +/*
> + * Add (or copy) data pages for MSG_SPLICE_PAGES.
> + */
> +static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
> + void *from, int *pcopy)
> +{
> + struct msghdr *msg = from;
> + struct page *page = NULL, **pages = &page;
> + ssize_t copy = *pcopy;
> + size_t off;
> + int err;
> +
> + copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
> + if (copy <= 0)
> + return copy ?: -EIO;
> +
> + err = skb_append_pagefrags(skb, page, off, copy);
> + if (err < 0) {
> + iov_iter_revert(&msg->msg_iter, copy);
> + return err;
> + }
> +
> + if (skb->ip_summed == CHECKSUM_NONE) {
> + __wsum csum;
> +
> + csum = csum_page(page, off, copy);
> + skb->csum = csum_block_add(skb->csum, csum, skb->len);
> + }
> +
> + skb_len_add(skb, copy);
> + refcount_add(copy, &sk->sk_wmem_alloc);
> + *pcopy = copy;
> + return 0;
> +}
> +
> static int __ip_append_data(struct sock *sk,
> struct flowi4 *fl4,
> struct sk_buff_head *queue,
> @@ -1047,6 +1082,15 @@ static int __ip_append_data(struct sock *sk,
> skb_zcopy_set(skb, uarg, &extra_uref);
> }
> }
> + } else if ((flags & MSG_SPLICE_PAGES) && length) {
> + if (inet->hdrincl)
> + return -EPERM;
> + if (rt->dst.dev->features & NETIF_F_SG) {
> + /* We need an empty buffer to attach stuff to */
> + paged = true;
> + } else {
> + flags &= ~MSG_SPLICE_PAGES;
> + }
> }
>
> cork->length += length;
> @@ -1206,6 +1250,10 @@ static int __ip_append_data(struct sock *sk,
> err = -EFAULT;
> goto error;
> }
> + } else if (flags & MSG_SPLICE_PAGES) {
> + err = __ip_splice_pages(sk, skb, from, &copy);
> + if (err < 0)
> + goto error;
> } else if (!zc) {
> int i = skb_shinfo(skb)->nr_frags;
>
>


2023-04-04 17:19:20

by David Howells

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

Willem de Bruijn <[email protected]> wrote:

> > Okay. How about the attached? This seems to work. Just setting "paged" to
> > true seems to do the right thing in __ip_append_data() when allocating /
> > setting up the skbuff, and then __ip_splice_pages() is called to add the
> > pages.
>
> If this works, much preferred. Looks great to me.

:-)

> As said, then __ip_splice_pages() probably no longer needs the
> preamble to copy initial header bytes.

Sorry, what? It only attaches pages extracted from the iterator.

David

2023-04-04 17:56:19

by Willem de Bruijn

Subject: Re: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES

David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > > Okay. How about the attached? This seems to work. Just setting "paged" to
> > > true seems to do the right thing in __ip_append_data() when allocating /
> > > setting up the skbuff, and then __ip_splice_pages() is called to add the
> > > pages.
> >
> > If this works, much preferred. Looks great to me.
>
> :-)
>
> > As said, then __ip_splice_pages() probably no longer needs the
> > preamble to copy initial header bytes.
>
> Sorry, what? It only attaches pages extracted from the iterator.

Ehm indeed. Never mind.

2023-04-05 08:32:11

by David Howells

Subject: Re: [PATCH v3 38/55] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit

Bernard Metzler <[email protected]> wrote:

> > if (c_tx->state == SIW_SEND_HDR) {
> > if (c_tx->use_sendpage) {
> > @@ -457,10 +350,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx,
> > struct socket *s)
> >
>
> Couldn't we now collapse the two header handling paths
> into one, avoiding extra
> 'if (c_tx->use_sendpage) {} else {}' conditions?

Okay, see the attached incremental change.

Note that the calls to page_frag_memdup() I previously added are probably not
going to be necessary as copying unspliceable data is now done in the
protocols (TCP, IP/UDP, UNIX, etc.). See patch 08 for the TCP version.

David
---
diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 28076832da20..edf66a97cf5f 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -335,7 +335,7 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
struct bio_vec bvec[MAX_ARRAY];
struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
- void *trl, *t;
+ void *trl;

int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
@@ -343,25 +343,11 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
pbl_idx = c_tx->pbl_idx;

if (c_tx->state == SIW_SEND_HDR) {
- if (c_tx->use_sendpage) {
- rv = siw_tx_ctrl(c_tx, s, MSG_DONTWAIT | MSG_MORE);
- if (rv)
- goto done;
+ void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;

- c_tx->state = SIW_SEND_DATA;
- } else {
- const void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
- void *h;
-
- rv = -ENOMEM;
- hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
- h = page_frag_memdup(NULL, hdr, hdr_len, GFP_NOFS,
- ULONG_MAX);
- if (!h)
- goto done;
- bvec_set_virt(&bvec[0], h, hdr_len);
- seg = 1;
- }
+ hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+ bvec_set_virt(&bvec[0], hdr, hdr_len);
+ seg = 1;
}

wqe->processed += data_len;
@@ -466,12 +452,7 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
trl_len = MAX_TRAILER - c_tx->ctrl_sent;
}
-
- rv = -ENOMEM;
- t = page_frag_memdup(NULL, trl, trl_len, GFP_NOFS, ULONG_MAX);
- if (!t)
- goto done_crc;
- bvec_set_virt(&bvec[seg], t, trl_len);
+ bvec_set_virt(&bvec[seg], trl, trl_len);

data_len = c_tx->bytes_unsent;

@@ -480,7 +461,6 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
hdr_len + data_len + trl_len);
rv = sock_sendmsg(s, &msg);
-
if (rv < (int)hdr_len) {
/* Not even complete hdr pushed or negative rv */
wqe->processed -= data_len;
@@ -541,10 +521,6 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
}
done_crc:
c_tx->do_crc = 0;
- if (c_tx->state == SIW_SEND_HDR)
- folio_put(page_folio(bvec[0].bv_page));
- folio_put(page_folio(bvec[seg].bv_page));
-done:
return rv;
}
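
For readers less familiar with the pattern in the incremental change above, here is a rough userspace analogue, not siw code: header, payload and trailer are gathered into a single sendmsg() call instead of being sent piecewise. In the kernel patch the segments are bio_vecs set up with bvec_set_virt() and iov_iter_bvec(); this sketch uses plain iovecs, and the helper name send_hdt() and its parameters are assumptions made for illustration only.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Userspace analogue only: gather up to three segments (header, data,
 * trailer) into one sendmsg() call. send_hdt() is a hypothetical helper,
 * not part of siw or of the patch above.
 *
 * Returns the number of bytes queued, or -1 with errno set.
 */
static ssize_t send_hdt(int fd,
			const void *hdr, size_t hdr_len,
			const void *data, size_t data_len,
			const void *trl, size_t trl_len)
{
	struct iovec iov[3];
	struct msghdr msg = { 0 };
	int seg = 0;

	/* Only add segments that actually carry bytes. */
	if (hdr_len) {
		iov[seg].iov_base = (void *)hdr;
		iov[seg].iov_len = hdr_len;
		seg++;
	}
	if (data_len) {
		iov[seg].iov_base = (void *)data;
		iov[seg].iov_len = data_len;
		seg++;
	}
	if (trl_len) {
		iov[seg].iov_base = (void *)trl;
		iov[seg].iov_len = trl_len;
		seg++;
	}

	msg.msg_iov = iov;
	msg.msg_iovlen = seg;

	/* Non-blocking, and do not raise SIGPIPE if the peer has gone away. */
	return sendmsg(fd, &msg, MSG_DONTWAIT | MSG_NOSIGNAL);
}
```

As in the patch, a caller would compare the return value against hdr_len to tell whether even the header went out in full before accounting for the data bytes.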

2023-04-05 15:06:35

by Christoph Hellwig

Subject: Re: [PATCH v3 06/55] mm: Make the page_frag_cache allocator use per-cpu

On Fri, Mar 31, 2023 at 05:08:25PM +0100, David Howells wrote:
> Make the page_frag_cache allocator have a separate allocation bucket for
> each cpu to avoid racing. This means that no lock is required, other than
> preempt disablement, to allocate from it, though if a softirq wants to
> access it, then softirq disablement will need to be added.

Can you split this into a separate series? Right now I only see this
patch in my mbox and miss the context on why we'd want this.

> Make the NVMe and mediatek drivers pass in NULL to page_frag_cache() and
> use the default allocation buckets rather than defining their own.

Can you explain why?

2023-04-10 12:27:24

by Xiubo Li

Subject: Re: [PATCH v3 45/55] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()

David,

With these patches applied I can reproduce the hang every time. I haven't
had a chance to debug it yet; here are the logs:


<3>[  727.042312] INFO: task kworker/u20:4:78 blocked for more than 245 seconds.
<3>[  727.042381]       Tainted: G        W 6.3.0-rc3+ #9
<3>[  727.042417] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  727.042449] task:kworker/u20:4   state:D stack:0 pid:78    ppid:2      flags:0x00004000
<6>[  727.042496] Workqueue: writeback wb_workfn (flush-ceph-1)
<6>[  727.042558] Call Trace:
<6>[  727.042576]  <TASK>
<6>[  727.042630]  __schedule+0x4a8/0xa80
<6>[  727.042942]  ? __pfx___schedule+0x10/0x10
<6>[  727.042997]  ? __pfx___lock_release+0x10/0x10
<6>[  727.043118]  ? check_chain_key+0x205/0x2b0
<6>[  727.043299]  schedule+0x8e/0x120
<6>[  727.043372]  schedule_preempt_disabled+0x11/0x20
<6>[  727.043420]  __mutex_lock+0x97b/0x1270
<6>[  727.043532]  ? ceph_con_send+0xa4/0x310 [libceph]
<6>[  727.044348]  ? __pfx___mutex_lock+0x10/0x10
<6>[  727.044410]  ? encode_oloc+0x16e/0x1b0 [libceph]
<6>[  727.045473]  ? ceph_con_send+0xa4/0x310 [libceph]
<6>[  727.046298]  ceph_con_send+0xa4/0x310 [libceph]
<6>[  727.047200]  send_request+0x2b1/0x760 [libceph]
<6>[  727.048122]  ? __pfx___lock_release+0x10/0x10
<6>[  727.048186]  ? __lock_acquired+0x1ef/0x3d0
<6>[  727.048264]  ? __pfx_send_request+0x10/0x10 [libceph]
<6>[  727.049104]  ? check_chain_key+0x205/0x2b0
<6>[  727.049215]  ? link_request+0xcd/0x1a0 [libceph]
<6>[  727.050047]  ? do_raw_spin_unlock+0x99/0x100
<6>[  727.050159]  ? _raw_spin_unlock+0x1f/0x40
<6>[  727.050279]  __submit_request+0x2b7/0x4e0 [libceph]
<6>[  727.051259]  ceph_osdc_start_request+0x31/0x40 [libceph]
<6>[  727.052141]  ceph_writepages_start+0x1d13/0x2490 [ceph]
<6>[  727.053258]  ? __pfx_ceph_writepages_start+0x10/0x10 [ceph]
<6>[  727.053872]  ? validate_chain+0xa1/0x760
<6>[  727.054023]  ? __pfx_validate_chain+0x10/0x10
<6>[  727.054171]  ? check_chain_key+0x205/0x2b0
<6>[  727.054308]  ? __lock_acquire+0x7b3/0xfc0
<6>[  727.054726]  ? reacquire_held_locks+0x18b/0x290
<6>[  727.054790]  ? writeback_sb_inodes+0x263/0x7c0
<6>[  727.054990]  do_writepages+0x106/0x320
<6>[  727.055068]  ? __pfx_do_writepages+0x10/0x10
<6>[  727.055152]  ? __pfx___lock_release+0x10/0x10
<6>[  727.055205]  ? __lock_acquired+0x1ef/0x3d0
<6>[  727.055314]  ? check_chain_key+0x205/0x2b0
<6>[  727.055594]  __writeback_single_inode+0x95/0x450
<6>[  727.055725]  writeback_sb_inodes+0x392/0x7c0
<6>[  727.055912]  ? __pfx_writeback_sb_inodes+0x10/0x10
<6>[  727.055994]  ? __pfx_lock_acquire+0x10/0x10
<6>[  727.056045]  ? do_raw_spin_unlock+0x99/0x100
<6>[  727.056221]  ? __pfx_move_expired_inodes+0x10/0x10
<6>[  727.056541]  __writeback_inodes_wb+0x6a/0x130
<6>[  727.056674]  wb_writeback+0x45b/0x530
<6>[  727.056780]  ? __pfx_wb_writeback+0x10/0x10
<6>[  727.056910]  ? _find_next_bit+0x37/0xc0
<6>[  727.057102]  wb_do_writeback+0x434/0x4f0
<6>[  727.057216]  ? __pfx_wb_do_writeback+0x10/0x10
<6>[  727.057445]  ? __lock_acquire+0x7b3/0xfc0
<6>[  727.057614]  wb_workfn+0xe0/0x400
<6>[  727.057692]  ? __pfx_wb_workfn+0x10/0x10
<6>[  727.057752]  ? lock_acquire+0x15c/0x3e0
<6>[  727.057802]  ? process_one_work+0x436/0x990
<6>[  727.057903]  ? __pfx_lock_acquire+0x10/0x10
<6>[  727.058039]  ? check_chain_key+0x205/0x2b0
<6>[  727.058086]  ? __pfx_try_to_wake_up+0x10/0x10
<6>[  727.058138]  ? mark_held_locks+0x23/0x90
<6>[  727.058306]  process_one_work+0x505/0x990
<6>[  727.058610]  ? __pfx_process_one_work+0x10/0x10
<6>[  727.058731]  ? mark_held_locks+0x23/0x90
<6>[  727.058827]  ? worker_thread+0xce/0x670
<6>[  727.058958]  worker_thread+0x2dd/0x670
<6>[  727.059145]  ? __pfx_worker_thread+0x10/0x10
<6>[  727.059201]  kthread+0x16f/0x1a0
<6>[  727.059252]  ? __pfx_kthread+0x10/0x10
<6>[  727.059341]  ret_from_fork+0x2c/0x50
<6>[  727.059683]  </TASK>
<3>[  727.059964] INFO: task kworker/9:6:8586 blocked for more than 245 seconds.
<3>[  727.060031]       Tainted: G        W 6.3.0-rc3+ #9
<3>[  727.060086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  727.060140] task:kworker/9:6     state:D stack:0 pid:8586  ppid:2      flags:0x00004000
<6>[  727.060211] Workqueue: events handle_osds_timeout [libceph]
<6>[  727.060836] Call Trace:
<6>[  727.060892]  <TASK>
<6>[  727.061089]  __schedule+0x4a8/0xa80
<6>[  727.061243]  ? __pfx___schedule+0x10/0x10
<6>[  727.061488]  ? mark_held_locks+0x6b/0x90
<6>[  727.061581]  ? lockdep_hardirqs_on_prepare.part.0+0xea/0x1b0
<6>[  727.061696]  schedule+0x8e/0x120
<6>[  727.061767]  schedule_preempt_disabled+0x11/0x20
<6>[  727.061812]  rwsem_down_write_slowpath+0x2d6/0x840
<6>[  727.061944]  ? __pfx_rwsem_down_write_slowpath+0x10/0x10
<6>[  727.062240]  down_write+0x1bf/0x1d0
<6>[  727.062307]  ? __pfx_down_write+0x10/0x10
<6>[  727.062517]  ? check_chain_key+0x205/0x2b0
<6>[  727.062635]  handle_osds_timeout+0x6f/0x1b0 [libceph]
<6>[  727.063487]  process_one_work+0x505/0x990
<6>[  727.063657]  ? __pfx_process_one_work+0x10/0x10
<6>[  727.063789]  ? mark_held_locks+0x23/0x90
<6>[  727.063913]  ? worker_thread+0xce/0x670
<6>[  727.064024]  worker_thread+0x2dd/0x670
<6>[  727.064161]  ? __kthread_parkme+0xc9/0xe0
<6>[  727.064353]  ? __pfx_worker_thread+0x10/0x10
<6>[  727.064411]  kthread+0x16f/0x1a0
<6>[  727.064462]  ? __pfx_kthread+0x10/0x10
<6>[  727.064556]  ret_from_fork+0x2c/0x50
<6>[  727.064816]  </TASK>
<3>[  727.064853] INFO: task kworker/6:6:8587 blocked for more than 245 seconds.
<3>[  727.064917]       Tainted: G        W 6.3.0-rc3+ #9
<3>[  727.064970] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  727.065022] task:kworker/6:6     state:D stack:0 pid:8587  ppid:2      flags:0x00004000
<6>[  727.065081] Workqueue: events delayed_work [ceph]
<6>[  727.065693] Call Trace:
<6>[  727.065724]  <TASK>
<6>[  727.065833]  __schedule+0x4a8/0xa80
<6>[  727.065932]  ? __pfx___schedule+0x10/0x10
<6>[  727.065983]  ? rwsem_down_read_slowpath+0x294/0x940
<6>[  727.066112]  ? mark_lock.part.0+0xd7/0x6c0
<6>[  727.066292]  ? _raw_spin_unlock_irq+0x24/0x40
<6>[  727.066404]  ? rwsem_down_read_slowpath+0x294/0x940
<6>[  727.066460]  schedule+0x8e/0x120
<6>[  727.066529]  schedule_preempt_disabled+0x11/0x20
<6>[  727.066574]  rwsem_down_read_slowpath+0x486/0x940
<6>[  727.066686]  ? __pfx_rwsem_down_read_slowpath+0x10/0x10
<6>[  727.066885]  ? lock_acquire+0x15c/0x3e0
<6>[  727.066949]  ? find_held_lock+0x8c/0xa0
<6>[  727.066994]  ? kvm_clock_read+0x14/0x30
<6>[  727.067036]  ? kvm_sched_clock_read+0x5/0x20
<6>[  727.067269]  __down_read_common+0xad/0x310
<6>[  727.067344]  ? __pfx___down_read_common+0x10/0x10
<6>[  727.067450]  ? ceph_send_cap_releases+0xbe/0x6a0 [ceph]
<6>[  727.068071]  down_read+0x7a/0x90
<6>[  727.068140]  ceph_send_cap_releases+0xbe/0x6a0 [ceph]
<6>[  727.068869]  ? mark_held_locks+0x6b/0x90
<6>[  727.068990]  ? __pfx_ceph_send_cap_releases+0x10/0x10 [ceph]
<6>[  727.069733]  delayed_work+0x30b/0x310 [ceph]
<6>[  727.070445]  process_one_work+0x505/0x990
<6>[  727.070625]  ? __pfx_process_one_work+0x10/0x10
<6>[  727.070769]  ? mark_held_locks+0x23/0x90
<6>[  727.070838]  ? worker_thread+0xce/0x670
<6>[  727.070916]  worker_thread+0x2dd/0x670
<6>[  727.071016]  ? __kthread_parkme+0xc9/0xe0
<6>[  727.071076]  ? __pfx_worker_thread+0x10/0x10
<6>[  727.071115]  kthread+0x16f/0x1a0
<6>[  727.071150]  ? __pfx_kthread+0x10/0x10
<6>[  727.071368]  ret_from_fork+0x2c/0x50
<6>[  727.071532]  </TASK>
<3>[  727.071593] INFO: task kworker/9:18:9987 blocked for more than 245 seconds.
<3>[  727.071642]       Tainted: G        W 6.3.0-rc3+ #9
<3>[  727.071711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  727.071750] task:kworker/9:18    state:D stack:0 pid:9987  ppid:2      flags:0x00004000
<6>[  727.071797] Workqueue: events handle_timeout [libceph]
<6>[  727.072375] Call Trace:
<6>[  727.072397]  <TASK>
<6>[  727.072457]  __schedule+0x4a8/0xa80
<6>[  727.072536]  ? __pfx___schedule+0x10/0x10
<6>[  727.072659]  ? mark_lock.part.0+0xd7/0x6c0
<6>[  727.072711]  ? _raw_spin_unlock_irq+0x24/0x40
<6>[  727.072790]  schedule+0x8e/0x120
<6>[  727.072839]  schedule_preempt_disabled+0x11/0x20
<6>[  727.072871]  rwsem_down_write_slowpath+0x2d6/0x840
<6>[  727.072945]  ? __pfx_rwsem_down_write_slowpath+0x10/0x10
<6>[  727.073161]  down_write+0x1bf/0x1d0
<6>[  727.073424]  ? __pfx_down_write+0x10/0x10
<6>[  727.073551]  handle_timeout+0x12a/0x6f0 [libceph]
<6>[  727.074072]  ? __pfx_lock_acquire+0x10/0x10
<6>[  727.074188]  ? __pfx_handle_timeout+0x10/0x10 [libceph]
<6>[  727.075342]  process_one_work+0x505/0x990
<6>[  727.075473]  ? __pfx_process_one_work+0x10/0x10
<6>[  727.075558]  ? mark_held_locks+0x23/0x90
<6>[  727.075655]  ? worker_thread+0xce/0x670
<6>[  727.075716]  worker_thread+0x2dd/0x670
<6>[  727.075820]  ? __kthread_parkme+0xc9/0xe0
<6>[  727.075865]  ? __pfx_worker_thread+0x10/0x10
<6>[  727.075895]  kthread+0x16f/0x1a0
<6>[  727.075923]  ? __pfx_kthread+0x10/0x10
<6>[  727.076036]  ret_from_fork+0x2c/0x50
<6>[  727.076169]  </TASK>
<3>[  727.076201] INFO: task ffsb:13945 blocked for more than 245 seconds.
<3>[  727.076239]       Tainted: G        W 6.3.0-rc3+ #9
<3>[  727.076272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[  727.076302] task:ffsb            state:D stack:0 pid:13945 ppid:10657  flags:0x00000002
<6>[  727.076340] Call Trace:
<6>[  727.076357]  <TASK>
<6>[  727.076400]  __schedule+0x4a8/0xa80
<6>[  727.076459]  ? __pfx___schedule+0x10/0x10
<6>[  727.076489]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
<6>[  727.076629]  ? rwsem_down_read_slowpath+0x294/0x940
<6>[  727.076663]  schedule+0x8e/0x120
<6>[  727.076701]  schedule_preempt_disabled+0x11/0x20
<6>[  727.076726]  rwsem_down_read_slowpath+0x486/0x940
<6>[  727.076789]  ? __pfx_rwsem_down_read_slowpath+0x10/0x10
<6>[  727.076887]  ? lock_acquire+0x15c/0x3e0
<6>[  727.076921]  ? find_held_lock+0x8c/0xa0
<6>[  727.077014]  ? kvm_clock_read+0x14/0x30
<6>[  727.077047]  ? kvm_sched_clock_read+0x5/0x20
<6>[  727.077110]  __down_read_common+0xad/0x310
<6>[  727.077147]  ? __pfx___down_read_common+0x10/0x10
<6>[  727.077182]  ? __pfx_generic_write_checks+0x10/0x10
<6>[  727.077231]  ? ceph_write_iter+0x33c/0xad0 [ceph]
<6>[  727.077557]  down_read+0x7a/0x90
<6>[  727.077596]  ceph_write_iter+0x33c/0xad0 [ceph]
<6>[  727.078041]  ? __pfx_ceph_write_iter+0x10/0x10 [ceph]
<6>[  727.078324]  ? __pfx_lock_acquire+0x10/0x10
<6>[  727.078358]  ? __might_resched+0x213/0x300
<6>[  727.078413]  ? inode_security+0x6d/0x90
<6>[  727.078454]  ? selinux_file_permission+0x1d5/0x210
<6>[  727.078558]  vfs_write+0x567/0x750
<6>[  727.078608]  ? __pfx_vfs_write+0x10/0x10
<6>[  727.078786]  ksys_write+0xc9/0x170
<6>[  727.078824]  ? __pfx_ksys_write+0x10/0x10
<6>[  727.078857]  ? ktime_get_coarse_real_ts64+0x100/0x110
<6>[  727.078883]  ? ktime_get_coarse_real_ts64+0xa4/0x110
<6>[  727.079029]  do_syscall_64+0x5c/0x90
<6>[  727.079071]  ? do_syscall_64+0x69/0x90
<6>[  727.079105]  ? lockdep_hardirqs_on_prepare.part.0+0xea/0x1b0
<6>[  727.079151]  ? do_syscall_64+0x69/0x90
<6>[  727.079184]  ? lockdep_hardirqs_on_prepare.part.0+0xea/0x1b0
<6>[  727.079234]  ? do_syscall_64+0x69/0x90
<6>[  727.079257]  ? lockdep_hardirqs_on_prepare.part.0+0xea/0x1b0
<6>[  727.079303]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
<6>[  727.079336] RIP: 0033:0x7fa52913ebcf
<6>[  727.079361] RSP: 002b:00007fa522dfba30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
<6>[  727.079396] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fa52913ebcf
<6>[  727.079418] RDX: 0000000000001000 RSI: 00007fa51c001450 RDI: 0000000000000004
<6>[  727.079438] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000001cc3580
<6>[  727.079456] R10: 00000000000001c0 R11: 0000000000000293 R12: 0000000000000000
<6>[  727.079474] R13: 0000000001cc3580 R14: 0000000000001000 R15: 00007fa51c001450
<6>[  727.079614]  </TASK>
<6>[  727.079633] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
<4>[  727.079652]
<4>[  727.079652] Showing all locks held in the system:
<4>[  727.079671] 1 lock held by rcu_tasks_kthre/12:
<4>[  727.079691]  #0: ffffffff8bf1fc80 (rcu_tasks.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x32/0x280
<4>[  727.079788] 1 lock held by rcu_tasks_rude_/13:
<4>[  727.079805]  #0: ffffffff8bf1f9a0 (rcu_tasks_rude.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x32/0x280
<4>[  727.079896] 1 lock held by rcu_tasks_trace/14:
<4>[  727.079976]  #0: ffffffff8bf1f660 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x32/0x280
<4>[  727.080085] 6 locks held by kworker/u20:4/78:
<4>[  727.080105]  #0: ffff8884644d3148 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.080192]  #1: ffff888101027de8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.080282]  #2: ffff8881022740e8 (&type->s_umount_key#62){.+.+}-{4:4}, at: trylock_super+0x16/0x70
<4>[  727.080385]  #3: ffff88811e378d70 (&osdc->lock){++++}-{4:4}, at: ceph_osdc_start_request+0x17/0x40 [libceph]
<4>[  727.080760]  #4: ffff88810c116980 (&osd->lock){+.+.}-{4:4}, at: __submit_request+0xfa/0x4e0 [libceph]
<4>[  727.081139]  #5: ffff88810c116178 (&con->mutex){+.+.}-{4:4}, at: ceph_con_send+0xa4/0x310 [libceph]
<4>[  727.081458] 3 locks held by kworker/3:1/90:
<4>[  727.081517]  #0: ffff888130cced48 ((wq_completion)ceph-msgr){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.081593]  #1: ffff888101117de8 ((work_completion)(&(&con->work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.081668]  #2: ffff88810c116178 (&con->mutex){+.+.}-{4:4}, at: ceph_con_workfn+0x3c/0x5c0 [libceph]
<4>[  727.082041] 1 lock held by khungtaskd/96:
<4>[  727.082057]  #0: ffffffff8bf20840 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x29/0x230
<4>[  727.082149] 1 lock held by systemd-journal/687:
<4>[  727.082179] 5 locks held by in:imjournal/882:
<4>[  727.082270] 1 lock held by sshd/3493:
<4>[  727.082320] 3 locks held by kworker/3:6/8369:
<4>[  727.082337]  #0: ffff888130cced48 ((wq_completion)ceph-msgr){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.082413]  #1: ffff8881105cfde8 ((work_completion)(&(&con->work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.082502]  #2: ffff8881803ce098 (&s->s_mutex){+.+.}-{4:4}, at: send_mds_reconnect+0x13e/0x7c0 [ceph]
<4>[  727.082791] 3 locks held by kworker/9:6/8586:
<4>[  727.082807]  #0: ffff888100063548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.082927]  #1: ffff88812a767de8 ((work_completion)(&(&osdc->osds_timeout_work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.083002]  #2: ffff88811e378d70 (&osdc->lock){++++}-{4:4}, at: handle_osds_timeout+0x6f/0x1b0 [libceph]
<4>[  727.083336] 4 locks held by kworker/6:6/8587:
<4>[  727.083352]  #0: ffff888100063548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.083424]  #1: ffff8881248a7de8 ((work_completion)(&(&mdsc->delayed_work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.083509]  #2: ffff8881803ce098 (&s->s_mutex){+.+.}-{4:4}, at: delayed_work+0x1d8/0x310 [ceph]
<4>[  727.083794]  #3: ffff88811e378d70 (&osdc->lock){++++}-{4:4}, at: ceph_send_cap_releases+0xbe/0x6a0 [ceph]
<4>[  727.084151] 3 locks held by kworker/8:16/8663:
<4>[  727.084170]  #0: ffff888130cced48 ((wq_completion)ceph-msgr){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.084245]  #1: ffff88810aac7de8 ((work_completion)(&(&con->work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.084316]  #2: ffff888464998178 (&con->mutex){+.+.}-{4:4}, at: ceph_con_workfn+0x3c/0x5c0 [libceph]
<4>[  727.084684] 3 locks held by kworker/9:18/9987:
<4>[  727.084701]  #0: ffff888100063548 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.084776]  #1: ffff888112b17de8 ((work_completion)(&(&osdc->timeout_work)->work)){+.+.}-{0:0}, at: process_one_work+0x436/0x990
<4>[  727.084938]  #2: ffff88811e378d70 (&osdc->lock){++++}-{4:4}, at: handle_timeout+0x12a/0x6f0 [libceph]
<4>[  727.085274] 2 locks held by less/11897:
<4>[  727.085291]  #0: ffff8881138800a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x70
<4>[  727.085370]  #1: ffffc900019122f8 (&ldata->atomic_read_lock){+.+.}-{4:4}, at: n_tty_read+0x849/0xa70
<4>[  727.085469] 3 locks held by cat/13283:
<4>[  727.085484] 4 locks held by ffsb/13945:
<4>[  727.085497]  #0: ffff88812b95ba78 (&f->f_pos_lock){+.+.}-{4:4}, at: __fdget_pos+0x75/0x80
<4>[  727.085558]  #1: ffff888102274480 (sb_writers#12){.+.+}-{0:0}, at: ksys_write+0xc9/0x170
<4>[  727.085654]  #2: ffff88812aeca1b8 (&sb->s_type->i_mutex_key#15){++++}-{4:4}, at: ceph_start_io_write+0x15/0x30 [ceph]
<4>[  727.085963]  #3: ffff88811e378d70 (&osdc->lock){++++}-{4:4}, at: ceph_write_iter+0x33c/0xad0 [ceph]
<4>[  727.086198] 5 locks held by kworker/3:0/13962:
<4>[  727.086213]

Thanks

- Xiubo


On 4/1/23 00:09, David Howells wrote:
> Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
> transmitting data. For the moment, this can only transmit one page at a
> time because of the architecture of net/ceph/, but if
> write_partial_message_data() can be given a bvec[] at a time by the
> iteration code, this would allow pages to be sent in a batch.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Ilya Dryomov <[email protected]>
> cc: Xiubo Li <[email protected]>
> cc: Jeff Layton <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
> net/ceph/messenger_v2.c | 89 +++++++++--------------------------------
> 1 file changed, 18 insertions(+), 71 deletions(-)
>
> diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
> index 301a991dc6a6..1637a0c21126 100644
> --- a/net/ceph/messenger_v2.c
> +++ b/net/ceph/messenger_v2.c
> @@ -117,91 +117,38 @@ static int ceph_tcp_recv(struct ceph_connection *con)
> return ret;
> }
>
> -static int do_sendmsg(struct socket *sock, struct iov_iter *it)
> -{
> - struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
> - int ret;
> -
> - msg.msg_iter = *it;
> - while (iov_iter_count(it)) {
> - ret = sock_sendmsg(sock, &msg);
> - if (ret <= 0) {
> - if (ret == -EAGAIN)
> - ret = 0;
> - return ret;
> - }
> -
> - iov_iter_advance(it, ret);
> - }
> -
> - WARN_ON(msg_data_left(&msg));
> - return 1;
> -}
> -
> -static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
> -{
> - struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
> - struct bio_vec bv;
> - int ret;
> -
> - if (WARN_ON(!iov_iter_is_bvec(it)))
> - return -EINVAL;
> -
> - while (iov_iter_count(it)) {
> - /* iov_iter_iovec() for ITER_BVEC */
> - bvec_set_page(&bv, it->bvec->bv_page,
> - min(iov_iter_count(it),
> - it->bvec->bv_len - it->iov_offset),
> - it->bvec->bv_offset + it->iov_offset);
> -
> - /*
> - * sendpage cannot properly handle pages with
> - * page_count == 0, we need to fall back to sendmsg if
> - * that's the case.
> - *
> - * Same goes for slab pages: skb_can_coalesce() allows
> - * coalescing neighboring slab objects into a single frag
> - * which triggers one of hardened usercopy checks.
> - */
> - if (sendpage_ok(bv.bv_page)) {
> - ret = sock->ops->sendpage(sock, bv.bv_page,
> - bv.bv_offset, bv.bv_len,
> - CEPH_MSG_FLAGS);
> - } else {
> - iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, bv.bv_len);
> - ret = sock_sendmsg(sock, &msg);
> - }
> - if (ret <= 0) {
> - if (ret == -EAGAIN)
> - ret = 0;
> - return ret;
> - }
> -
> - iov_iter_advance(it, ret);
> - }
> -
> - return 1;
> -}
> -
> /*
> * Write as much as possible. The socket is expected to be corked,
> * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
> *
> * Return:
> - * 1 - done, nothing (else) to write
> + * >0 - done, nothing (else) to write
> * 0 - socket is full, need to wait
> * <0 - error
> */
> static int ceph_tcp_send(struct ceph_connection *con)
> {
> + struct msghdr msg = {
> + .msg_iter = con->v2.out_iter,
> + .msg_flags = CEPH_MSG_FLAGS,
> + };
> int ret;
>
> + if (WARN_ON(!iov_iter_is_bvec(&con->v2.out_iter)))
> + return -EINVAL;
> +
> + if (con->v2.out_iter_sendpage)
> + msg.msg_flags |= MSG_SPLICE_PAGES;
> +
> dout("%s con %p have %zu try_sendpage %d\n", __func__, con,
> iov_iter_count(&con->v2.out_iter), con->v2.out_iter_sendpage);
> - if (con->v2.out_iter_sendpage)
> - ret = do_try_sendpage(con->sock, &con->v2.out_iter);
> - else
> - ret = do_sendmsg(con->sock, &con->v2.out_iter);
> +
> + ret = sock_sendmsg(con->sock, &msg);
> + if (ret > 0)
> + iov_iter_advance(&con->v2.out_iter, ret);
> + else if (ret == -EAGAIN)
> + ret = 0;
> +
> dout("%s con %p ret %d left %zu\n", __func__, con, ret,
> iov_iter_count(&con->v2.out_iter));
> return ret;
>
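
The comment in the quoted patch documents a three-way return convention for ceph_tcp_send(): positive when there is nothing else to write, 0 when the socket is full, negative on error. As a reading aid, here is a minimal userspace sketch of that convention, modelled on the looping shape of the removed do_sendmsg() helper. The name try_send_all() and its parameters are hypothetical, and this is not offered as a diagnosis of the hang reported above.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Userspace model of the documented return convention:
 *    1 - done, nothing (else) to write
 *    0 - socket is full, need to wait
 *   <0 - error (negative errno)
 *
 * try_send_all() is a hypothetical helper; *done tracks progress across
 * calls, playing the role that the advancing iterator plays in the kernel.
 */
static int try_send_all(int fd, const char *buf, size_t len, size_t *done)
{
	while (*done < len) {
		ssize_t n = send(fd, buf + *done, len - *done,
				 MSG_DONTWAIT | MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return 0;	/* caller waits, then retries */
			return -errno;
		}
		if (n == 0)
			return 0;	/* no progress; treat as "full" */

		*done += (size_t)n;
	}

	return 1;
}
```

The point of the loop is that a positive return is only produced once the whole buffer has been handed to the socket; a single short write by itself does not count as "done".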

2023-04-24 17:34:37

by Fabio M. De Francesco

Subject: Re: [PATCH v3 41/55] iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()

On Friday, 31 March 2023 18:09:00 CEST David Howells wrote:
> As iscsi is now using sendmsg() with MSG_SPLICE_PAGES rather than sendpage,
> assume that sendpage_ok() will return true in iscsi_tcp_segment_map() and
> leave it to TCP to copy the data if not.
>
> Signed-off-by: David Howells <[email protected]>
> cc: "Martin K. Petersen" <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> cc: [email protected]
> ---
> drivers/scsi/libiscsi_tcp.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/scsi/libiscsi_tcp.c b/drivers/scsi/libiscsi_tcp.c
> index c182aa83f2c9..07ba0d864820 100644
> --- a/drivers/scsi/libiscsi_tcp.c
> +++ b/drivers/scsi/libiscsi_tcp.c
> @@ -128,18 +128,11 @@ static void iscsi_tcp_segment_map(struct iscsi_segment *segment, int recv)
> * coalescing neighboring slab objects into a single frag which
> * triggers one of hardened usercopy checks.
> */
> - if (!recv && sendpage_ok(sg_page(sg)))
> + if (!recv)
> return;
>
> - if (recv) {
> - segment->atomic_mapped = true;
> - segment->sg_mapped = kmap_atomic(sg_page(sg));
> - } else {
> - segment->atomic_mapped = false;
> - /* the xmit path can sleep with the page mapped so use kmap */
> - segment->sg_mapped = kmap(sg_page(sg));
> - }
> -
> + segment->atomic_mapped = true;
> + segment->sg_mapped = kmap_atomic(sg_page(sg));

As you probably know, kmap_atomic() is deprecated.

I must admit that I'm not an expert in this code; however, it looks like the
mapping has no need to rely on the side effects of kmap_atomic() (i.e.,
pagefault_disable() and preempt_disable() - but I'm not entirely sure about
the possibility that preemption should be explicitly disabled along with the
replacement with kmap_local_page()).

Last year I worked on several conversions from kmap{,_atomic}() to
kmap_local_page(); however, I'm still not sure I understand what's happening
here...

Am I missing any important details? Can you please explain why we still need
that kmap_atomic() instead of kmap_local_page()?

Thanks in advance,

Fabio

> segment->data = segment->sg_mapped + sg->offset + segment->sg_offset;
> }




2023-04-25 08:47:53

by David Howells

Subject: Re: [PATCH v3 41/55] iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()

Fabio M. De Francesco <[email protected]> wrote:

> > - if (recv) {
> > - segment->atomic_mapped = true;
> > - segment->sg_mapped = kmap_atomic(sg_page(sg));
> > - } else {
> > - segment->atomic_mapped = false;
> > - /* the xmit path can sleep with the page mapped so use kmap */
> > - segment->sg_mapped = kmap(sg_page(sg));
> > - }
> > -
> > + segment->atomic_mapped = true;
> > + segment->sg_mapped = kmap_atomic(sg_page(sg));
>
> As you probably know, kmap_atomic() is deprecated.
>
> I must admit that I'm not an expert of this code, however, it looks like the
> mapping has no need to rely on the side effects of kmap_atomic() (i.e.,
> pagefault_disable() and preempt_disable() - but I'm not entirely sure about
> the possibility that preemption should be explicitly disabled along with the
> replacement with kmap_local_page()).
>
> Last year I've been working on several conversions from kmap{,_atomic}() to
> kmap_local_page(), however I'm still not sure to understand what's happening
> here...
>
> Am I missing any important details? Can you please explain why we still need
> that kmap_atomic() instead of kmap_local_page()?

Actually, it might be worth dropping segment->sg_mapped and segment->data and
only doing the kmap_local when necessary.

And this:

struct msghdr msg = { .msg_flags = flags };
struct kvec iov = {
.iov_base = segment->data + offset,
.iov_len = copy
};

r = kernel_sendmsg(sk, &msg, &iov, 1, copy);

should really be using struct bvec, not struct kvec - then the mapping isn't
necessary. It looks like this might be the only place the mapping is used,
but I'm not 100% certain.

David
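
To illustrate the distinction drawn above between describing data by mapped address (kvec) and by page plus offset (bvec), here is a small userspace model. It is only a sketch: page_model, kvec_model, bvec_model and the two helpers are hypothetical stand-ins, not the kernel definitions, though the bvec fields and the (page, len, offset) argument order follow the bvec_set_page() call quoted earlier in this thread.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a page of memory; not the kernel's struct page. */
struct page_model {
	unsigned char bytes[MODEL_PAGE_SIZE];
};

/* Shape of a kvec: the sender must already hold a mapped virtual address. */
struct kvec_model {
	void *iov_base;
	size_t iov_len;
};

/* Shape of a bio_vec: page, length and offset; no mapping is needed up front. */
struct bvec_model {
	struct page_model *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Same argument order as the bvec_set_page() call quoted in this thread. */
static void bvec_model_set_page(struct bvec_model *bv, struct page_model *page,
				unsigned int len, unsigned int offset)
{
	bv->bv_page = page;
	bv->bv_len = len;
	bv->bv_offset = offset;
}

/*
 * The consumer resolves the page to bytes only at the point where it
 * actually needs them (in the kernel, that is where a short-lived local
 * mapping would happen, if one is needed at all).
 */
static size_t bvec_model_copy_out(const struct bvec_model *bv,
				  void *dst, size_t dst_len)
{
	size_t n = bv->bv_len < dst_len ? bv->bv_len : dst_len;

	memcpy(dst, bv->bv_page->bytes + bv->bv_offset, n);
	return n;
}
```

With the bvec shape, the producer never has to keep a mapping alive across the send, which is the reason the long-lived segment->sg_mapped pointer might become unnecessary.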

2023-04-25 13:17:31

by Fabio M. De Francesco

Subject: Re: [PATCH v3 41/55] iscsi: Assume "sendpage" is okay in iscsi_tcp_segment_map()

On Tuesday, 25 April 2023 10:30:30 CEST David Howells wrote:
> Fabio M. De Francesco <[email protected]> wrote:
> > > - if (recv) {
> > > - segment->atomic_mapped = true;
> > > - segment->sg_mapped = kmap_atomic(sg_page(sg));
> > > - } else {
> > > - segment->atomic_mapped = false;
> > > - /* the xmit path can sleep with the page mapped so use kmap */
> > > - segment->sg_mapped = kmap(sg_page(sg));
> > > - }
> > > -
> > > + segment->atomic_mapped = true;
> > > + segment->sg_mapped = kmap_atomic(sg_page(sg));
> >
> > As you probably know, kmap_atomic() is deprecated.
> >
> > I must admit that I'm not an expert of this code, however, it looks like the
> > mapping has no need to rely on the side effects of kmap_atomic() (i.e.,
> > pagefault_disable() and preempt_disable() - but I'm not entirely sure about
> > the possibility that preemption should be explicitly disabled along with the
> > replacement with kmap_local_page()).
> >
> > Last year I've been working on several conversions from kmap{,_atomic}() to
> > kmap_local_page(), however I'm still not sure to understand what's happening
> > here...
> >
> > Am I missing any important details? Can you please explain why we still need
> > that kmap_atomic() instead of kmap_local_page()?
>
> Actually, it might be worth dropping segment->sg_mapped and segment->data and
> only doing the kmap_local when necessary.
>
> And this:
>
> struct msghdr msg = { .msg_flags = flags };
> struct kvec iov = {
> .iov_base = segment->data + offset,
> .iov_len = copy
> };
>
> r = kernel_sendmsg(sk, &msg, &iov, 1, copy);
>
> should really be using struct bvec, not struct kvec - then the mapping isn't
> necessary.

FWIW, struct bvec looks better suited (although I have very little knowledge
of this code).

I assume that you noticed that we also have the unmapping counterpart
(iscsi_tcp_segment_unmap()) which should also be addressed accordingly.

> It looks like this might be the only place the mapping is used,
> but I'm not 100% certain.

It seems that kmap_atomic() (as well as kmap(), which you deleted) is only
called by iscsi_tcp_segment_map(), which in turn is called only by
iscsi_tcp_segment_done(). I can't see any other places where the mapping is
used.
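
For anyone following the kmap_atomic() versus kmap_local_page() question
above, the following is a toy userspace model (not kernel code) of the side
effects under discussion. All model_* names are invented for illustration;
the only behaviour it encodes is the one mentioned earlier in this thread,
i.e. that kmap_atomic() implies pagefault_disable() and preempt_disable()
while kmap_local_page() does not.

```c
/* Userspace model only: hypothetical counters, not the kernel API. */
#include <assert.h>

struct model_task {
	int preempt_disabled;	/* nesting count, models preempt_disable() */
	int pagefault_disabled;	/* nesting count, models pagefault_disable() */
	int local_maps;		/* depth of currently held mappings */
};

/* Models kmap_atomic(): takes a mapping plus both side effects. */
static void model_kmap_atomic(struct model_task *t)
{
	t->preempt_disabled++;
	t->pagefault_disabled++;
	t->local_maps++;
}

/* Models kunmap_atomic(): drops the mapping and undoes the side effects. */
static void model_kunmap_atomic(struct model_task *t)
{
	assert(t->local_maps > 0);
	t->local_maps--;
	t->pagefault_disabled--;
	t->preempt_disabled--;
}

/* Models kmap_local_page(): takes a mapping with no extra side effects. */
static void model_kmap_local_page(struct model_task *t)
{
	t->local_maps++;
}

/* Models kunmap_local(): drops the most recent mapping. */
static void model_kunmap_local(struct model_task *t)
{
	assert(t->local_maps > 0);
	t->local_maps--;
}

/* True when neither preemption nor page faults are disabled. */
static int model_can_fault_or_preempt(const struct model_task *t)
{
	return t->preempt_disabled == 0 && t->pagefault_disabled == 0;
}
```

If the mapped region really does not depend on those side effects, the
model suggests the two calls are interchangeable from the caller's point of
view; whether that holds for iscsi_tcp_segment_map() is exactly the open
question.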

I hope this discussion helps you choose the best way to get rid of the
deprecated kmap_atomic().

Thanks for taking time to address questions from newcomers :-)

Fabio

>
> David




2023-04-26 13:15:00

by D. Wythe

Subject: Re: [PATCH v3 51/55] smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES


Hi David,

Fallback is one of the most important features of SMC: it automatically
downgrades to TCP when SMC discovers that the peer does not support SMC.
After fallback, SMC expects the socket's capabilities to be consistent with
those of a TCP socket. If you delete smc_sendpage(), then when fallback
occurs, the socket loses the sendpage capability (tcp_sendpage()).
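
As a small illustration of the fallback dispatch described here, the sketch
below is a userspace model (not kernel code) that mirrors the shape of the
smc_sendpage() function quoted in the patch below: refuse unless the socket
is active, then route the data either to the TCP side or to the SMC side
depending on a fallback flag. The model_* names and the byte counters are
invented for illustration.

```c
/* Userspace model only: hypothetical types, not the SMC implementation. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_state { MODEL_INIT, MODEL_ACTIVE };

struct model_smc_sock {
	enum model_state state;
	int use_fallback;	/* non-zero once the socket fell back to TCP */
	size_t tcp_bytes;	/* bytes handed to the underlying TCP socket */
	size_t smc_bytes;	/* bytes handed to the SMC transmit path */
};

/* Send len bytes, choosing the path the way the quoted code does. */
static long model_send(struct model_smc_sock *smc, size_t len)
{
	if (smc->state != MODEL_ACTIVE)
		return -EPIPE;

	if (smc->use_fallback)
		smc->tcp_bytes += len;	/* passed down to the TCP socket */
	else
		smc->smc_bytes += len;	/* handled by SMC itself */

	return (long)len;
}
```

The point of the model is only that the choice between the two paths is made
per call from the fallback flag, so a single send entry point can serve both
cases.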

Thanks
D. Wythe

On 4/1/23 12:09 AM, David Howells wrote:
> Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
> the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
> its sendmsg implementation.
>
> Signed-off-by: David Howells <[email protected]>
> cc: Karsten Graul <[email protected]>
> cc: Wenjia Zhang <[email protected]>
> cc: Jan Karcher <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
> net/smc/af_smc.c | 29 -----------------------------
> net/smc/smc_stats.c | 2 +-
> net/smc/smc_stats.h | 1 -
> net/smc/smc_tx.c | 16 ----------------
> net/smc/smc_tx.h | 2 --
> 5 files changed, 1 insertion(+), 49 deletions(-)
>
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index a4cccdfdc00a..d4113c8a7cda 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -3125,34 +3125,6 @@ static int smc_ioctl(struct socket *sock, unsigned int cmd,
>  	return put_user(answ, (int __user *)arg);
>  }
>
> -static ssize_t smc_sendpage(struct socket *sock, struct page *page,
> -			    int offset, size_t size, int flags)
> -{
> -	struct sock *sk = sock->sk;
> -	struct smc_sock *smc;
> -	int rc = -EPIPE;
> -
> -	smc = smc_sk(sk);
> -	lock_sock(sk);
> -	if (sk->sk_state != SMC_ACTIVE) {
> -		release_sock(sk);
> -		goto out;
> -	}
> -	release_sock(sk);
> -	if (smc->use_fallback) {
> -		rc = kernel_sendpage(smc->clcsock, page, offset,
> -				     size, flags);
> -	} else {
> -		lock_sock(sk);
> -		rc = smc_tx_sendpage(smc, page, offset, size, flags);
> -		release_sock(sk);
> -		SMC_STAT_INC(smc, sendpage_cnt);
> -	}
> -
> -out:
> -	return rc;
> -}
> -
>  /* Map the affected portions of the rmbe into an spd, note the number of bytes
>   * to splice in conn->splice_pending, and press 'go'. Delays consumer cursor
>   * updates till whenever a respective page has been fully processed.
> @@ -3224,7 +3196,6 @@ static const struct proto_ops smc_sock_ops = {
>  	.sendmsg	= smc_sendmsg,
>  	.recvmsg	= smc_recvmsg,
>  	.mmap		= sock_no_mmap,
> -	.sendpage	= smc_sendpage,
>  	.splice_read	= smc_splice_read,
>  };
>
> diff --git a/net/smc/smc_stats.c b/net/smc/smc_stats.c
> index e80e34f7ac15..ca14c0f3a07d 100644
> --- a/net/smc/smc_stats.c
> +++ b/net/smc/smc_stats.c
> @@ -227,7 +227,7 @@ static int smc_nl_fill_stats_tech_data(struct sk_buff *skb,
>  			      SMC_NLA_STATS_PAD))
>  		goto errattr;
>  	if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SENDPAGE_CNT,
> -			      smc_tech->sendpage_cnt,
> +			      0,
>  			      SMC_NLA_STATS_PAD))
>  		goto errattr;
>  	if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CORK_CNT,
> diff --git a/net/smc/smc_stats.h b/net/smc/smc_stats.h
> index 84b7ecd8c05c..b60fe1eb37ab 100644
> --- a/net/smc/smc_stats.h
> +++ b/net/smc/smc_stats.h
> @@ -71,7 +71,6 @@ struct smc_stats_tech {
>  	u64			clnt_v2_succ_cnt;
>  	u64			srv_v1_succ_cnt;
>  	u64			srv_v2_succ_cnt;
> -	u64			sendpage_cnt;
>  	u64			urg_data_cnt;
>  	u64			splice_cnt;
>  	u64			cork_cnt;
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index f4b6a71ac488..d31ce8209fa2 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -298,22 +298,6 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
>  	return rc;
>  }
>
> -int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
> -		    size_t size, int flags)
> -{
> -	struct msghdr msg = {.msg_flags = flags};
> -	char *kaddr = kmap(page);
> -	struct kvec iov;
> -	int rc;
> -
> -	iov.iov_base = kaddr + offset;
> -	iov.iov_len = size;
> -	iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &iov, 1, size);
> -	rc = smc_tx_sendmsg(smc, &msg, size);
> -	kunmap(page);
> -	return rc;
> -}
> -
>  /***************************** sndbuf consumer *******************************/
>
>  /* sndbuf consumer: actual data transfer of one target chunk with ISM write */
> diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
> index 34b578498b1f..a59f370b8b43 100644
> --- a/net/smc/smc_tx.h
> +++ b/net/smc/smc_tx.h
> @@ -31,8 +31,6 @@ void smc_tx_pending(struct smc_connection *conn);
>  void smc_tx_work(struct work_struct *work);
>  void smc_tx_init(struct smc_sock *smc);
>  int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
> -int smc_tx_sendpage(struct smc_sock *smc, struct page *page, int offset,
> -		    size_t size, int flags);
>  int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
>  void smc_tx_sndbuf_nonfull(struct smc_sock *smc);
>  void smc_tx_consumer_update(struct smc_connection *conn, bool force);

2023-04-28 03:13:13

by D. Wythe

Subject: Re: [PATCH v3 51/55] smc: Drop smc_sendpage() in favour of smc_sendmsg() + MSG_SPLICE_PAGES



On 4/26/23 9:07 PM, D. Wythe wrote:
>
> Hi David,
>
> Fallback is one of the most important features of SMC: it automatically
> downgrades to TCP when SMC discovers that the peer does not support SMC.
> After fallback, SMC expects the socket's capabilities to be consistent with
> those of a TCP socket. If you delete smc_sendpage(), then when fallback
> occurs, the socket loses the sendpage capability (tcp_sendpage()).
>
> Thanks
> D. Wythe

Sorry, I missed the key context in the email thread. The problem I mentioned
here does not exist.
