2023-06-07 14:56:05

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 00/14] splice, net: Rewrite splice-to-socket, fix SPLICE_F_MORE and handle MSG_SPLICE_PAGES in AF_TLS

Here are patches to do the following:

(1) Block MSG_SENDPAGE_* flags from leaking into ->sendmsg() from
userspace, whilst allowing splice_to_socket() to pass them in.

(2) Allow MSG_SPLICE_PAGES to be passed into tls_*_sendmsg(). Until
support is added, it will be ignored and a splice-driven sendmsg()
will be treated like a normal sendmsg(). TCP, UDP, AF_UNIX and
Chelsio-TLS already handle the flag in net-next.

(3) Replace a chain of functions to splice-to-sendpage with a single
function to splice via sendmsg() with MSG_SPLICE_PAGES. This allows a
bunch of pages to be spliced from a pipe in a single call using a
bio_vec[] and pushes the main processing loop down into the bowels of
the protocol driver rather than repeatedly calling in with a page at a
time.

(4) Provide a ->splice_eof() op[2] that allows splice to signal to its
output that the input observed a premature EOF and that the caller
didn't flag SPLICE_F_MORE, thereby allowing a corked socket to be
flushed. This attempts to maintain the current behaviour. It is also
not called if we didn't manage to read any data and so didn't called
the actor function.

This needs routing though several layers to get it down to the network
protocol.

[!] Note that I chose not to pass in any flags - I'm not sure it's
particularly useful to pass in the splice flags; I also elected
not to return any error code - though we might actually want to do
that.

(5) Provide tls_{device,sw}_splice_eof() to flush a pending TLS record if
there is one.

(6) Provide splice_eof() for UDP, TCP, Chelsio-TLS and AF_KCM. AF_UNIX
doesn't seem to pay attention to the MSG_MORE or MSG_SENDPAGE_NOTLAST
flags.

(7) Alter the behaviour of sendfile() and fix SPLICE_F_MORE/MSG_MORE
signalling[1] such SPLICE_F_MORE is always signalled until we have
read sufficient data to finish the request. If we get a zero-length
before we've managed to splice sufficient data, we now leave the
socket expecting more data and leave it to userspace to deal with it.

(8) Make AF_TLS handle the MSG_SPLICE_PAGES internal sendmsg flag.
MSG_SPLICE_PAGES is an internal hint that tells the protocol that it
should splice the pages supplied if it can. Its sendpage
implementations are then turned into wrappers around that.


I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-2-tls

David

Changes
=======
ver #5)
- In splice_to_socket(), preclear ret in case len == 0.
- Provide ->splice_eof() for UDP, TCP, Chelsio-TLS and AF_KCM.

ver #4)
- Switch to using ->splice_eof() to signal premature EOF to the splice
output[2].

ver #3)
- Include the splice-to-socket rewrite patch.
- Fix SPLICE_F_MORE/MSG_MORE signalling.
- Allow AF_TLS to accept sendmsg() with MSG_SPLICE_PAGES before it is
handled.
- Allow a zero-length send() to a TLS socket to flush an outstanding
record.
- Address TLS kselftest failure.

ver #2)
- Dropped the slab data copying.
- "rls_" should be "tls_".
- Attempted to fix splice_direct_to_actor().
- Blocked MSG_SENDPAGE_* from being set by userspace.

Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ [2]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1
Link: https://lore.kernel.org/r/[email protected]/ # v1

David Howells (14):
net: Block MSG_SENDPAGE_* from being passed to sendmsg() by userspace
tls: Allow MSG_SPLICE_PAGES but treat it as normal sendmsg
splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
splice, net: Add a splice_eof op to file-ops and socket-ops
tls/sw: Use splice_eof() to flush
tls/device: Use splice_eof() to flush
ipv4, ipv6: Use splice_eof() to flush
chelsio/chtls: Use splice_eof() to flush
kcm: Use splice_eof() to flush
splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor()
tls/sw: Support MSG_SPLICE_PAGES
tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
tls/device: Support MSG_SPLICE_PAGES
tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

.../chelsio/inline_crypto/chtls/chtls.h | 1 +
.../chelsio/inline_crypto/chtls/chtls_io.c | 9 +
.../chelsio/inline_crypto/chtls/chtls_main.c | 1 +
fs/splice.c | 207 +++++++++++---
include/linux/fs.h | 3 +-
include/linux/net.h | 1 +
include/linux/socket.h | 4 +-
include/linux/splice.h | 3 +
include/net/inet_common.h | 1 +
include/net/sock.h | 1 +
include/net/tcp.h | 1 +
include/net/udp.h | 1 +
net/ipv4/af_inet.c | 18 ++
net/ipv4/tcp.c | 16 ++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/udp.c | 16 ++
net/ipv6/af_inet6.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/ipv6/udp.c | 18 ++
net/kcm/kcmsock.c | 15 ++
net/socket.c | 36 +--
net/tls/tls.h | 2 +
net/tls/tls_device.c | 110 ++++----
net/tls/tls_main.c | 4 +
net/tls/tls_sw.c | 253 ++++++++++--------
25 files changed, 485 insertions(+), 239 deletions(-)



2023-06-07 14:56:14

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 04/14] splice, net: Add a splice_eof op to file-ops and socket-ops

Add an optional method, ->splice_eof(), to allow splice to indicate the
premature termination of a splice to struct file_operations and struct
proto_ops.

This is called if sendfile() or splice() encounters all of the following
conditions inside splice_direct_to_actor():

(1) the user did not set SPLICE_F_MORE (splice only), and

(2) an EOF condition occurred (->splice_read() returned 0), and

(3) we haven't read enough to fulfill the request (ie. len > 0 still), and

(4) we have already spliced at least one byte.

A further patch will modify the behaviour of SPLICE_F_MORE to always be
passed to the actor if either the user set it or we haven't yet read
sufficient data to fulfill the request.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Al Viro <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: Jan Kara <[email protected]>
cc: Jeff Layton <[email protected]>
cc: David Hildenbrand <[email protected]>
cc: Christian Brauner <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
fs/splice.c | 31 ++++++++++++++++++++++++++++++-
include/linux/fs.h | 1 +
include/linux/net.h | 1 +
include/linux/splice.h | 1 +
include/net/sock.h | 1 +
net/socket.c | 10 ++++++++++
6 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index e337630aed64..67dbd85db207 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -969,6 +969,17 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
return out->f_op->splice_write(pipe, out, ppos, len, flags);
}

+/*
+ * Indicate to the caller that there was a premature EOF when reading from the
+ * source and the caller didn't indicate they would be sending more data after
+ * this.
+ */
+static void do_splice_eof(struct splice_desc *sd)
+{
+ if (sd->splice_eof)
+ sd->splice_eof(sd);
+}
+
/*
* Attempt to initiate a splice from a file to a pipe.
*/
@@ -1068,7 +1079,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,

ret = do_splice_to(in, &pos, pipe, len, flags);
if (unlikely(ret <= 0))
- goto out_release;
+ goto read_failure;

read_len = ret;
sd->total_len = read_len;
@@ -1108,6 +1119,15 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
file_accessed(in);
return bytes;

+read_failure:
+ /*
+ * If the user did *not* set SPLICE_F_MORE *and* we didn't hit that
+ * "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
+ * "->splice_in()" that returned EOF (ie zero) *and* we have sent at
+ * least 1 byte *then* we will also do the ->splice_eof() call.
+ */
+ if (ret == 0 && !more && len > 0 && bytes)
+ do_splice_eof(sd);
out_release:
/*
* If we did an incomplete transfer we must release
@@ -1136,6 +1156,14 @@ static int direct_splice_actor(struct pipe_inode_info *pipe,
sd->flags);
}

+static void direct_file_splice_eof(struct splice_desc *sd)
+{
+ struct file *file = sd->u.file;
+
+ if (file->f_op->splice_eof)
+ file->f_op->splice_eof(file);
+}
+
/**
* do_splice_direct - splices data directly between two files
* @in: file to splice from
@@ -1161,6 +1189,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
.flags = flags,
.pos = *ppos,
.u.file = out,
+ .splice_eof = direct_file_splice_eof,
.opos = opos,
};
long ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index df92f4b3d122..de2cb1132f07 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1796,6 +1796,7 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+ void (*splice_eof)(struct file *file);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
diff --git a/include/linux/net.h b/include/linux/net.h
index b73ad8e3c212..8defc8f1d82e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -210,6 +210,7 @@ struct proto_ops {
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+ void (*splice_eof)(struct socket *sock);
int (*set_peek_off)(struct sock *sk, int val);
int (*peek_len)(struct socket *sock);

diff --git a/include/linux/splice.h b/include/linux/splice.h
index 991ae318b6eb..4fab18a6e371 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -38,6 +38,7 @@ struct splice_desc {
struct file *file; /* file to read/write */
void *data; /* cookie */
} u;
+ void (*splice_eof)(struct splice_desc *sd); /* Unexpected EOF handler */
loff_t pos; /* file position */
loff_t *opos; /* sendfile: output position */
size_t num_spliced; /* number of bytes already spliced */
diff --git a/include/net/sock.h b/include/net/sock.h
index b418425d7230..ae2d74a0bc4c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1271,6 +1271,7 @@ struct proto {
size_t len, int flags, int *addr_len);
int (*sendpage)(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
+ void (*splice_eof)(struct socket *sock);
int (*bind)(struct sock *sk,
struct sockaddr *addr, int addr_len);
int (*bind_add)(struct sock *sk,
diff --git a/net/socket.c b/net/socket.c
index c4d9104418c8..b778fc03c6e0 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -130,6 +130,7 @@ static int sock_fasync(int fd, struct file *filp, int on);
static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
+static void sock_splice_eof(struct file *file);

#ifdef CONFIG_PROC_FS
static void sock_show_fdinfo(struct seq_file *m, struct file *f)
@@ -163,6 +164,7 @@ static const struct file_operations socket_file_ops = {
.fasync = sock_fasync,
.splice_write = splice_to_socket,
.splice_read = sock_splice_read,
+ .splice_eof = sock_splice_eof,
.show_fdinfo = sock_show_fdinfo,
};

@@ -1076,6 +1078,14 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
return sock->ops->splice_read(sock, ppos, pipe, len, flags);
}

+static void sock_splice_eof(struct file *file)
+{
+ struct socket *sock = file->private_data;
+
+ if (sock->ops->splice_eof)
+ sock->ops->splice_eof(sock);
+}
+
static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;


2023-06-07 14:56:57

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 07/14] ipv4, ipv6: Use splice_eof() to flush

Allow splice to undo the effects of MSG_MORE after prematurely ending a
splice/sendfile due to getting an EOF condition (->splice_read() returned
0) after splice had called sendmsg() with MSG_MORE set when the user didn't
set MSG_MORE.

For UDP, a pending packet will not be emitted if the socket is closed
before it is flushed; with this change, it be flushed by ->splice_eof().

For TCP, it's not clear that MSG_MORE is actually effective.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: David Ahern <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
include/net/inet_common.h | 1 +
include/net/tcp.h | 1 +
include/net/udp.h | 1 +
net/ipv4/af_inet.c | 18 ++++++++++++++++++
net/ipv4/tcp.c | 16 ++++++++++++++++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/udp.c | 16 ++++++++++++++++
net/ipv6/af_inet6.c | 1 +
net/ipv6/tcp_ipv6.c | 1 +
net/ipv6/udp.c | 18 ++++++++++++++++++
10 files changed, 74 insertions(+)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 77f4b0ef5b92..a75333342c4e 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -35,6 +35,7 @@ void __inet_accept(struct socket *sock, struct socket *newsock,
struct sock *newsk);
int inet_send_prepare(struct sock *sk);
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
+void inet_splice_eof(struct socket *sock);
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags);
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 68990a8f556a..49611af31bb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
size_t size, struct ubuf_info *uarg);
+void tcp_splice_eof(struct socket *sock);
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
diff --git a/include/net/udp.h b/include/net/udp.h
index 5cad44318d71..4ed0b47c5582 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -278,6 +278,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
int udp_err(struct sk_buff *, u32);
int udp_abort(struct sock *sk, int err);
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+void udp_splice_eof(struct socket *sock);
int udp_push_pending_frames(struct sock *sk);
void udp_flush_pending_frames(struct sock *sk);
int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b5735b3551cf..6cfb78592836 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -831,6 +831,21 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(inet_sendmsg);

+void inet_splice_eof(struct socket *sock)
+{
+ const struct proto *prot;
+ struct sock *sk = sock->sk;
+
+ if (unlikely(inet_send_prepare(sk)))
+ return;
+
+ /* IPV6_ADDRFORM can change sk->sk_prot under us. */
+ prot = READ_ONCE(sk->sk_prot);
+ if (prot->splice_eof)
+ sk->sk_prot->splice_eof(sock);
+}
+EXPORT_SYMBOL_GPL(inet_splice_eof);
+
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
@@ -1050,6 +1065,7 @@ const struct proto_ops inet_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
@@ -1084,6 +1100,7 @@ const struct proto_ops inet_dgram_ops = {
.read_skb = udp_read_skb,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
@@ -1115,6 +1132,7 @@ static const struct proto_ops inet_sockraw_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53b7751b68e1..09f03221a6f1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1371,6 +1371,22 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(tcp_sendmsg);

+void tcp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct tcp_sock *tp = tcp_sk(sk);
+ int mss_now, size_goal;
+
+ if (!tcp_write_queue_tail(sk))
+ return;
+
+ lock_sock(sk);
+ mss_now = tcp_send_mss(sk, &size_goal, 0);
+ tcp_push(sk, 0, mss_now, tp->nonagle, size_goal);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(tcp_splice_eof);
+
/*
* Handle reading urgent data. BSD has very simple semantics for
* this, no blocking and very strange errors 8)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 53e9ce2f05bb..84a5d557dc1a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3116,6 +3116,7 @@ struct proto tcp_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fd3dae081f3a..df5e407286d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1324,6 +1324,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udp_sendmsg);

+void udp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_push_pending_frames(sk);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(udp_splice_eof);
+
int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
@@ -2918,6 +2933,7 @@ struct proto udp_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
+ .splice_eof = udp_splice_eof,
.sendpage = udp_sendpage,
.release_cb = ip4_datagram_release_cb,
.hash = udp_lib_hash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2bbf13216a3d..564942bee067 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -695,6 +695,7 @@ const struct proto_ops inet6_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
.sendpage_locked = tcp_sendpage_locked,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d657713d1c71..c17c8ff94b79 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2150,6 +2150,7 @@ struct proto tcpv6_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v6_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..6c5975b13ae3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1653,6 +1653,23 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udpv6_sendmsg);

+static void udpv6_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ if (up->pending == AF_INET)
+ udp_splice_eof(sock);
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_push_pending_frames(sk);
+ release_sock(sk);
+}
+
void udpv6_destroy_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
@@ -1764,6 +1781,7 @@ struct proto udpv6_prot = {
.getsockopt = udpv6_getsockopt,
.sendmsg = udpv6_sendmsg,
.recvmsg = udpv6_recvmsg,
+ .splice_eof = udpv6_splice_eof,
.release_cb = ip6_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,


2023-06-07 15:00:39

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 12/14] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES

Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg()
with MSG_SPLICE_PAGES rather than directly splicing in the pages itself.

[!] Note that tls_sw_sendpage_locked() appears to have the wrong locking
upstream. I think the caller will only hold the socket lock, but it
should hold tls_ctx->tx_lock too.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
---
net/tls/tls_sw.c | 173 ++++++++++-------------------------------------
1 file changed, 35 insertions(+), 138 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index fcbaf594c300..aafa02bb5723 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -963,7 +963,8 @@ static int tls_sw_sendmsg_splice(struct sock *sk, struct msghdr *msg,
return 0;
}

-int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
+ size_t size)
{
long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
struct tls_context *tls_ctx = tls_get_ctx(sk);
@@ -986,15 +987,6 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
int ret = 0;
int pending;

- if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
- MSG_CMSG_COMPAT | MSG_SPLICE_PAGES))
- return -EOPNOTSUPP;
-
- ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
- if (ret)
- return ret;
- lock_sock(sk);
-
if (unlikely(msg->msg_controllen)) {
ret = tls_process_cmsg(sk, msg, &record_type);
if (ret) {
@@ -1195,10 +1187,27 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)

send_end:
ret = sk_stream_error(sk, msg->msg_flags, ret);
+ return copied > 0 ? copied : ret;
+}

+int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+ struct tls_context *tls_ctx = tls_get_ctx(sk);
+ int ret;
+
+ if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
+ MSG_CMSG_COMPAT | MSG_SPLICE_PAGES |
+ MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
+ return -EOPNOTSUPP;
+
+ ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
+ if (ret)
+ return ret;
+ lock_sock(sk);
+ ret = tls_sw_sendmsg_locked(sk, msg, size);
release_sock(sk);
mutex_unlock(&tls_ctx->tx_lock);
- return copied > 0 ? copied : ret;
+ return ret;
}

/*
@@ -1275,151 +1284,39 @@ void tls_sw_splice_eof(struct socket *sock)
mutex_unlock(&tls_ctx->tx_lock);
}

-static int tls_sw_do_sendpage(struct sock *sk, struct page *page,
- int offset, size_t size, int flags)
-{
- long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
- struct tls_context *tls_ctx = tls_get_ctx(sk);
- struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
- struct tls_prot_info *prot = &tls_ctx->prot_info;
- unsigned char record_type = TLS_RECORD_TYPE_DATA;
- struct sk_msg *msg_pl;
- struct tls_rec *rec;
- int num_async = 0;
- ssize_t copied = 0;
- bool full_record;
- int record_room;
- int ret = 0;
- bool eor;
-
- eor = !(flags & MSG_SENDPAGE_NOTLAST);
- sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
- /* Call the sk_stream functions to manage the sndbuf mem. */
- while (size > 0) {
- size_t copy, required_size;
-
- if (sk->sk_err) {
- ret = -sk->sk_err;
- goto sendpage_end;
- }
-
- if (ctx->open_rec)
- rec = ctx->open_rec;
- else
- rec = ctx->open_rec = tls_get_rec(sk);
- if (!rec) {
- ret = -ENOMEM;
- goto sendpage_end;
- }
-
- msg_pl = &rec->msg_plaintext;
-
- full_record = false;
- record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size;
- copy = size;
- if (copy >= record_room) {
- copy = record_room;
- full_record = true;
- }
-
- required_size = msg_pl->sg.size + copy + prot->overhead_size;
-
- if (!sk_stream_memory_free(sk))
- goto wait_for_sndbuf;
-alloc_payload:
- ret = tls_alloc_encrypted_msg(sk, required_size);
- if (ret) {
- if (ret != -ENOSPC)
- goto wait_for_memory;
-
- /* Adjust copy according to the amount that was
- * actually allocated. The difference is due
- * to max sg elements limit
- */
- copy -= required_size - msg_pl->sg.size;
- full_record = true;
- }
-
- sk_msg_page_add(msg_pl, page, copy, offset);
- sk_mem_charge(sk, copy);
-
- offset += copy;
- size -= copy;
- copied += copy;
-
- tls_ctx->pending_open_record_frags = true;
- if (full_record || eor || sk_msg_full(msg_pl)) {
- ret = bpf_exec_tx_verdict(msg_pl, sk, full_record,
- record_type, &copied, flags);
- if (ret) {
- if (ret == -EINPROGRESS)
- num_async++;
- else if (ret == -ENOMEM)
- goto wait_for_memory;
- else if (ret != -EAGAIN) {
- if (ret == -ENOSPC)
- ret = 0;
- goto sendpage_end;
- }
- }
- }
- continue;
-wait_for_sndbuf:
- set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
- ret = sk_stream_wait_memory(sk, &timeo);
- if (ret) {
- if (ctx->open_rec)
- tls_trim_both_msgs(sk, msg_pl->sg.size);
- goto sendpage_end;
- }
-
- if (ctx->open_rec)
- goto alloc_payload;
- }
-
- if (num_async) {
- /* Transmit if any encryptions have completed */
- if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) {
- cancel_delayed_work(&ctx->tx_work.work);
- tls_tx_records(sk, flags);
- }
- }
-sendpage_end:
- ret = sk_stream_error(sk, flags, ret);
- return copied > 0 ? copied : ret;
-}
-
int tls_sw_sendpage_locked(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
+
if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY |
MSG_NO_SHARED_FRAGS))
return -EOPNOTSUPP;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- return tls_sw_do_sendpage(sk, page, offset, size, flags);
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_sw_sendmsg_locked(sk, &msg, size);
}

int tls_sw_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
- struct tls_context *tls_ctx = tls_get_ctx(sk);
- int ret;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
MSG_SENDPAGE_NOTLAST | MSG_SENDPAGE_NOPOLICY))
return -EOPNOTSUPP;
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ msg.msg_flags |= MSG_MORE;

- ret = mutex_lock_interruptible(&tls_ctx->tx_lock);
- if (ret)
- return ret;
- lock_sock(sk);
- ret = tls_sw_do_sendpage(sk, page, offset, size, flags);
- release_sock(sk);
- mutex_unlock(&tls_ctx->tx_lock);
- return ret;
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_sw_sendmsg(sk, &msg, size);
}

static int


2023-06-07 15:01:05

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 06/14] tls/device: Use splice_eof() to flush

Allow splice to end a TLS record after prematurely ending a splice/sendfile
due to getting an EOF condition (->splice_read() returned 0) after splice
had called TLS with a sendmsg() with MSG_MORE set when the user didn't set
MSG_MORE.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls.h | 1 +
net/tls/tls_device.c | 23 +++++++++++++++++++++++
net/tls/tls_main.c | 2 ++
3 files changed, 26 insertions(+)

diff --git a/net/tls/tls.h b/net/tls/tls.h
index 4922668fefaa..d002c3af1966 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -116,6 +116,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
size_t len, unsigned int flags);

int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+void tls_device_splice_eof(struct socket *sock);
int tls_device_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
int tls_tx_records(struct sock *sk, int flags);
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 9ef766e41c7a..439be833dcf9 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -590,6 +590,29 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
return rc;
}

+void tls_device_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct tls_context *tls_ctx = tls_get_ctx(sk);
+ union tls_iter_offset iter;
+ struct iov_iter iov_iter = {};
+
+ if (!tls_is_partially_sent_record(tls_ctx))
+ return;
+
+ mutex_lock(&tls_ctx->tx_lock);
+ lock_sock(sk);
+
+ if (tls_is_partially_sent_record(tls_ctx)) {
+ iov_iter_bvec(&iov_iter, ITER_SOURCE, NULL, 0, 0);
+ iter.msg_iter = &iov_iter;
+ tls_push_data(sk, iter, 0, 0, TLS_RECORD_TYPE_DATA, NULL);
+ }
+
+ release_sock(sk);
+ mutex_unlock(&tls_ctx->tx_lock);
+}
+
int tls_device_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 82ec5c654f32..7b9c83dd7de2 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -1044,10 +1044,12 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
#ifdef CONFIG_TLS_DEVICE
prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE];
prot[TLS_HW][TLS_BASE].sendmsg = tls_device_sendmsg;
+ prot[TLS_HW][TLS_BASE].splice_eof = tls_device_splice_eof;
prot[TLS_HW][TLS_BASE].sendpage = tls_device_sendpage;

prot[TLS_HW][TLS_SW] = prot[TLS_BASE][TLS_SW];
prot[TLS_HW][TLS_SW].sendmsg = tls_device_sendmsg;
+ prot[TLS_HW][TLS_SW].splice_eof = tls_device_splice_eof;
prot[TLS_HW][TLS_SW].sendpage = tls_device_sendpage;

prot[TLS_BASE][TLS_HW] = prot[TLS_BASE][TLS_SW];


2023-06-07 15:02:13

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 01/14] net: Block MSG_SENDPAGE_* from being passed to sendmsg() by userspace

It is necessary to allow MSG_SENDPAGE_* to be passed into ->sendmsg() to
allow sendmsg(MSG_SPLICE_PAGES) to replace ->sendpage(). Unblocking them
in the network protocol, however, allows these flags to be passed in by
userspace too[1].

Fix this by marking MSG_SENDPAGE_NOPOLICY, MSG_SENDPAGE_NOTLAST and
MSG_SENDPAGE_DECRYPTED as internal flags, which causes sendmsg() to object
if they are passed to sendmsg() by userspace. Network protocol ->sendmsg()
implementations can then allow them through.

Note that it should be possible to remove MSG_SENDPAGE_NOTLAST once
sendpage is removed as a whole slew of pages will be passed in in one go by
splice through sendmsg, with MSG_MORE being set if it has more data waiting
in the pipe.

Signed-off-by: David Howells <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
---
include/linux/socket.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index bd1cc3238851..3fd3436bc09f 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -339,7 +339,9 @@ struct ucred {
#endif

/* Flags to be cleared on entry by sendmsg and sendmmsg syscalls */
-#define MSG_INTERNAL_SENDMSG_FLAGS (MSG_SPLICE_PAGES)
+#define MSG_INTERNAL_SENDMSG_FLAGS \
+ (MSG_SPLICE_PAGES | MSG_SENDPAGE_NOPOLICY | MSG_SENDPAGE_NOTLAST | \
+ MSG_SENDPAGE_DECRYPTED)

/* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
#define SOL_IP 0


2023-06-07 15:13:09

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 14/14] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

Convert tls_device_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself. With that, the tls_iter_offset
union is no longer necessary and can be replaced with an iov_iter pointer
and the zc_page argument to tls_push_data() can also be removed.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_device.c | 92 +++++++++++---------------------------------
1 file changed, 23 insertions(+), 69 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index bb3bb523544e..b4864d55900f 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -422,16 +422,10 @@ static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
return 0;
}

-union tls_iter_offset {
- struct iov_iter *msg_iter;
- int offset;
-};
-
static int tls_push_data(struct sock *sk,
- union tls_iter_offset iter_offset,
+ struct iov_iter *iter,
size_t size, int flags,
- unsigned char record_type,
- struct page *zc_page)
+ unsigned char record_type)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_prot_info *prot = &tls_ctx->prot_info;
@@ -500,22 +494,13 @@ static int tls_push_data(struct sock *sk,
record = ctx->open_record;

copy = min_t(size_t, size, max_open_record_len - record->len);
- if (copy && zc_page) {
- struct page_frag zc_pfrag;
-
- zc_pfrag.page = zc_page;
- zc_pfrag.offset = iter_offset.offset;
- zc_pfrag.size = copy;
- tls_append_frag(record, &zc_pfrag, copy);
-
- iter_offset.offset += copy;
- } else if (copy && (flags & MSG_SPLICE_PAGES)) {
+ if (copy && (flags & MSG_SPLICE_PAGES)) {
struct page_frag zc_pfrag;
struct page **pages = &zc_pfrag.page;
size_t off;

- rc = iov_iter_extract_pages(iter_offset.msg_iter,
- &pages, copy, 1, 0, &off);
+ rc = iov_iter_extract_pages(iter, &pages,
+ copy, 1, 0, &off);
if (rc <= 0) {
if (rc == 0)
rc = -EIO;
@@ -524,7 +509,7 @@ static int tls_push_data(struct sock *sk,
copy = rc;

if (WARN_ON_ONCE(!sendpage_ok(zc_pfrag.page))) {
- iov_iter_revert(iter_offset.msg_iter, copy);
+ iov_iter_revert(iter, copy);
rc = -EIO;
goto handle_error;
}
@@ -537,7 +522,7 @@ static int tls_push_data(struct sock *sk,

rc = tls_device_copy_data(page_address(pfrag->page) +
pfrag->offset, copy,
- iter_offset.msg_iter);
+ iter);
if (rc)
goto handle_error;
tls_append_frag(record, pfrag, copy);
@@ -592,7 +577,6 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
unsigned char record_type = TLS_RECORD_TYPE_DATA;
struct tls_context *tls_ctx = tls_get_ctx(sk);
- union tls_iter_offset iter;
int rc;

if (!tls_ctx->zerocopy_sendfile)
@@ -607,8 +591,8 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
goto out;
}

- iter.msg_iter = &msg->msg_iter;
- rc = tls_push_data(sk, iter, size, msg->msg_flags, record_type, NULL);
+ rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags,
+ record_type);

out:
release_sock(sk);
@@ -620,8 +604,7 @@ void tls_device_splice_eof(struct socket *sock)
{
struct sock *sk = sock->sk;
struct tls_context *tls_ctx = tls_get_ctx(sk);
- union tls_iter_offset iter;
- struct iov_iter iov_iter = {};
+ struct iov_iter iter = {};

if (!tls_is_partially_sent_record(tls_ctx))
return;
@@ -630,9 +613,8 @@ void tls_device_splice_eof(struct socket *sock)
lock_sock(sk);

if (tls_is_partially_sent_record(tls_ctx)) {
- iov_iter_bvec(&iov_iter, ITER_SOURCE, NULL, 0, 0);
- iter.msg_iter = &iov_iter;
- tls_push_data(sk, iter, 0, 0, TLS_RECORD_TYPE_DATA, NULL);
+ iov_iter_bvec(&iter, ITER_SOURCE, NULL, 0, 0);
+ tls_push_data(sk, &iter, 0, 0, TLS_RECORD_TYPE_DATA);
}

release_sock(sk);
@@ -642,44 +624,18 @@ void tls_device_splice_eof(struct socket *sock)
int tls_device_sendpage(struct sock *sk, struct page *page,
int offset, size_t size, int flags)
{
- struct tls_context *tls_ctx = tls_get_ctx(sk);
- union tls_iter_offset iter_offset;
- struct iov_iter msg_iter;
- char *kaddr;
- struct kvec iov;
- int rc;
+ struct bio_vec bvec;
+ struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

if (flags & MSG_SENDPAGE_NOTLAST)
- flags |= MSG_MORE;
-
- mutex_lock(&tls_ctx->tx_lock);
- lock_sock(sk);
+ msg.msg_flags |= MSG_MORE;

- if (flags & MSG_OOB) {
- rc = -EOPNOTSUPP;
- goto out;
- }
-
- if (tls_ctx->zerocopy_sendfile) {
- iter_offset.offset = offset;
- rc = tls_push_data(sk, iter_offset, size,
- flags, TLS_RECORD_TYPE_DATA, page);
- goto out;
- }
-
- kaddr = kmap(page);
- iov.iov_base = kaddr + offset;
- iov.iov_len = size;
- iov_iter_kvec(&msg_iter, ITER_SOURCE, &iov, 1, size);
- iter_offset.msg_iter = &msg_iter;
- rc = tls_push_data(sk, iter_offset, size, flags, TLS_RECORD_TYPE_DATA,
- NULL);
- kunmap(page);
+ if (flags & MSG_OOB)
+ return -EOPNOTSUPP;

-out:
- release_sock(sk);
- mutex_unlock(&tls_ctx->tx_lock);
- return rc;
+ bvec_set_page(&bvec, page, size, offset);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+ return tls_device_sendmsg(sk, &msg, size);
}

struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
@@ -744,12 +700,10 @@ EXPORT_SYMBOL(tls_get_record);

static int tls_device_push_pending_record(struct sock *sk, int flags)
{
- union tls_iter_offset iter;
- struct iov_iter msg_iter;
+ struct iov_iter iter;

- iov_iter_kvec(&msg_iter, ITER_SOURCE, NULL, 0, 0);
- iter.msg_iter = &msg_iter;
- return tls_push_data(sk, iter, 0, flags, TLS_RECORD_TYPE_DATA, NULL);
+ iov_iter_kvec(&iter, ITER_SOURCE, NULL, 0, 0);
+ return tls_push_data(sk, &iter, 0, flags, TLS_RECORD_TYPE_DATA);
}

void tls_device_write_space(struct sock *sk, struct tls_context *ctx)


2023-06-07 15:20:54

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 09/14] kcm: Use splice_eof() to flush

Allow splice to undo the effects of MSG_MORE after prematurely ending a
splice/sendfile due to getting an EOF condition (->splice_read() returned
0) after splice had called sendmsg() with MSG_MORE set when the user didn't
set MSG_MORE.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Tom Herbert <[email protected]>
cc: Cong Wang <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/kcm/kcmsock.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index ba22af16b96d..d0d8c54562d6 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -968,6 +968,19 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
return err;
}

+static void kcm_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct kcm_sock *kcm = kcm_sk(sk);
+
+ if (skb_queue_empty(&sk->sk_write_queue))
+ return;
+
+ lock_sock(sk);
+ kcm_write_msgs(kcm);
+ release_sock(sk);
+}
+
static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
int offset, size_t size, int flags)

@@ -1773,6 +1786,7 @@ static const struct proto_ops kcm_dgram_ops = {
.sendmsg = kcm_sendmsg,
.recvmsg = kcm_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = kcm_splice_eof,
.sendpage = kcm_sendpage,
};

@@ -1794,6 +1808,7 @@ static const struct proto_ops kcm_seqpacket_ops = {
.sendmsg = kcm_sendmsg,
.recvmsg = kcm_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = kcm_splice_eof,
.sendpage = kcm_sendpage,
.splice_read = kcm_splice_read,
};


2023-06-07 15:23:31

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 02/14] tls: Allow MSG_SPLICE_PAGES but treat it as normal sendmsg

Allow MSG_SPLICE_PAGES to be specified to sendmsg() but treat it as normal
sendmsg for now. This means the data will just be copied until
MSG_SPLICE_PAGES is handled.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_device.c | 3 ++-
net/tls/tls_sw.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index a959572a816f..9ef766e41c7a 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -447,7 +447,8 @@ static int tls_push_data(struct sock *sk,
long timeo;

if (flags &
- ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST))
+ ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST |
+ MSG_SPLICE_PAGES))
return -EOPNOTSUPP;

if (unlikely(sk->sk_err))
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 1a53c8f481e9..38acc27a0dd0 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -955,7 +955,7 @@ int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
int pending;

if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
- MSG_CMSG_COMPAT))
+ MSG_CMSG_COMPAT | MSG_SPLICE_PAGES))
return -EOPNOTSUPP;

ret = mutex_lock_interruptible(&tls_ctx->tx_lock);


2023-06-07 15:55:51

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH net-next v5 07/14] ipv4, ipv6: Use splice_eof() to flush

From: David Howells <[email protected]>
Date: Wed, 7 Jun 2023 15:05:52 +0100
> Allow splice to undo the effects of MSG_MORE after prematurely ending a
> splice/sendfile due to getting an EOF condition (->splice_read() returned
> 0) after splice had called sendmsg() with MSG_MORE set when the user didn't
> set MSG_MORE.
>
> For UDP, a pending packet will not be emitted if the socket is closed
> before it is flushed; with this change, it be flushed by ->splice_eof().
>
> For TCP, it's not clear that MSG_MORE is actually effective.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
> Signed-off-by: David Howells <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Willem de Bruijn <[email protected]>
> cc: David Ahern <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> ---
> include/net/inet_common.h | 1 +
> include/net/tcp.h | 1 +
> include/net/udp.h | 1 +
> net/ipv4/af_inet.c | 18 ++++++++++++++++++
> net/ipv4/tcp.c | 16 ++++++++++++++++
> net/ipv4/tcp_ipv4.c | 1 +
> net/ipv4/udp.c | 16 ++++++++++++++++
> net/ipv6/af_inet6.c | 1 +
> net/ipv6/tcp_ipv6.c | 1 +
> net/ipv6/udp.c | 18 ++++++++++++++++++
> 10 files changed, 74 insertions(+)
>
> diff --git a/include/net/inet_common.h b/include/net/inet_common.h
> index 77f4b0ef5b92..a75333342c4e 100644
> --- a/include/net/inet_common.h
> +++ b/include/net/inet_common.h
> @@ -35,6 +35,7 @@ void __inet_accept(struct socket *sock, struct socket *newsock,
> struct sock *newsk);
> int inet_send_prepare(struct sock *sk);
> int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> +void inet_splice_eof(struct socket *sock);
> ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags);
> int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 68990a8f556a..49611af31bb7 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -327,6 +327,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
> int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size);
> int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
> size_t size, struct ubuf_info *uarg);
> +void tcp_splice_eof(struct socket *sock);
> int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
> int flags);
> int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 5cad44318d71..4ed0b47c5582 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -278,6 +278,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
> int udp_err(struct sk_buff *, u32);
> int udp_abort(struct sock *sk, int err);
> int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
> +void udp_splice_eof(struct socket *sock);
> int udp_push_pending_frames(struct sock *sk);
> void udp_flush_pending_frames(struct sock *sk);
> int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index b5735b3551cf..6cfb78592836 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -831,6 +831,21 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
> }
> EXPORT_SYMBOL(inet_sendmsg);
>
> +void inet_splice_eof(struct socket *sock)
> +{
> + const struct proto *prot;
> + struct sock *sk = sock->sk;
> +
> + if (unlikely(inet_send_prepare(sk)))
> + return;
> +
> + /* IPV6_ADDRFORM can change sk->sk_prot under us. */
> + prot = READ_ONCE(sk->sk_prot);
> + if (prot->splice_eof)
> + sk->sk_prot->splice_eof(sock);

We need to use prot here.


> +}
> +EXPORT_SYMBOL_GPL(inet_splice_eof);
> +
> ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags)
> {
> @@ -1050,6 +1065,7 @@ const struct proto_ops inet_stream_ops = {
> #ifdef CONFIG_MMU
> .mmap = tcp_mmap,
> #endif
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .splice_read = tcp_splice_read,
> .read_sock = tcp_read_sock,
> @@ -1084,6 +1100,7 @@ const struct proto_ops inet_dgram_ops = {
> .read_skb = udp_read_skb,
> .recvmsg = inet_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .set_peek_off = sk_set_peek_off,
> #ifdef CONFIG_COMPAT
> @@ -1115,6 +1132,7 @@ static const struct proto_ops inet_sockraw_ops = {
> .sendmsg = inet_sendmsg,
> .recvmsg = inet_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> #ifdef CONFIG_COMPAT
> .compat_ioctl = inet_compat_ioctl,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53b7751b68e1..09f03221a6f1 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1371,6 +1371,22 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> }
> EXPORT_SYMBOL(tcp_sendmsg);
>
> +void tcp_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> + int mss_now, size_goal;
> +
> + if (!tcp_write_queue_tail(sk))
> + return;
> +
> + lock_sock(sk);
> + mss_now = tcp_send_mss(sk, &size_goal, 0);
> + tcp_push(sk, 0, mss_now, tp->nonagle, size_goal);
> + release_sock(sk);
> +}
> +EXPORT_SYMBOL_GPL(tcp_splice_eof);
> +
> /*
> * Handle reading urgent data. BSD has very simple semantics for
> * this, no blocking and very strange errors 8)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 53e9ce2f05bb..84a5d557dc1a 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3116,6 +3116,7 @@ struct proto tcp_prot = {
> .keepalive = tcp_set_keepalive,
> .recvmsg = tcp_recvmsg,
> .sendmsg = tcp_sendmsg,
> + .splice_eof = tcp_splice_eof,
> .sendpage = tcp_sendpage,
> .backlog_rcv = tcp_v4_do_rcv,
> .release_cb = tcp_release_cb,
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index fd3dae081f3a..df5e407286d7 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1324,6 +1324,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> }
> EXPORT_SYMBOL(udp_sendmsg);
>
> +void udp_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct udp_sock *up = udp_sk(sk);
> +
> + if (!up->pending || READ_ONCE(up->corkflag))
> + return;
> +
> + lock_sock(sk);
> + if (up->pending && !READ_ONCE(up->corkflag))
> + udp_push_pending_frames(sk);
> + release_sock(sk);
> +}
> +EXPORT_SYMBOL_GPL(udp_splice_eof);
> +
> int udp_sendpage(struct sock *sk, struct page *page, int offset,
> size_t size, int flags)
> {
> @@ -2918,6 +2933,7 @@ struct proto udp_prot = {
> .getsockopt = udp_getsockopt,
> .sendmsg = udp_sendmsg,
> .recvmsg = udp_recvmsg,
> + .splice_eof = udp_splice_eof,
> .sendpage = udp_sendpage,
> .release_cb = ip4_datagram_release_cb,
> .hash = udp_lib_hash,
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 2bbf13216a3d..564942bee067 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -695,6 +695,7 @@ const struct proto_ops inet6_stream_ops = {
> #ifdef CONFIG_MMU
> .mmap = tcp_mmap,
> #endif
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .sendmsg_locked = tcp_sendmsg_locked,
> .sendpage_locked = tcp_sendpage_locked,
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index d657713d1c71..c17c8ff94b79 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -2150,6 +2150,7 @@ struct proto tcpv6_prot = {
> .keepalive = tcp_set_keepalive,
> .recvmsg = tcp_recvmsg,
> .sendmsg = tcp_sendmsg,
> + .splice_eof = tcp_splice_eof,
> .sendpage = tcp_sendpage,
> .backlog_rcv = tcp_v6_do_rcv,
> .release_cb = tcp_release_cb,
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index e5a337e6b970..6c5975b13ae3 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1653,6 +1653,23 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> }
> EXPORT_SYMBOL(udpv6_sendmsg);
>
> +static void udpv6_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct udp_sock *up = udp_sk(sk);
> +
> + if (!up->pending || READ_ONCE(up->corkflag))
> + return;
> +
> + if (up->pending == AF_INET)
> + udp_splice_eof(sock);

Do we need this ?


> +
> + lock_sock(sk);
> + if (up->pending && !READ_ONCE(up->corkflag))
> + udp_push_pending_frames(sk);

We should use udp_v6_push_pending_frames(sk) as up->pending
could be AF_INET even after the test above.


> + release_sock(sk);
> +}
> +
> void udpv6_destroy_sock(struct sock *sk)
> {
> struct udp_sock *up = udp_sk(sk);
> @@ -1764,6 +1781,7 @@ struct proto udpv6_prot = {
> .getsockopt = udpv6_getsockopt,
> .sendmsg = udpv6_sendmsg,
> .recvmsg = udpv6_recvmsg,
> + .splice_eof = udpv6_splice_eof,
> .release_cb = ip6_datagram_release_cb,
> .hash = udp_lib_hash,
> .unhash = udp_lib_unhash,

2023-06-07 16:03:47

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v5 07/14] ipv4, ipv6: Use splice_eof() to flush

Kuniyuki Iwashima <[email protected]> wrote:

> > + /* IPV6_ADDRFORM can change sk->sk_prot under us. */
> > + prot = READ_ONCE(sk->sk_prot);
> > + if (prot->splice_eof)
> > + sk->sk_prot->splice_eof(sock);
>
> We need to use prot here.

Yeah.

> > + if (up->pending == AF_INET)
> > + udp_splice_eof(sock);
>
> Do we need this ?

Actually, no. udp_v6_push_pending_frames() will do this.

> > + lock_sock(sk);
> > + if (up->pending && !READ_ONCE(up->corkflag))
> > + udp_push_pending_frames(sk);
>
> We should use udp_v6_push_pending_frames(sk) as up->pending
> could be AF_INET even after the test above.

Yeah.

Updated version attached for your perusal (I will post a v6 too).

David
---
commit 8b95b9cd654835eb2ff1ad24cd6de802836c4062
Author: David Howells <[email protected]>
Date: Wed Jun 7 14:44:34 2023 +0100

ipv4, ipv6: Use splice_eof() to flush

Allow splice to undo the effects of MSG_MORE after prematurely ending a
splice/sendfile due to getting an EOF condition (->splice_read() returned
0) after splice had called sendmsg() with MSG_MORE set when the user didn't
set MSG_MORE.

For UDP, a pending packet will not be emitted if the socket is closed
before it is flushed; with this change, it be flushed by ->splice_eof().

For TCP, it's not clear that MSG_MORE is actually effective.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Kuniyuki Iwashima <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: David Ahern <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]

Notes:
ver #6)
- In inet_splice_eof(), use prot after deref of sk->sk_prot.
- In udpv6_splice_eof(), use udp_v6_push_pending_frames().
- In udpv6_splice_eof(), don't check for AF_INET.

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 77f4b0ef5b92..a75333342c4e 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -35,6 +35,7 @@ void __inet_accept(struct socket *sock, struct socket *newsock,
struct sock *newsk);
int inet_send_prepare(struct sock *sk);
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
+void inet_splice_eof(struct socket *sock);
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags);
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 68990a8f556a..49611af31bb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
size_t size, struct ubuf_info *uarg);
+void tcp_splice_eof(struct socket *sock);
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
diff --git a/include/net/udp.h b/include/net/udp.h
index 5cad44318d71..4ed0b47c5582 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -278,6 +278,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
int udp_err(struct sk_buff *, u32);
int udp_abort(struct sock *sk, int err);
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+void udp_splice_eof(struct socket *sock);
int udp_push_pending_frames(struct sock *sk);
void udp_flush_pending_frames(struct sock *sk);
int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b5735b3551cf..fd233c4195ac 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -831,6 +831,21 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(inet_sendmsg);

+void inet_splice_eof(struct socket *sock)
+{
+ const struct proto *prot;
+ struct sock *sk = sock->sk;
+
+ if (unlikely(inet_send_prepare(sk)))
+ return;
+
+ /* IPV6_ADDRFORM can change sk->sk_prot under us. */
+ prot = READ_ONCE(sk->sk_prot);
+ if (prot->splice_eof)
+ prot->splice_eof(sock);
+}
+EXPORT_SYMBOL_GPL(inet_splice_eof);
+
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
@@ -1050,6 +1065,7 @@ const struct proto_ops inet_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
@@ -1084,6 +1100,7 @@ const struct proto_ops inet_dgram_ops = {
.read_skb = udp_read_skb,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
@@ -1115,6 +1132,7 @@ static const struct proto_ops inet_sockraw_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53b7751b68e1..09f03221a6f1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1371,6 +1371,22 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(tcp_sendmsg);

+void tcp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct tcp_sock *tp = tcp_sk(sk);
+ int mss_now, size_goal;
+
+ if (!tcp_write_queue_tail(sk))
+ return;
+
+ lock_sock(sk);
+ mss_now = tcp_send_mss(sk, &size_goal, 0);
+ tcp_push(sk, 0, mss_now, tp->nonagle, size_goal);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(tcp_splice_eof);
+
/*
* Handle reading urgent data. BSD has very simple semantics for
* this, no blocking and very strange errors 8)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 53e9ce2f05bb..84a5d557dc1a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3116,6 +3116,7 @@ struct proto tcp_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fd3dae081f3a..df5e407286d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1324,6 +1324,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udp_sendmsg);

+void udp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_push_pending_frames(sk);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(udp_splice_eof);
+
int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
@@ -2918,6 +2933,7 @@ struct proto udp_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
+ .splice_eof = udp_splice_eof,
.sendpage = udp_sendpage,
.release_cb = ip4_datagram_release_cb,
.hash = udp_lib_hash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2bbf13216a3d..564942bee067 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -695,6 +695,7 @@ const struct proto_ops inet6_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
.sendpage_locked = tcp_sendpage_locked,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d657713d1c71..c17c8ff94b79 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2150,6 +2150,7 @@ struct proto tcpv6_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v6_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..3a592dc129e9 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1653,6 +1653,20 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udpv6_sendmsg);

+static void udpv6_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_push_pending_frames(sk);
+ release_sock(sk);
+}
+
void udpv6_destroy_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
@@ -1764,6 +1778,7 @@ struct proto udpv6_prot = {
.getsockopt = udpv6_getsockopt,
.sendmsg = udpv6_sendmsg,
.recvmsg = udpv6_recvmsg,
+ .splice_eof = udpv6_splice_eof,
.release_cb = ip6_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,


2023-06-07 16:06:19

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH net-next v5 07/14] ipv4, ipv6: Use splice_eof() to flush

From: David Howells <[email protected]>
Date: Wed, 07 Jun 2023 16:43:52 +0100
> Kuniyuki Iwashima <[email protected]> wrote:
>
> > > + /* IPV6_ADDRFORM can change sk->sk_prot under us. */
> > > + prot = READ_ONCE(sk->sk_prot);
> > > + if (prot->splice_eof)
> > > + sk->sk_prot->splice_eof(sock);
> >
> > We need to use prot here.
>
> Yeah.
>
> > > + if (up->pending == AF_INET)
> > > + udp_splice_eof(sock);
> >
> > Do we need this ?
>
> Actually, no. udp_v6_push_pending_frames() will do this.
>
> > > + lock_sock(sk);
> > > + if (up->pending && !READ_ONCE(up->corkflag))
> > > + udp_push_pending_frames(sk);
> >
> > We should use udp_v6_push_pending_frames(sk) as up->pending
> > could be AF_INET even after the test above.
>
> Yeah.
>
> Updated version attached for your perusal (I will post a v6 too).
>
> David
> ---
> commit 8b95b9cd654835eb2ff1ad24cd6de802836c4062
> Author: David Howells <[email protected]>
> Date: Wed Jun 7 14:44:34 2023 +0100
>
> ipv4, ipv6: Use splice_eof() to flush
>
> Allow splice to undo the effects of MSG_MORE after prematurely ending a
> splice/sendfile due to getting an EOF condition (->splice_read() returned
> 0) after splice had called sendmsg() with MSG_MORE set when the user didn't
> set MSG_MORE.
>
> For UDP, a pending packet will not be emitted if the socket is closed
> before it is flushed; with this change, it be flushed by ->splice_eof().
>
> For TCP, it's not clear that MSG_MORE is actually effective.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
> Signed-off-by: David Howells <[email protected]>
> cc: Kuniyuki Iwashima <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: Willem de Bruijn <[email protected]>
> cc: David Ahern <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
>
> Notes:
> ver #6)
> - In inet_splice_eof(), use prot after deref of sk->sk_prot.
> - In udpv6_splice_eof(), use udp_v6_push_pending_frames().

You missed this change ;)


> - In udpv6_splice_eof(), don't check for AF_INET.
>
> diff --git a/include/net/inet_common.h b/include/net/inet_common.h
> index 77f4b0ef5b92..a75333342c4e 100644
> --- a/include/net/inet_common.h
> +++ b/include/net/inet_common.h
> @@ -35,6 +35,7 @@ void __inet_accept(struct socket *sock, struct socket *newsock,
> struct sock *newsk);
> int inet_send_prepare(struct sock *sk);
> int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
> +void inet_splice_eof(struct socket *sock);
> ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags);
> int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 68990a8f556a..49611af31bb7 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -327,6 +327,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
> int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size);
> int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
> size_t size, struct ubuf_info *uarg);
> +void tcp_splice_eof(struct socket *sock);
> int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
> int flags);
> int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 5cad44318d71..4ed0b47c5582 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -278,6 +278,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
> int udp_err(struct sk_buff *, u32);
> int udp_abort(struct sock *sk, int err);
> int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
> +void udp_splice_eof(struct socket *sock);
> int udp_push_pending_frames(struct sock *sk);
> void udp_flush_pending_frames(struct sock *sk);
> int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index b5735b3551cf..fd233c4195ac 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -831,6 +831,21 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
> }
> EXPORT_SYMBOL(inet_sendmsg);
>
> +void inet_splice_eof(struct socket *sock)
> +{
> + const struct proto *prot;
> + struct sock *sk = sock->sk;
> +
> + if (unlikely(inet_send_prepare(sk)))
> + return;
> +
> + /* IPV6_ADDRFORM can change sk->sk_prot under us. */
> + prot = READ_ONCE(sk->sk_prot);
> + if (prot->splice_eof)
> + prot->splice_eof(sock);
> +}
> +EXPORT_SYMBOL_GPL(inet_splice_eof);
> +
> ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags)
> {
> @@ -1050,6 +1065,7 @@ const struct proto_ops inet_stream_ops = {
> #ifdef CONFIG_MMU
> .mmap = tcp_mmap,
> #endif
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .splice_read = tcp_splice_read,
> .read_sock = tcp_read_sock,
> @@ -1084,6 +1100,7 @@ const struct proto_ops inet_dgram_ops = {
> .read_skb = udp_read_skb,
> .recvmsg = inet_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .set_peek_off = sk_set_peek_off,
> #ifdef CONFIG_COMPAT
> @@ -1115,6 +1132,7 @@ static const struct proto_ops inet_sockraw_ops = {
> .sendmsg = inet_sendmsg,
> .recvmsg = inet_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> #ifdef CONFIG_COMPAT
> .compat_ioctl = inet_compat_ioctl,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53b7751b68e1..09f03221a6f1 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1371,6 +1371,22 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> }
> EXPORT_SYMBOL(tcp_sendmsg);
>
> +void tcp_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> + int mss_now, size_goal;
> +
> + if (!tcp_write_queue_tail(sk))
> + return;
> +
> + lock_sock(sk);
> + mss_now = tcp_send_mss(sk, &size_goal, 0);
> + tcp_push(sk, 0, mss_now, tp->nonagle, size_goal);
> + release_sock(sk);
> +}
> +EXPORT_SYMBOL_GPL(tcp_splice_eof);
> +
> /*
> * Handle reading urgent data. BSD has very simple semantics for
> * this, no blocking and very strange errors 8)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 53e9ce2f05bb..84a5d557dc1a 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3116,6 +3116,7 @@ struct proto tcp_prot = {
> .keepalive = tcp_set_keepalive,
> .recvmsg = tcp_recvmsg,
> .sendmsg = tcp_sendmsg,
> + .splice_eof = tcp_splice_eof,
> .sendpage = tcp_sendpage,
> .backlog_rcv = tcp_v4_do_rcv,
> .release_cb = tcp_release_cb,
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index fd3dae081f3a..df5e407286d7 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1324,6 +1324,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> }
> EXPORT_SYMBOL(udp_sendmsg);
>
> +void udp_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct udp_sock *up = udp_sk(sk);
> +
> + if (!up->pending || READ_ONCE(up->corkflag))
> + return;
> +
> + lock_sock(sk);
> + if (up->pending && !READ_ONCE(up->corkflag))
> + udp_push_pending_frames(sk);
> + release_sock(sk);
> +}
> +EXPORT_SYMBOL_GPL(udp_splice_eof);
> +
> int udp_sendpage(struct sock *sk, struct page *page, int offset,
> size_t size, int flags)
> {
> @@ -2918,6 +2933,7 @@ struct proto udp_prot = {
> .getsockopt = udp_getsockopt,
> .sendmsg = udp_sendmsg,
> .recvmsg = udp_recvmsg,
> + .splice_eof = udp_splice_eof,
> .sendpage = udp_sendpage,
> .release_cb = ip4_datagram_release_cb,
> .hash = udp_lib_hash,
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 2bbf13216a3d..564942bee067 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -695,6 +695,7 @@ const struct proto_ops inet6_stream_ops = {
> #ifdef CONFIG_MMU
> .mmap = tcp_mmap,
> #endif
> + .splice_eof = inet_splice_eof,
> .sendpage = inet_sendpage,
> .sendmsg_locked = tcp_sendmsg_locked,
> .sendpage_locked = tcp_sendpage_locked,
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index d657713d1c71..c17c8ff94b79 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -2150,6 +2150,7 @@ struct proto tcpv6_prot = {
> .keepalive = tcp_set_keepalive,
> .recvmsg = tcp_recvmsg,
> .sendmsg = tcp_sendmsg,
> + .splice_eof = tcp_splice_eof,
> .sendpage = tcp_sendpage,
> .backlog_rcv = tcp_v6_do_rcv,
> .release_cb = tcp_release_cb,
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index e5a337e6b970..3a592dc129e9 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1653,6 +1653,20 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> }
> EXPORT_SYMBOL(udpv6_sendmsg);
>
> +static void udpv6_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct udp_sock *up = udp_sk(sk);
> +
> + if (!up->pending || READ_ONCE(up->corkflag))
> + return;
> +
> + lock_sock(sk);
> + if (up->pending && !READ_ONCE(up->corkflag))
> + udp_push_pending_frames(sk);
> + release_sock(sk);
> +}
> +
> void udpv6_destroy_sock(struct sock *sk)
> {
> struct udp_sock *up = udp_sk(sk);
> @@ -1764,6 +1778,7 @@ struct proto udpv6_prot = {
> .getsockopt = udpv6_getsockopt,
> .sendmsg = udpv6_sendmsg,
> .recvmsg = udpv6_recvmsg,
> + .splice_eof = udpv6_splice_eof,
> .release_cb = ip6_datagram_release_cb,
> .hash = udp_lib_hash,
> .unhash = udp_lib_unhash,

2023-06-07 16:20:19

by Kuniyuki Iwashima

[permalink] [raw]
Subject: Re: [PATCH net-next v5 09/14] kcm: Use splice_eof() to flush

From: David Howells <[email protected]>
Date: Wed, 7 Jun 2023 15:05:54 +0100
> Allow splice to undo the effects of MSG_MORE after prematurely ending a
> splice/sendfile due to getting an EOF condition (->splice_read() returned
> 0) after splice had called sendmsg() with MSG_MORE set when the user didn't
> set MSG_MORE.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
> Signed-off-by: David Howells <[email protected]>
> cc: Tom Herbert <[email protected]>
> cc: Tom Herbert <[email protected]>
> cc: Cong Wang <[email protected]>
> cc: Jakub Kicinski <[email protected]>
> cc: Eric Dumazet <[email protected]>
> cc: "David S. Miller" <[email protected]>
> cc: Paolo Abeni <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: Matthew Wilcox <[email protected]>
> cc: [email protected]
> ---
> net/kcm/kcmsock.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index ba22af16b96d..d0d8c54562d6 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -968,6 +968,19 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
> return err;
> }
>
> +static void kcm_splice_eof(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> + struct kcm_sock *kcm = kcm_sk(sk);
> +
> + if (skb_queue_empty(&sk->sk_write_queue))

nit: would be better to use skb_queue_empty_lockless().


> + return;
> +
> + lock_sock(sk);
> + kcm_write_msgs(kcm);
> + release_sock(sk);
> +}
> +
> static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
> int offset, size_t size, int flags)
>
> @@ -1773,6 +1786,7 @@ static const struct proto_ops kcm_dgram_ops = {
> .sendmsg = kcm_sendmsg,
> .recvmsg = kcm_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = kcm_splice_eof,
> .sendpage = kcm_sendpage,
> };
>
> @@ -1794,6 +1808,7 @@ static const struct proto_ops kcm_seqpacket_ops = {
> .sendmsg = kcm_sendmsg,
> .recvmsg = kcm_recvmsg,
> .mmap = sock_no_mmap,
> + .splice_eof = kcm_splice_eof,
> .sendpage = kcm_sendpage,
> .splice_read = kcm_splice_read,
> };

2023-06-07 16:22:53

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v5 07/14] ipv4, ipv6: Use splice_eof() to flush

Kuniyuki Iwashima <[email protected]> wrote:

> > - In udpv6_splice_eof(), use udp_v6_push_pending_frames().
>
> You missed this change ;)

No I didn't - I just forgot to save the buffer :-/

David
---
commit a630e96e3b1073dc39fd370d60ccb3d5367ce9e6
Author: David Howells <[email protected]>
Date: Wed Jun 7 14:44:34 2023 +0100

ipv4, ipv6: Use splice_eof() to flush

Allow splice to undo the effects of MSG_MORE after prematurely ending a
splice/sendfile due to getting an EOF condition (->splice_read() returned
0) after splice had called sendmsg() with MSG_MORE set when the user didn't
set MSG_MORE.

For UDP, a pending packet will not be emitted if the socket is closed
before it is flushed; with this change, it be flushed by ->splice_eof().

For TCP, it's not clear that MSG_MORE is actually effective.

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <[email protected]>
cc: Kuniyuki Iwashima <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: David Ahern <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]

Notes:
ver #6)
- In inet_splice_eof(), use prot after deref of sk->sk_prot.
- In udpv6_splice_eof(), use udp_v6_push_pending_frames().
- In udpv6_splice_eof(), don't check for AF_INET.

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 77f4b0ef5b92..a75333342c4e 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -35,6 +35,7 @@ void __inet_accept(struct socket *sock, struct socket *newsock,
struct sock *newsk);
int inet_send_prepare(struct sock *sk);
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
+void inet_splice_eof(struct socket *sock);
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags);
int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 68990a8f556a..49611af31bb7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size);
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied,
size_t size, struct ubuf_info *uarg);
+void tcp_splice_eof(struct socket *sock);
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
int flags);
int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
diff --git a/include/net/udp.h b/include/net/udp.h
index 5cad44318d71..4ed0b47c5582 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -278,6 +278,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
int udp_err(struct sk_buff *, u32);
int udp_abort(struct sock *sk, int err);
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+void udp_splice_eof(struct socket *sock);
int udp_push_pending_frames(struct sock *sk);
void udp_flush_pending_frames(struct sock *sk);
int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index b5735b3551cf..fd233c4195ac 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -831,6 +831,21 @@ int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(inet_sendmsg);

+void inet_splice_eof(struct socket *sock)
+{
+ const struct proto *prot;
+ struct sock *sk = sock->sk;
+
+ if (unlikely(inet_send_prepare(sk)))
+ return;
+
+ /* IPV6_ADDRFORM can change sk->sk_prot under us. */
+ prot = READ_ONCE(sk->sk_prot);
+ if (prot->splice_eof)
+ prot->splice_eof(sock);
+}
+EXPORT_SYMBOL_GPL(inet_splice_eof);
+
ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
@@ -1050,6 +1065,7 @@ const struct proto_ops inet_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.splice_read = tcp_splice_read,
.read_sock = tcp_read_sock,
@@ -1084,6 +1100,7 @@ const struct proto_ops inet_dgram_ops = {
.read_skb = udp_read_skb,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.set_peek_off = sk_set_peek_off,
#ifdef CONFIG_COMPAT
@@ -1115,6 +1132,7 @@ static const struct proto_ops inet_sockraw_ops = {
.sendmsg = inet_sendmsg,
.recvmsg = inet_recvmsg,
.mmap = sock_no_mmap,
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
#ifdef CONFIG_COMPAT
.compat_ioctl = inet_compat_ioctl,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53b7751b68e1..09f03221a6f1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1371,6 +1371,22 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
}
EXPORT_SYMBOL(tcp_sendmsg);

+void tcp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct tcp_sock *tp = tcp_sk(sk);
+ int mss_now, size_goal;
+
+ if (!tcp_write_queue_tail(sk))
+ return;
+
+ lock_sock(sk);
+ mss_now = tcp_send_mss(sk, &size_goal, 0);
+ tcp_push(sk, 0, mss_now, tp->nonagle, size_goal);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(tcp_splice_eof);
+
/*
* Handle reading urgent data. BSD has very simple semantics for
* this, no blocking and very strange errors 8)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 53e9ce2f05bb..84a5d557dc1a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3116,6 +3116,7 @@ struct proto tcp_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v4_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fd3dae081f3a..df5e407286d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1324,6 +1324,21 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udp_sendmsg);

+void udp_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_push_pending_frames(sk);
+ release_sock(sk);
+}
+EXPORT_SYMBOL_GPL(udp_splice_eof);
+
int udp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
@@ -2918,6 +2933,7 @@ struct proto udp_prot = {
.getsockopt = udp_getsockopt,
.sendmsg = udp_sendmsg,
.recvmsg = udp_recvmsg,
+ .splice_eof = udp_splice_eof,
.sendpage = udp_sendpage,
.release_cb = ip4_datagram_release_cb,
.hash = udp_lib_hash,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 2bbf13216a3d..564942bee067 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -695,6 +695,7 @@ const struct proto_ops inet6_stream_ops = {
#ifdef CONFIG_MMU
.mmap = tcp_mmap,
#endif
+ .splice_eof = inet_splice_eof,
.sendpage = inet_sendpage,
.sendmsg_locked = tcp_sendmsg_locked,
.sendpage_locked = tcp_sendpage_locked,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d657713d1c71..c17c8ff94b79 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2150,6 +2150,7 @@ struct proto tcpv6_prot = {
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
+ .splice_eof = tcp_splice_eof,
.sendpage = tcp_sendpage,
.backlog_rcv = tcp_v6_do_rcv,
.release_cb = tcp_release_cb,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e5a337e6b970..317b01c9bc39 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1653,6 +1653,20 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
EXPORT_SYMBOL(udpv6_sendmsg);

+static void udpv6_splice_eof(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+ struct udp_sock *up = udp_sk(sk);
+
+ if (!up->pending || READ_ONCE(up->corkflag))
+ return;
+
+ lock_sock(sk);
+ if (up->pending && !READ_ONCE(up->corkflag))
+ udp_v6_push_pending_frames(sk);
+ release_sock(sk);
+}
+
void udpv6_destroy_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
@@ -1764,6 +1778,7 @@ struct proto udpv6_prot = {
.getsockopt = udpv6_getsockopt,
.sendmsg = udpv6_sendmsg,
.recvmsg = udpv6_recvmsg,
+ .splice_eof = udpv6_splice_eof,
.release_cb = ip6_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,


2023-06-07 16:53:23

by David Howells

[permalink] [raw]
Subject: [PATCH net-next v5 13/14] tls/device: Support MSG_SPLICE_PAGES

Make TLS's device sendmsg() support MSG_SPLICE_PAGES. This causes pages to
be spliced from the source iterator if possible.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <[email protected]>
cc: Chuck Lever <[email protected]>
cc: Boris Pismenny <[email protected]>
cc: John Fastabend <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
---
net/tls/tls_device.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 439be833dcf9..bb3bb523544e 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -509,6 +509,29 @@ static int tls_push_data(struct sock *sk,
tls_append_frag(record, &zc_pfrag, copy);

iter_offset.offset += copy;
+ } else if (copy && (flags & MSG_SPLICE_PAGES)) {
+ struct page_frag zc_pfrag;
+ struct page **pages = &zc_pfrag.page;
+ size_t off;
+
+ rc = iov_iter_extract_pages(iter_offset.msg_iter,
+ &pages, copy, 1, 0, &off);
+ if (rc <= 0) {
+ if (rc == 0)
+ rc = -EIO;
+ goto handle_error;
+ }
+ copy = rc;
+
+ if (WARN_ON_ONCE(!sendpage_ok(zc_pfrag.page))) {
+ iov_iter_revert(iter_offset.msg_iter, copy);
+ rc = -EIO;
+ goto handle_error;
+ }
+
+ zc_pfrag.offset = off;
+ zc_pfrag.size = copy;
+ tls_append_frag(record, &zc_pfrag, copy);
} else if (copy) {
copy = min_t(size_t, copy, pfrag->size - pfrag->offset);

@@ -572,6 +595,9 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
union tls_iter_offset iter;
int rc;

+ if (!tls_ctx->zerocopy_sendfile)
+ msg->msg_flags &= ~MSG_SPLICE_PAGES;
+
mutex_lock(&tls_ctx->tx_lock);
lock_sock(sk);



2023-06-07 17:11:07

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 04/14] splice, net: Add a splice_eof op to file-ops and socket-ops

On Wed, 7 Jun 2023 15:05:49 +0100 David Howells wrote:
> Add an optional method, ->splice_eof(), to allow splice to indicate the
> premature termination of a splice to struct file_operations and struct
> proto_ops.
>
> This is called if sendfile() or splice() encounters all of the following
> conditions inside splice_direct_to_actor():
>
> (1) the user did not set SPLICE_F_MORE (splice only), and
>
> (2) an EOF condition occurred (->splice_read() returned 0), and
>
> (3) we haven't read enough to fulfill the request (ie. len > 0 still), and
>
> (4) we have already spliced at least one byte.
>
> A further patch will modify the behaviour of SPLICE_F_MORE to always be
> passed to the actor if either the user set it or we haven't yet read
> sufficient data to fulfill the request.

Reviewed-by: Jakub Kicinski <[email protected]>

2023-06-07 17:23:43

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 06/14] tls/device: Use splice_eof() to flush

On Wed, 7 Jun 2023 15:05:51 +0100 David Howells wrote:
> Allow splice to end a TLS record after prematurely ending a splice/sendfile
> due to getting an EOF condition (->splice_read() returned 0) after splice
> had called TLS with a sendmsg() with MSG_MORE set when the user didn't set
> MSG_MORE.

Reviewed-by: Jakub Kicinski <[email protected]>

2023-06-07 17:26:46

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v5 09/14] kcm: Use splice_eof() to flush

Kuniyuki Iwashima <[email protected]> wrote:

> > + if (skb_queue_empty(&sk->sk_write_queue))
>
> nit: would be better to use skb_queue_empty_lockless().

Ok.

David


2023-06-07 17:34:02

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 12/14] tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES

On Wed, 7 Jun 2023 15:05:57 +0100 David Howells wrote:
> Convert tls_sw_sendpage() and tls_sw_sendpage_locked() to use sendmsg()
> with MSG_SPLICE_PAGES rather than directly splicing in the pages itself.
>
> [!] Note that tls_sw_sendpage_locked() appears to have the wrong locking
> upstream. I think the caller will only hold the socket lock, but it
> should hold tls_ctx->tx_lock too.
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.

Reviewed-by: Jakub Kicinski <[email protected]>

2023-06-07 17:40:40

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 14/14] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

On Wed, 7 Jun 2023 15:05:59 +0100 David Howells wrote:
> Convert tls_device_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
> than directly splicing in the pages itself. With that, the tls_iter_offset
> union is no longer necessary and can be replaced with an iov_iter pointer
> and the zc_page argument to tls_push_data() can also be removed.
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.

Acked-by: Jakub Kicinski <[email protected]>

2023-06-07 17:45:17

by David Howells

[permalink] [raw]
Subject: Re: [PATCH net-next v5 14/14] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

Jakub Kicinski <[email protected]> wrote:

> Acked-by: Jakub Kicinski <[email protected]>

Did you mean Acked-by rather than Reviewed-by?


2023-06-07 17:50:39

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 13/14] tls/device: Support MSG_SPLICE_PAGES

On Wed, 7 Jun 2023 15:05:58 +0100 David Howells wrote:
> Make TLS's device sendmsg() support MSG_SPLICE_PAGES. This causes pages to
> be spliced from the source iterator if possible.
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.

Reviewed-by: Jakub Kicinski <[email protected]>

2023-06-07 18:11:41

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v5 14/14] tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

On Wed, 07 Jun 2023 18:35:22 +0100 David Howells wrote:
> Jakub Kicinski <[email protected]> wrote:
>
> > Acked-by: Jakub Kicinski <[email protected]>
>
> Did you mean Acked-by rather than Reviewed-by?

Yeah, looks mostly mechanical, I trust you did the right thing.