Greetings:
Welcome to RFC v2.
This is my first series that touches more than 1 subsystem; hope I got the
various subject lines and to/cc-lists correct.
Based on the feedback on RFC v1 [1], I've made a few changes:
- Removed the indirect calls.
- Simplified the code a bit by pushing logic down to a wrapper around
copyin.
- Added support for the 'MSG_NTCOPY' flag to udp, udp-lite, tcp, and unix.
I think this series is much closer to a v1 that can be submitted for
consideration, but wanted to test the waters with an RFC first :)
This new set of code allows applications to request non-temporal copies on
individual calls to sendmsg for several socket types, not just unix.
The result is that:
1. Users don't need to enable no-cache-copy for the entire interface, as
they had been doing previously with ethtool. There is finer-grained
control over which sendmsg calls are non-temporal. I think it makes sense
for this to be application specific (vs interface-wide) since applications
will have a better idea of which copy is appropriate.
2. Previously, the ethtool bit for enabling no-cache-copy only seems to have
affected TCP sockets, IIUC. This series supports UDP, UDP-Lite, TCP, and
Unix. This means the behavior and accessibility of non-temporal copies
is normalized a bit more than it had been previously.
The performance results on my AMD Zen2 test system are identical to the
previous RFC, so I've included those results below.
As the results below show, NT copies in the unix write path have a large,
measurable impact on certain application architectures and CPUs.
Initial benchmarks are extremely encouraging. I wrote a simple C program to
benchmark this patchset. The program:
- Creates a unix socket pair
- Forks a child process
- The parent process writes to the unix socket using MSG_NTCOPY, or not,
depending on the command line flags
- The child process uses splice to move the data from the unix socket to
a pipe buffer, followed by a second splice call to move the data from
the pipe buffer to a file descriptor opened on /dev/null.
- taskset is used when launching the benchmark to ensure the parent and
child run on appropriate CPUs for various scenarios
The source of the test program is available for examination [2] and results
for three benchmarks I ran are provided below.
Test system: AMD EPYC 7662 64-Core Processor,
             64 cores / 128 threads,
             512 KB L2 per core shared by sibling CPUs,
             16 MB L3 per NUMA zone,
             AMD specific settings: NPS=1 and L3 as NUMA enabled
Test: 1048576 byte object,
      100,000 iterations,
      512 KB pipe buffer size,
      512 KB unix socket send buffer size
Sample command lines for running the tests provided below. Note that the
command line shows how to run a "normal" copy benchmark. To run the
benchmark in MSG_NTCOPY mode, change command line argument 3 from 0 to 1.
Test pinned to CPUs 1 and 2 which do *not* share an L2 cache, but do share
an L3.
Command line for "normal" copy:
% time taskset -ac 1,2 ./unix-nt-bench 1048576 100000 0 524288 524288
Mode            real time (sec.)   throughput (Mb/s)
"Normal" copy   10.630             78,928
MSG_NTCOPY      7.429              112,935
Same test as above, but pinned to CPUs 1 and 65 which share an L2 (512 KB)
and L3 cache (16 MB).
Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 524288 524288
Mode            real time (sec.)   throughput (Mb/s)
"Normal" copy   12.532             66,941
MSG_NTCOPY      9.445              88,826
Same test as above, pinned to CPUs 1 and 65, but with 128 KB unix send
buffer and pipe buffer sizes (to avoid spilling L2).
Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 131072 131072
Mode            real time (sec.)   throughput (Mb/s)
"Normal" copy   12.451             67,377
MSG_NTCOPY      9.451              88,768
Thanks,
Joe
[1]: https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
[2]: https://gist.githubusercontent.com/jdamato-fsly/03a2f0cd4e71ebe0fef97f7f2980d9e5/raw/19cfd3aca59109ebf5b03871d952ea1360f3e982/unix-nt-copy-bench.c
Joe Damato (8):
  arch, x86, uaccess: Add nontemporal copy functions
  iov_iter: Introduce iter_copy_type
  iov_iter: add copyin_iovec helper
  net: Add MSG_NTCOPY sendmsg flag
  net: unix: Support MSG_NTCOPY
  net: ip: Support MSG_NTCOPY
  net: udplite: Support MSG_NTCOPY
  net: tcp: Support MSG_NTCOPY

 arch/x86/include/asm/uaccess_64.h |  6 ++++++
 include/linux/socket.h            |  9 +++++++++
 include/linux/uaccess.h           |  6 ++++++
 include/linux/uio.h               | 17 +++++++++++++++++
 include/net/sock.h                |  2 +-
 include/net/udplite.h             |  1 +
 lib/iov_iter.c                    | 25 ++++++++++++++++++++-----
 net/ipv4/ip_output.c              |  1 +
 net/ipv4/tcp.c                    |  2 ++
 net/unix/af_unix.c                |  4 ++++
 10 files changed, 67 insertions(+), 6 deletions(-)
--
2.7.4
Support non-temporal copies in the TCP sendmsg path. Previously, the only
way to enable non-temporal copies was to enable them for the entire
interface (via ethtool).
This change allows user programs to request non-temporal copies for
specific sendmsg calls.
Signed-off-by: Joe Damato <[email protected]>
---
 include/net/sock.h | 2 +-
 net/ipv4/tcp.c     | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 0063e84..b666ecd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2200,7 +2200,7 @@ static inline int skb_do_copy_data_nocache(struct sock *sk, struct sk_buff *skb,
if (!csum_and_copy_from_iter_full(to, copy, &csum, from))
return -EFAULT;
skb->csum = csum_block_add(skb->csum, csum, offset);
- } else if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
+ } else if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY || iov_iter_copy_is_nt(from)) {
if (!copy_from_iter_full_nocache(to, copy, from))
return -EFAULT;
} else if (!copy_from_iter_full(to, copy, from))
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 14ebb4e..5b36e00 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1201,6 +1201,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
flags = msg->msg_flags;
+ msg_set_iter_copy_type(msg);
+
if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
skb = tcp_write_queue_tail(sk);
uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
--
2.7.4