2022-11-05 10:01:01

by Max Kellermann

Subject: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

From: Max Kellermann <[email protected]>

If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
writev(4, [], 0) = 0 <0.000439>
epoll_wait(9, [], 512, 60000) = 0 <60.060430>
gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
close(12) = 0 <0.000063>
close(4) = 0 <0.000091>
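
For illustration, here is a minimal sketch of the event-loop pattern
that runs into this (hypothetical code, not taken from nginx; the
socket/epoll setup and most error handling are omitted):

#include <sys/types.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <errno.h>

/*
 * Hypothetical fragment: sock_fd is a non-blocking TCP socket
 * registered for EPOLLOUT on ep_fd, file_fd is a regular file of
 * file_size bytes.
 */
static void serve_file(int ep_fd, int sock_fd, int file_fd, off_t file_size)
{
        off_t off = 0;

        while (off < file_size) {
                ssize_t n = sendfile(sock_fd, file_fd, &off, file_size - off);

                if (n < 0 && errno == EAGAIN) {
                        /* socket buffer full: wait up to 60 s for EPOLLOUT */
                        struct epoll_event ev;

                        if (epoll_wait(ep_fd, &ev, 1, 60000) == 0)
                                return; /* timeout: drop the connection */
                        continue;
                }
                if (n <= 0)
                        return;

                /*
                 * The sendfile() call above may itself block for many
                 * seconds, reading from disk over and over, as long as
                 * the socket keeps draining.
                 */
        }
}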

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

        /*
         * If we've already successfully copied some data, then we
         * can no longer safely return -EIOCBQUEUED. Hence mark
         * an async read NOWAIT at that point.
         */
        if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
                iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
only whether pipe operations are non-blocking, not whether
file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files. For now, I
have documented the flag to be kernel-internal with a high bit, like
io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
system call ABI may be a good idea as well.

To: Alexander Viro <[email protected]>
To: [email protected]
To: [email protected]
Signed-off-by: Max Kellermann <[email protected]>
---
fs/splice.c | 14 ++++++++++++++
include/linux/splice.h | 6 ++++++
2 files changed, 20 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 0878b852b355..7a8d5fee0965 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -306,6 +306,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
         iov_iter_pipe(&to, READ, pipe, len);
         init_sync_kiocb(&kiocb, in);
         kiocb.ki_pos = *ppos;
+        if (flags & SPLICE_F_NOWAIT)
+                kiocb.ki_flags |= IOCB_NOWAIT;
         ret = call_read_iter(in, &kiocb, &to);
         if (ret > 0) {
                 *ppos = kiocb.ki_pos;
@@ -866,6 +868,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
                 if (unlikely(ret <= 0))
                         goto out_release;

+                /*
+                 * After at least one byte was read from the input
+                 * file, don't wait for blocking I/O in the following
+                 * loop iterations; instead of blocking for arbitrary
+                 * amounts of time in the kernel, let userspace decide
+                 * how to proceed. This avoids excessive latency if
+                 * the output is being consumed faster than the input
+                 * file can fill it (e.g. sendfile() from a slow hard
+                 * disk to a fast network).
+                 */
+                flags |= SPLICE_F_NOWAIT;
+
                 read_len = ret;
                 sd->total_len = read_len;

diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..14021bba7829 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@

#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)

+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT (1U << 30)
+
/*
* Passed to the actors
*/
--
2.30.2



2022-11-08 08:27:28

by Christoph Hellwig

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

On Sat, Nov 05, 2022 at 10:04:21AM +0100, Max Kellermann wrote:
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice(). It translates to IOCB_NOWAIT for regular files.

This looks reasonable to me and matches the read/write side.

>
> For now, I
> have documented the flag to be kernel-internal with a high bit, like
> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
> system call ABI may be a good idea as well.

Yeah, my only comment here is that I see no reason to make this
purely kernel internal.

And while looking at that: does anyone remember why the (public)
SPLICE_F_* aren't in a uapi header?

2022-11-20 13:42:05

by kernel test robot

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

Greetings,

FYI, we noticed a 3388.6% improvement of phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s due to commit:

commit: 86f00c46806e19ccce7fa238fbf3aaf0f1f2f531 ("[PATCH] fs/splice: don't block splice_direct_to_actor() after data was read")
url: https://github.com/intel-lab-lkp/linux/commits/Max-Kellermann/fs-splice-don-t-block-splice_direct_to_actor-after-data-was-read/20221105-171212
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git b208b9fbbcba743fb269d15cb46a4036b01936b1
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
with the following parameters:

test: stress-ng-1.3.1
option_a: SENDFILE
cpufreq_governor: performance

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available; it provides an extensible framework to which new tests can easily be added.
test-url: http://www.phoronix-test-suite.com/


Details are as below:

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-11/performance/x86_64-rhel-8.3/SENDFILE/debian-x86_64-phoronix/lkp-csl-2sp7/stress-ng-1.3.1/phoronix-test-suite

commit:
b208b9fbbc ("Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux")
86f00c4680 ("fs/splice: don't block splice_direct_to_actor() after data was read")

b208b9fbbcba743f 86f00c46806e19ccce7fa238fbf
---------------- ---------------------------
%stddev %change %stddev
\ | \
105881 ? 10% +3388.6% 3693772 ? 13% phoronix-test-suite.stress-ng.SENDFILE.bogo_ops_s
2.16 ? 7% +1117.6% 26.36 ? 13% phoronix-test-suite.time.user_time
0.20 +0.2 0.40 ? 4% mpstat.cpu.all.usr%
97.92 ? 6% +14.9% 112.48 ? 5% sched_debug.cfs_rq:/.util_est_enqueued.stddev
2717 +2.5% 2783 turbostat.Bzy_MHz
254.74 -4.6% 243.11 turbostat.PkgWatt
2589095 ? 18% -37.5% 1617492 ? 28% numa-meminfo.node0.FilePages
788063 ? 62% +123.2% 1758697 ? 26% numa-meminfo.node1.FilePages
992948 ? 48% +92.6% 1912574 ? 24% numa-meminfo.node1.Inactive
1803548 ? 28% +52.3% 2746365 ? 17% numa-meminfo.node1.MemUsed
4.443e+09 ? 12% -36.0% 2.846e+09 ? 18% numa-numastat.node0.local_node
4.429e+09 ? 12% -35.8% 2.842e+09 ? 18% numa-numastat.node0.numa_hit
5.558e+09 ? 8% -54.2% 2.548e+09 ? 24% numa-numastat.node1.local_node
5.539e+09 ? 8% -54.0% 2.545e+09 ? 24% numa-numastat.node1.numa_hit
9.968e+09 ? 10% -46.0% 5.388e+09 ? 13% proc-vmstat.numa_hit
1e+10 ? 10% -46.1% 5.393e+09 ? 13% proc-vmstat.numa_local
9.759e+09 ? 10% -45.5% 5.32e+09 ? 13% proc-vmstat.pgalloc_normal
9.759e+09 ? 10% -45.5% 5.32e+09 ? 13% proc-vmstat.pgfree
647020 ? 19% -37.5% 404382 ? 28% numa-vmstat.node0.nr_file_pages
4.429e+09 ? 12% -35.8% 2.842e+09 ? 18% numa-vmstat.node0.numa_hit
4.443e+09 ? 12% -36.0% 2.846e+09 ? 18% numa-vmstat.node0.numa_local
196897 ? 62% +123.3% 439599 ? 26% numa-vmstat.node1.nr_file_pages
5.539e+09 ? 8% -54.0% 2.545e+09 ? 24% numa-vmstat.node1.numa_hit
5.558e+09 ? 8% -54.2% 2.548e+09 ? 24% numa-vmstat.node1.numa_local
2.602e+10 ? 10% -39.6% 1.573e+10 ? 13% perf-stat.i.branch-instructions
24921395 ? 12% +40.0% 34894184 ? 6% perf-stat.i.branch-misses
1.223e+08 ? 11% -37.8% 76029198 ? 13% perf-stat.i.cache-references
1.92 ? 18% +43.1% 2.75 ? 12% perf-stat.i.cpi
2.05e+11 +2.6% 2.103e+11 perf-stat.i.cpu-cycles
3.522e+10 ? 10% -37.8% 2.192e+10 ? 13% perf-stat.i.dTLB-loads
2.356e+10 ? 10% -40.7% 1.396e+10 ? 13% perf-stat.i.dTLB-stores
2507922 ? 14% +527.4% 15734908 ? 8% perf-stat.i.iTLB-load-misses
1.336e+11 ? 10% -39.6% 8.071e+10 ? 13% perf-stat.i.instructions
50055 ? 12% -90.8% 4589 ? 5% perf-stat.i.instructions-per-iTLB-miss
0.61 ? 10% -35.9% 0.39 ? 10% perf-stat.i.ipc
2135130 +2.7% 2191818 perf-stat.i.metric.GHz
8.845e+08 ? 10% -39.1% 5.387e+08 ? 13% perf-stat.i.metric.M/sec
60647 ? 2% -19.8% 48635 ? 6% perf-stat.i.node-stores
0.10 ? 8% +0.1 0.22 ? 7% perf-stat.overall.branch-miss-rate%
11.06 ? 35% +5.9 16.96 ? 25% perf-stat.overall.cache-miss-rate%
1.55 ? 12% +70.3% 2.65 ? 11% perf-stat.overall.cpi
0.00 ? 41% +0.0 0.00 ? 29% perf-stat.overall.dTLB-store-miss-rate%
86.38 ? 3% +10.8 97.16 perf-stat.overall.iTLB-load-miss-rate%
53972 ? 12% -90.5% 5111 ? 5% perf-stat.overall.instructions-per-iTLB-miss
0.65 ? 11% -41.2% 0.38 ? 12% perf-stat.overall.ipc
2.579e+10 ? 10% -39.5% 1.559e+10 ? 13% perf-stat.ps.branch-instructions
24705183 ? 12% +40.1% 34602922 ? 6% perf-stat.ps.branch-misses
1.212e+08 ? 11% -37.8% 75385416 ? 13% perf-stat.ps.cache-references
2.032e+11 +2.6% 2.085e+11 perf-stat.ps.cpu-cycles
3.491e+10 ? 10% -37.7% 2.173e+10 ? 13% perf-stat.ps.dTLB-loads
2.335e+10 ? 10% -40.7% 1.385e+10 ? 13% perf-stat.ps.dTLB-stores
2486067 ? 14% +527.6% 15601840 ? 8% perf-stat.ps.iTLB-load-misses
1.324e+11 ? 10% -39.5% 8.002e+10 ? 13% perf-stat.ps.instructions
60175 ? 2% -19.8% 48272 ? 6% perf-stat.ps.node-stores
1.529e+13 ? 10% -39.5% 9.254e+12 ? 12% perf-stat.total.instructions
55.97 ? 6% -13.5 42.46 ? 4% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor
55.59 ? 6% -13.4 42.15 ? 4% perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter.generic_file_splice_read
29.58 ? 11% -12.7 16.90 ? 12% perf-profile.calltrace.cycles-pp.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter
57.68 ? 3% -11.7 45.98 perf-profile.calltrace.cycles-pp.do_iter_read.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
17.32 ? 12% -7.4 9.89 ? 14% perf-profile.calltrace.cycles-pp.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
13.20 ? 12% -5.5 7.66 ? 15% perf-profile.calltrace.cycles-pp.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev
10.88 ? 11% -4.6 6.27 ? 9% perf-profile.calltrace.cycles-pp.memset_erms.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
15.34 ? 7% -4.3 11.02 ? 14% perf-profile.calltrace.cycles-pp.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64
15.23 ? 7% -4.3 10.96 ? 14% perf-profile.calltrace.cycles-pp.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct.do_sendfile
15.05 ? 7% -4.2 10.86 ? 14% perf-profile.calltrace.cycles-pp.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor.do_splice_direct
8.26 ? 11% -3.2 5.06 ? 14% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter
8.38 ? 9% -2.4 6.01 ? 14% perf-profile.calltrace.cycles-pp.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor
5.64 ? 10% -1.9 3.70 ? 13% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe.iov_iter_zero
99.74 -1.5 98.21 perf-profile.calltrace.cycles-pp.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64
99.75 -1.4 98.36 perf-profile.calltrace.cycles-pp.do_splice_direct.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.30 ? 13% -1.1 1.20 ? 16% perf-profile.calltrace.cycles-pp.alloc_pages.append_pipe.iov_iter_zero.shmem_file_read_iter.do_iter_readv_writev
1.74 ? 12% -0.8 0.96 ? 11% perf-profile.calltrace.cycles-pp.xas_load.__filemap_get_folio.shmem_get_folio_gfp.shmem_file_read_iter.do_iter_readv_writev
99.79 -0.8 99.03 perf-profile.calltrace.cycles-pp.do_sendfile.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.52 ? 12% -0.7 0.79 ? 15% perf-profile.calltrace.cycles-pp.free_unref_page_commit.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor
4.13 ? 3% -0.7 3.46 ? 13% perf-profile.calltrace.cycles-pp.generic_pipe_buf_release.__splice_from_pipe.splice_from_pipe.direct_splice_actor.splice_direct_to_actor
1.01 ? 11% -0.6 0.40 ? 72% perf-profile.calltrace.cycles-pp.free_pcp_prepare.free_unref_page.__splice_from_pipe.splice_from_pipe.direct_splice_actor
99.80 -0.4 99.36 perf-profile.calltrace.cycles-pp.__x64_sys_sendfile64.do_syscall_64.entry_SYSCALL_64_after_hwframe
99.82 -0.4 99.47 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.79 ? 10% -0.3 0.48 ? 45% perf-profile.calltrace.cycles-pp.__list_del_entry_valid.rmqueue.get_page_from_freelist.__alloc_pages.append_pipe
0.90 ? 7% -0.3 0.59 ? 4% perf-profile.calltrace.cycles-pp.__might_resched.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.ovl_read_iter
99.83 -0.3 99.52 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
83.75 +2.5 86.27 ? 2% perf-profile.calltrace.cycles-pp.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile.__x64_sys_sendfile64
83.42 +2.5 85.97 ? 2% perf-profile.calltrace.cycles-pp.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct.do_sendfile
11.81 ? 13% +6.4 18.22 perf-profile.calltrace.cycles-pp.override_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
13.34 ? 14% +7.7 21.02 ? 7% perf-profile.calltrace.cycles-pp.revert_creds.ovl_read_iter.generic_file_splice_read.splice_direct_to_actor.do_splice_direct
55.98 ? 6% -13.5 42.46 ? 4% perf-profile.children.cycles-pp.do_iter_readv_writev
55.85 ? 6% -13.5 42.34 ? 4% perf-profile.children.cycles-pp.shmem_file_read_iter
29.70 ? 11% -12.7 16.96 ? 12% perf-profile.children.cycles-pp.iov_iter_zero
57.69 ? 3% -11.7 45.98 perf-profile.children.cycles-pp.do_iter_read
17.38 ? 12% -7.5 9.93 ? 14% perf-profile.children.cycles-pp.append_pipe
13.64 ? 12% -5.7 7.90 ? 15% perf-profile.children.cycles-pp.__alloc_pages
10.94 ? 10% -4.6 6.30 ? 9% perf-profile.children.cycles-pp.memset_erms
15.34 ? 7% -4.3 11.02 ? 14% perf-profile.children.cycles-pp.direct_splice_actor
15.24 ? 7% -4.3 10.96 ? 14% perf-profile.children.cycles-pp.splice_from_pipe
15.13 ? 7% -4.2 10.90 ? 14% perf-profile.children.cycles-pp.__splice_from_pipe
8.36 ? 11% -3.2 5.11 ? 14% perf-profile.children.cycles-pp.get_page_from_freelist
8.62 ? 9% -2.5 6.13 ? 14% perf-profile.children.cycles-pp.free_unref_page
5.76 ? 10% -2.0 3.76 ? 13% perf-profile.children.cycles-pp.rmqueue
99.74 -1.5 98.22 perf-profile.children.cycles-pp.splice_direct_to_actor
99.75 -1.4 98.36 perf-profile.children.cycles-pp.do_splice_direct
2.39 ? 13% -1.1 1.24 ? 16% perf-profile.children.cycles-pp.alloc_pages
2.63 ? 9% -1.0 1.65 ? 8% perf-profile.children.cycles-pp.__might_resched
1.84 ? 12% -0.8 1.02 ? 12% perf-profile.children.cycles-pp.xas_load
1.61 ? 12% -0.8 0.84 ? 14% perf-profile.children.cycles-pp.free_unref_page_commit
99.79 -0.7 99.04 perf-profile.children.cycles-pp.do_sendfile
4.18 ? 3% -0.7 3.49 ? 13% perf-profile.children.cycles-pp.generic_pipe_buf_release
1.42 ? 11% -0.6 0.80 ? 8% perf-profile.children.cycles-pp.__cond_resched
1.06 ? 11% -0.5 0.59 ? 14% perf-profile.children.cycles-pp.free_pcp_prepare
0.94 ? 14% -0.4 0.50 ? 15% perf-profile.children.cycles-pp.__might_sleep
99.80 -0.4 99.37 perf-profile.children.cycles-pp.__x64_sys_sendfile64
0.75 ? 15% -0.4 0.38 ? 16% perf-profile.children.cycles-pp.policy_node
0.79 ? 12% -0.4 0.42 ? 15% perf-profile.children.cycles-pp.__folio_put
0.90 ? 10% -0.4 0.54 ? 12% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
99.95 -0.3 99.64 perf-profile.children.cycles-pp.do_syscall_64
99.95 -0.3 99.68 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
0.58 ? 12% -0.3 0.31 ? 10% perf-profile.children.cycles-pp.rcu_all_qs
0.62 ? 14% -0.3 0.36 ? 12% perf-profile.children.cycles-pp.xas_start
0.80 ? 10% -0.2 0.56 ? 10% perf-profile.children.cycles-pp.__list_del_entry_valid
0.40 ? 17% -0.2 0.20 ? 17% perf-profile.children.cycles-pp._find_first_bit
0.32 ? 14% -0.2 0.16 ? 17% perf-profile.children.cycles-pp.__list_add_valid
0.34 ? 12% -0.1 0.20 ? 8% perf-profile.children.cycles-pp.sanity
0.32 ? 11% -0.1 0.18 ? 14% perf-profile.children.cycles-pp.__page_cache_release
0.27 ? 12% -0.1 0.14 ? 15% perf-profile.children.cycles-pp.should_fail_alloc_page
0.25 ? 12% -0.1 0.13 ? 17% perf-profile.children.cycles-pp.__mem_cgroup_uncharge
0.20 ? 14% -0.1 0.10 ? 13% perf-profile.children.cycles-pp.policy_nodemask
0.12 ? 15% -0.0 0.07 ? 12% perf-profile.children.cycles-pp.pipe_to_null
0.04 ? 45% +0.0 0.07 perf-profile.children.cycles-pp.__libc_write
0.05 ? 7% +0.0 0.08 ? 11% perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
0.04 ? 71% +0.0 0.07 perf-profile.children.cycles-pp.generic_file_write_iter
0.04 ? 71% +0.0 0.07 ? 5% perf-profile.children.cycles-pp.generic_perform_write
0.04 ? 71% +0.0 0.07 perf-profile.children.cycles-pp.__generic_file_write_iter
0.00 +0.1 0.05 ? 8% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.06 ? 11% perf-profile.children.cycles-pp.ovl_d_real
0.00 +0.1 0.07 ? 14% perf-profile.children.cycles-pp.__put_user_nocheck_8
0.16 ? 6% +0.1 0.23 ? 2% perf-profile.children.cycles-pp.current_time
0.00 +0.1 0.08 ? 14% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.28 ? 8% +0.1 0.37 ? 4% perf-profile.children.cycles-pp.atime_needs_update
0.09 ? 14% +0.1 0.18 ? 15% perf-profile.children.cycles-pp.__fsnotify_parent
0.35 ? 9% +0.1 0.45 ? 6% perf-profile.children.cycles-pp.touch_atime
0.00 +0.1 0.12 ? 12% perf-profile.children.cycles-pp.__might_fault
0.00 +0.1 0.14 ? 14% perf-profile.children.cycles-pp._copy_from_user
0.00 +0.1 0.14 ? 4% perf-profile.children.cycles-pp.__entry_text_start
0.00 +0.2 0.16 ? 19% perf-profile.children.cycles-pp.__fdget
83.76 +2.5 86.28 ? 2% perf-profile.children.cycles-pp.generic_file_splice_read
83.44 +2.5 85.99 ? 2% perf-profile.children.cycles-pp.ovl_read_iter
11.82 ? 13% +6.4 18.26 perf-profile.children.cycles-pp.override_creds
13.35 ? 14% +7.7 21.06 ? 7% perf-profile.children.cycles-pp.revert_creds
10.81 ? 10% -4.6 6.23 ? 9% perf-profile.self.cycles-pp.memset_erms
3.26 ? 13% -1.5 1.74 ? 17% perf-profile.self.cycles-pp.__alloc_pages
3.02 ? 12% -1.4 1.64 ? 14% perf-profile.self.cycles-pp.free_unref_page
2.96 ? 12% -1.4 1.60 ? 14% perf-profile.self.cycles-pp.rmqueue
2.52 ? 13% -1.2 1.30 ? 16% perf-profile.self.cycles-pp.get_page_from_freelist
2.58 ? 9% -1.0 1.62 ? 8% perf-profile.self.cycles-pp.__might_resched
1.46 ? 13% -0.7 0.78 ? 18% perf-profile.self.cycles-pp.alloc_pages
4.11 ? 3% -0.7 3.44 ? 13% perf-profile.self.cycles-pp.generic_pipe_buf_release
1.40 ? 13% -0.6 0.77 ? 15% perf-profile.self.cycles-pp.__splice_from_pipe
1.29 ? 12% -0.6 0.68 ? 14% perf-profile.self.cycles-pp.free_unref_page_commit
1.37 ? 10% -0.6 0.79 ? 9% perf-profile.self.cycles-pp.append_pipe
1.21 ? 12% -0.5 0.67 ? 11% perf-profile.self.cycles-pp.xas_load
1.02 ? 12% -0.5 0.54 ? 14% perf-profile.self.cycles-pp.iov_iter_zero
1.54 ? 8% -0.5 1.08 ? 3% perf-profile.self.cycles-pp.shmem_file_read_iter
0.99 ? 11% -0.4 0.55 ? 14% perf-profile.self.cycles-pp.free_pcp_prepare
0.80 ? 13% -0.4 0.43 ? 15% perf-profile.self.cycles-pp.__might_sleep
0.82 ? 9% -0.3 0.49 ? 5% perf-profile.self.cycles-pp.__cond_resched
0.78 ? 10% -0.3 0.46 ? 12% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.56 ? 13% -0.2 0.32 ? 12% perf-profile.self.cycles-pp.xas_start
0.75 ? 10% -0.2 0.54 ? 11% perf-profile.self.cycles-pp.__list_del_entry_valid
0.44 ? 13% -0.2 0.23 ? 13% perf-profile.self.cycles-pp.rcu_all_qs
0.34 ? 16% -0.2 0.17 ? 17% perf-profile.self.cycles-pp._find_first_bit
0.36 ? 15% -0.2 0.18 ? 16% perf-profile.self.cycles-pp.policy_node
0.32 ? 12% -0.2 0.17 ? 15% perf-profile.self.cycles-pp.__page_cache_release
0.34 ? 12% -0.1 0.19 ? 8% perf-profile.self.cycles-pp.sanity
0.28 ? 14% -0.1 0.14 ? 19% perf-profile.self.cycles-pp.__list_add_valid
0.24 ? 13% -0.1 0.12 ? 15% perf-profile.self.cycles-pp.__mem_cgroup_uncharge
0.18 ? 14% -0.1 0.10 ? 14% perf-profile.self.cycles-pp.__folio_put
0.16 ? 12% -0.1 0.08 ? 12% perf-profile.self.cycles-pp.policy_nodemask
0.12 ? 9% -0.1 0.06 ? 17% perf-profile.self.cycles-pp.should_fail_alloc_page
0.10 ? 13% -0.0 0.06 ? 13% perf-profile.self.cycles-pp.direct_splice_actor
0.10 ? 14% -0.0 0.06 ? 16% perf-profile.self.cycles-pp.splice_from_pipe
0.09 ? 14% -0.0 0.06 ? 14% perf-profile.self.cycles-pp.pipe_to_null
0.10 ? 10% +0.0 0.14 ? 6% perf-profile.self.cycles-pp.current_time
0.03 ? 70% +0.0 0.08 ? 12% perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
0.00 +0.1 0.05 ? 8% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.07 ? 15% +0.1 0.13 ? 10% perf-profile.self.cycles-pp.generic_file_splice_read
0.00 +0.1 0.06 ? 11% perf-profile.self.cycles-pp.ovl_d_real
0.00 +0.1 0.07 ? 16% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.00 +0.1 0.07 ? 14% perf-profile.self.cycles-pp.__put_user_nocheck_8
0.09 ? 14% +0.1 0.18 ? 15% perf-profile.self.cycles-pp.__fsnotify_parent
0.00 +0.1 0.14 ? 4% perf-profile.self.cycles-pp.__entry_text_start
0.00 +0.2 0.15 ? 20% perf-profile.self.cycles-pp.__fdget
0.11 ? 14% +0.2 0.30 ? 14% perf-profile.self.cycles-pp.splice_direct_to_actor
0.00 +0.2 0.24 ? 11% perf-profile.self.cycles-pp.do_sendfile
11.40 ? 4% +1.8 13.16 ? 3% perf-profile.self.cycles-pp.__filemap_get_folio
11.73 ? 13% +6.4 18.12 perf-profile.self.cycles-pp.override_creds
13.25 ? 14% +7.6 20.88 ? 7% perf-profile.self.cycles-pp.revert_creds



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (21.58 kB)
config-6.1.0-rc3-00289-g86f00c46806e (168.50 kB)
job-script (7.75 kB)
job.yaml (4.90 kB)
reproduce (311.00 B)

2023-09-19 14:53:58

by Christian Brauner

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

[+Cc Jens]

On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
>
> Usually, that is not a problem, because the socket write buffer gets
> filled quickly, and if the socket is non-blocking, the last
> direct_splice_actor() call will return -EAGAIN, causing
> splice_direct_to_actor() to break from the loop, and sendfile() will
> return a partial transfer.
>
> However, if the network happens to be faster than the hard disk, and
> the socket buffer keeps getting drained between two
> generic_file_read_iter() calls, the sendfile() system call can keep
> running for a long time, blocking for disk I/O over and over.
>
> That is undesirable, because it can block the calling process for too
> long. I discovered a problem where nginx would block for so long that
> it would drop the HTTP connection because the kernel had just
> transferred 2 GiB in one call, and the HTTP socket was not writable
> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
>
> sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
> sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
> epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
> gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
> sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
> writev(4, [], 0) = 0 <0.000439>
> epoll_wait(9, [], 512, 60000) = 0 <60.060430>
> gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
> write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
> close(12) = 0 <0.000063>
> close(4) = 0 <0.000091>
>
> In newer nginx versions (since 1.21.4), this problem was worked around
> by defaulting "sendfile_max_chunk" to 2 MiB:
>
> https://github.com/nginx/nginx/commit/5636e7f7b4
>
> Instead of asking userspace to provide an artificial upper limit, I'd
> like the kernel to block for disk I/O at most once, and then pass back
> control to userspace.
>
> There is prior art for this kind of behavior in filemap_read():
>
> /*
> * If we've already successfully copied some data, then we
> * can no longer safely return -EIOCBQUEUED. Hence mark
> * an async read NOWAIT at that point.
> */
> if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
> iocb->ki_flags |= IOCB_NOWAIT;
>
> This modifies the caller-provided "struct kiocb", which has an effect
> on repeated filemap_read() calls. This effect however vanishes
> because the "struct kiocb" is not persistent; splice_direct_to_actor()
> doesn't have one, and each generic_file_splice_read() call initializes
> a new one, losing the "IOCB_NOWAIT" flag that was injected by
> filemap_read().
>
> There was no way to make generic_file_splice_read() aware that
> IOCB_NOWAIT was desired because some data had already been transferred
> in a previous call:
>
> - checking whether the input file has O_NONBLOCK doesn't work because
> this should be fixed even if the input file is not non-blocking
>
> - the SPLICE_F_NONBLOCK flag is not appropriate because it affects
> only whether pipe operations are non-blocking, not whether
> file/socket operations are non-blocking
>
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice(). It translates to IOCB_NOWAIT for regular files. For now, I
> have documented the flag to be kernel-internal with a high bit, like
> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
> system call ABI may be a good idea as well.
>
> Signed-off-by: Max Kellermann <[email protected]>
> ---
> fs/splice.c | 14 ++++++++++++++
> include/linux/splice.h | 6 ++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index d983d375ff11..c192321d5e37 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
> iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
> init_sync_kiocb(&kiocb, in);
> kiocb.ki_pos = *ppos;
> + if (flags & SPLICE_F_NOWAIT)
> + kiocb.ki_flags |= IOCB_NOWAIT;
> ret = call_read_iter(in, &kiocb, &to);
>
> if (ret > 0) {
> @@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
> if (unlikely(ret <= 0))
> goto read_failure;
>
> + /*
> + * After at least one byte was read from the input
> + * file, don't wait for blocking I/O in the following
> + * loop iterations; instead of blocking for arbitrary
> + * amounts of time in the kernel, let userspace decide
> + * how to proceed. This avoids excessive latency if
> + * the output is being consumed faster than the input
> + * file can fill it (e.g. sendfile() from a slow hard
> + * disk to a fast network).
> + */
> + flags |= SPLICE_F_NOWAIT;
> +
> read_len = ret;
> sd->total_len = read_len;
>
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index 6c461573434d..abdf94759138 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -23,6 +23,12 @@
>
> #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
>
> +/*
> + * Don't wait for I/O (internal flag for the splice_direct_to_actor()
> + * loop).
> + */
> +#define SPLICE_F_NOWAIT (1U << 30)
> +
> /*
> * Passed to the actors
> */
> --
> 2.39.2
>

2023-09-20 17:38:41

by Jens Axboe

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

On 9/19/23 8:18 AM, Christian Brauner wrote:
> [+Cc Jens]
>
> On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
>> If userspace calls sendfile() with a very large "count" parameter, the
>> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
>> have been read from the hard disk and pushed into the socket buffer.
>>
>> Usually, that is not a problem, because the socket write buffer gets
>> filled quickly, and if the socket is non-blocking, the last
>> direct_splice_actor() call will return -EAGAIN, causing
>> splice_direct_to_actor() to break from the loop, and sendfile() will
>> return a partial transfer.
>>
>> However, if the network happens to be faster than the hard disk, and
>> the socket buffer keeps getting drained between two
>> generic_file_read_iter() calls, the sendfile() system call can keep
>> running for a long time, blocking for disk I/O over and over.
>>
>> That is undesirable, because it can block the calling process for too
>> long. I discovered a problem where nginx would block for so long that
>> it would drop the HTTP connection because the kernel had just
>> transferred 2 GiB in one call, and the HTTP socket was not writable
>> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
>>
>> sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
>> sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
>> epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
>> gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
>> sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
>> writev(4, [], 0) = 0 <0.000439>
>> epoll_wait(9, [], 512, 60000) = 0 <60.060430>
>> gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
>> write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
>> close(12) = 0 <0.000063>
>> close(4) = 0 <0.000091>
>>
>> In newer nginx versions (since 1.21.4), this problem was worked around
>> by defaulting "sendfile_max_chunk" to 2 MiB:
>>
>> https://github.com/nginx/nginx/commit/5636e7f7b4
>>
>> Instead of asking userspace to provide an artificial upper limit, I'd
>> like the kernel to block for disk I/O at most once, and then pass back
>> control to userspace.
>>
>> There is prior art for this kind of behavior in filemap_read():
>>
>> /*
>> * If we've already successfully copied some data, then we
>> * can no longer safely return -EIOCBQUEUED. Hence mark
>> * an async read NOWAIT at that point.
>> */
>> if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
>> iocb->ki_flags |= IOCB_NOWAIT;
>>
>> This modifies the caller-provided "struct kiocb", which has an effect
>> on repeated filemap_read() calls. This effect however vanishes
>> because the "struct kiocb" is not persistent; splice_direct_to_actor()
>> doesn't have one, and each generic_file_splice_read() call initializes
>> a new one, losing the "IOCB_NOWAIT" flag that was injected by
>> filemap_read().
>>
>> There was no way to make generic_file_splice_read() aware that
>> IOCB_NOWAIT was desired because some data had already been transferred
>> in a previous call:
>>
>> - checking whether the input file has O_NONBLOCK doesn't work because
>> this should be fixed even if the input file is not non-blocking
>>
>> - the SPLICE_F_NONBLOCK flag is not appropriate because it affects
>> only whether pipe operations are non-blocking, not whether
>> file/socket operations are non-blocking
>>
>> Since there are no other parameters, I suggest adding the
>> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
>> affects the "non-pipe" file descriptor passed to sendfile() or
>> splice(). It translates to IOCB_NOWAIT for regular files. For now, I
>> have documented the flag to be kernel-internal with a high bit, like
>> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
>> system call ABI may be a good idea as well.

I think adding the flag for this case makes sense, and also exposing it
on the UAPI side. My only concern is full coverage of it. We can't
really have a SPLICE_F_NOWAIT flag that only applies to some cases.

That said, asking for a 2G splice, and getting a 2G splice no matter how
slow it may be, is a bit of a "doctor it hurts when I..." scenario.

--
Jens Axboe

2023-09-21 02:15:56

by Max Kellermann

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <[email protected]> wrote:
> I think adding the flag for this case makes sense, and also exposing it
> on the UAPI side.

OK. I suggest we get this patch merged first, and then I prepare a
patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
lowest free bit), add it to SPLICE_F_ALL and document it.

(If you prefer to have it all in this initial patch, I can amend and
resubmit it with the uapi feature.)

> My only concern is full coverage of it. We can't
> really have a SPLICE_F_NOWAIT flag that only applies to some cases.

The feature is already part of uapi - via RWF_NOWAIT, which maps to
IOCB_NOWAIT, just like my proposed SPLICE_F_NOWAIT flag. The semantics
(and the concerns) are the same, aren't they?
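
For comparison, this is roughly what the existing uapi analogue looks
like from userspace (an illustrative sketch, not from any real code
base; the file descriptor, buffer and offset are assumed to exist):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>
#include <errno.h>

/*
 * Try to read without waiting for data that is not immediately
 * available (e.g. not yet in the page cache); requires a reasonably
 * recent glibc and kernel.
 */
static ssize_t read_nowait(int fd, void *buf, size_t len, off_t off)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n < 0 && errno == EAGAIN) {
                /* nothing could be read without blocking; the caller
                 * decides whether and when to retry */
        }

        return n;
}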

> That said, asking for a 2G splice, and getting a 2G splice no matter how
> slow it may be, is a bit of a "doctor it hurts when I..." scenario.

I understand this argument, but I disagree. Compare recv(socket) with
read(regular_file).
A read(regular_file) must block until the given buffer is filled
completely (or EOF is reached), which is good for some programs which
do not handle partial reads, but other programs might be happy with a
partial read and prefer lower latency. There is preadv2(RWF_NOWAIT),
but if it returns EAGAIN, userspace cannot know when data will be
available, because regular files cannot be polled with epoll(). There
is no way to have a read() return at least one byte but not wait for
more (not even with
preadv2(), unfortunately).
recv(socket) (or reading on a pipe) behaves differently - it blocks
only until at least one byte arrives, and callers must be able to deal
with partial reads. That's good for latency - imagine if recv()
behaved like read(): how much data would you ask the kernel to receive? If
it's too little, you need many system calls; if it's too much, your
process may block indefinitely.

read(regular_file) behaves that way for historical reasons and we
can't change it, only add new APIs like preadv2(); but splice() is a
modern API that we can optimize for how we want it to behave - and
that is: copy as much as the kernel already has, but don't block after
that (in order to avoid huge latencies).

My point is: splice(2G) is a very reasonable thing to do if userspace
wants the kernel to transfer as much as possible with a single system
call, because there's no way for userspace to know what the best
number is, so let's just pass the largest valid value.

Max

2023-09-25 13:22:51

by Christian Brauner

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

On Wed, Sep 20, 2023 at 08:16:04PM +0200, Max Kellermann wrote:
> On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <[email protected]> wrote:
> > I think adding the flag for this case makes sense, and also exposing it
> > on the UAPI side.
>
> OK. I suggest we get this patch merged first, and then I prepare a
> patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> lowest free bit), add it to SPLICE_F_ALL and document it.
>
> (If you prefer to have it all in this initial patch, I can amend and
> resubmit it with the uapi feature.)

Please resend it all in one patch. That's easier for all to review.

2023-09-26 10:22:24

by Max Kellermann

Subject: [PATCH v2] fs/splice: don't block splice_direct_to_actor() after data was read

If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
writev(4, [], 0) = 0 <0.000439>
epoll_wait(9, [], 512, 60000) = 0 <60.060430>
gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
close(12) = 0 <0.000063>
close(4) = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

        /*
         * If we've already successfully copied some data, then we
         * can no longer safely return -EIOCBQUEUED. Hence mark
         * an async read NOWAIT at that point.
         */
        if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
                iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
only whether pipe operations are non-blocking, not whether
file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files, just like
RWF_NOWAIT does.
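
For illustration, this is how the flag could be used from userspace
once it is part of the ABI (a hypothetical sketch; the helper name and
the fallback #define are illustrative assumptions, not part of this
patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <errno.h>

#ifndef SPLICE_F_NOWAIT
#define SPLICE_F_NOWAIT 0x10    /* value proposed by this patch */
#endif

/*
 * Move up to "len" bytes from a regular file into a pipe without
 * waiting for data that is not immediately available, analogous to
 * preadv2() with RWF_NOWAIT.
 */
static ssize_t splice_file_to_pipe_nowait(int file_fd, loff_t *off,
                                          int pipe_wfd, size_t len)
{
        ssize_t n = splice(file_fd, off, pipe_wfd, NULL, len,
                           SPLICE_F_NOWAIT);

        if (n < 0 && errno == EAGAIN) {
                /* nothing could be transferred without blocking; the
                 * caller decides whether and when to retry */
        }

        return n;
}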

Changes v1 -> v2:
- value of SPLICE_F_NOWAIT changed to 0x10
- added SPLICE_F_NOWAIT to SPLICE_F_ALL to make it part of uapi

Cc: Jens Axboe <[email protected]>
Signed-off-by: Max Kellermann <[email protected]>
---
fs/splice.c | 14 ++++++++++++++
include/linux/splice.h | 4 +++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
         iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
         init_sync_kiocb(&kiocb, in);
         kiocb.ki_pos = *ppos;
+        if (flags & SPLICE_F_NOWAIT)
+                kiocb.ki_flags |= IOCB_NOWAIT;
         ret = call_read_iter(in, &kiocb, &to);

         if (ret > 0) {
@@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
                 if (unlikely(ret <= 0))
                         goto read_failure;

+                /*
+                 * After at least one byte was read from the input
+                 * file, don't wait for blocking I/O in the following
+                 * loop iterations; instead of blocking for arbitrary
+                 * amounts of time in the kernel, let userspace decide
+                 * how to proceed. This avoids excessive latency if
+                 * the output is being consumed faster than the input
+                 * file can fill it (e.g. sendfile() from a slow hard
+                 * disk to a fast network).
+                 */
+                flags |= SPLICE_F_NOWAIT;
+
                 read_len = ret;
                 sd->total_len = read_len;

diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..06ce58b1f408 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -21,7 +21,9 @@
#define SPLICE_F_MORE (0x04) /* expect more data */
#define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */

-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+#define SPLICE_F_NOWAIT (0x10) /* do not wait for data which is not immediately available */
+
+#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)

/*
* Passed to the actors
--
2.39.2

2023-09-26 10:39:31

by Christian Brauner

Subject: Re: [PATCH v2] fs/splice: don't block splice_direct_to_actor() after data was read

> + /*
> + * After at least one byte was read from the input
> + * file, don't wait for blocking I/O in the following
> + * loop iterations; instead of blocking for arbitrary
> + * amounts of time in the kernel, let userspace decide
> + * how to proceed. This avoids excessive latency if
> + * the output is being consumed faster than the input
> + * file can fill it (e.g. sendfile() from a slow hard
> + * disk to a fast network).
> + */
> + flags |= SPLICE_F_NOWAIT;
> +

Hm, so the thing that is worrisome about this change is that this may
cause regressions afaict as this is a pretty significant change from
current behavior.

2023-09-26 11:51:21

by Max Kellermann

Subject: Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

On Mon, Sep 25, 2023 at 3:10 PM Christian Brauner <[email protected]> wrote:
> > OK. I suggest we get this patch merged first, and then I prepare a
> > patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> > lowest free bit), add it to SPLICE_F_ALL and document it.
> >
> > (If you prefer to have it all in this initial patch, I can amend and
> > resubmit it with the uapi feature.)
>
> Please resend it all in one patch. That's easier for all to review.

Done. Though I was surprised there's no "uapi" header for splice, so I
only changed the value to 0x10 and added it to SPLICE_F_ALL.

Since the splice manpage is in a different repository, I'll submit a
separate patch for that; can't be one patch.

Max

2023-09-26 12:33:27

by Christian Brauner

Subject: Re: [PATCH v2] fs/splice: don't block splice_direct_to_actor() after data was read

On Tue, Sep 26, 2023 at 12:41:42PM +0200, Max Kellermann wrote:
> On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <[email protected]> wrote:
> > Hm, so the thing that is worrisome about this change is that this may
> > cause regressions afaict as this is a pretty significant change from
> > current behavior.
>
> Would you prefer a new flag for explicitly selecting "wait until at
> least one byte was transferred, but don't wait further"? Because many

I had thought about it, but afaict it'd be rather annoying: one can get
into that code from copy_file_range() as well, so we'd need a new flag
for that system call too.

> applications need this behavior, and some (like nginx) have already
> worked around the problem by limiting the maximum transaction size,
> which I consider a bad workaround, because it leads to unnecessary
> system calls and still doesn't really solve the latency problem.
>
> On the other hand, what exactly would the absence of this flag mean...
> the old behavior, without my patch, can lead to partial transfers, and
> the absence of the flag doesn't mean it can't happen; my patch tackles
> just one corner case, but one that is important for me.
>
> We have been running this patch in production for nearly a year (and
> will continue to do so until upstream kernels have a proper solution)
> and never observed a problem, and I consider it safe, but I
> acknowledge the risk that this may reveal obscure application bugs if
> applied globally to all Linux kernels, so I understand your worries.

I think hanging for an insane amount of time is indeed a problem and
tweaking the code in this way might actually be useful, but we'd need to
let this soak for quite a while to see whether this causes any issues.

@Jens, what do you think? Is this worth it?

2023-09-26 20:23:27

by Max Kellermann

Subject: Re: [PATCH v2] fs/splice: don't block splice_direct_to_actor() after data was read

On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <[email protected]> wrote:
> Hm, so the thing that is worrisome about this change is that this may
> cause regressions afaict as this is a pretty significant change from
> current behavior.

Would you prefer a new flag for explicitly selecting "wait until at
least one byte was transferred, but don't wait further"? Because many
applications need this behavior, and some (like nginx) have already
worked around the problem by limiting the maximum transaction size,
which I consider a bad workaround, because it leads to unnecessary
system calls and still doesn't really solve the latency problem.

On the other hand, what exactly would the absence of this flag mean...
the old behavior, without my patch, can lead to partial transfers, and
the absence of the flag doesn't mean it can't happen; my patch tackles
just one corner case, but one that is important for me.

We have been running this patch in production for nearly a year (and
will continue to do so until upstream kernels have a proper solution)
and never observed a problem, and I consider it safe, but I
acknowledge the risk that this may reveal obscure application bugs if
applied globally to all Linux kernels, so I understand your worries.

Max