If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.
Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.
However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.
That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:
sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
writev(4, [], 0) = 0 <0.000439>
epoll_wait(9, [], 512, 60000) = 0 <60.060430>
gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
close(12) = 0 <0.000063>
close(4) = 0 <0.000091>
In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:
https://github.com/nginx/nginx/commit/5636e7f7b4
Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.
There is prior art for this kind of behavior in filemap_read():
/*
* If we've already successfully copied some data, then we
* can no longer safely return -EIOCBQUEUED. Hence mark
* an async read NOWAIT at that point.
*/
if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
iocb->ki_flags |= IOCB_NOWAIT;
This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().
There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:
- checking whether the input file has O_NONBLOCK doesn't work because
this should be fixed even if the input file is not non-blocking
- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
only whether pipe operations are non-blocking, not whether
file/socket operations are non-blocking
Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files, just like
RWF_NOWAIT does.
---
Changes v1 -> v2:
- value of SPLICE_F_NOWAIT changed to 0x10
- added SPLICE_F_NOWAIT to SPLICE_F_ALL to make it part of uapi
v2 -> v3: repost and rebase on linus/master
Signed-off-by: Max Kellermann <[email protected]>
---
fs/splice.c | 14 ++++++++++++++
include/linux/splice.h | 4 +++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/splice.c b/fs/splice.c
index 60aed8de21f8..6257fef93ec4 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -362,6 +362,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
init_sync_kiocb(&kiocb, in);
kiocb.ki_pos = *ppos;
+ if (flags & SPLICE_F_NOWAIT)
+ kiocb.ki_flags |= IOCB_NOWAIT;
ret = in->f_op->read_iter(&kiocb, &to);
if (ret > 0) {
@@ -1090,6 +1092,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
if (unlikely(ret <= 0))
goto read_failure;
+ /*
+ * After at least one byte was read from the input
+ * file, don't wait for blocking I/O in the following
+ * loop iterations; instead of blocking for arbitrary
+ * amounts of time in the kernel, let userspace decide
+ * how to proceed. This avoids excessive latency if
+ * the output is being consumed faster than the input
+ * file can fill it (e.g. sendfile() from a slow hard
+ * disk to a fast network).
+ */
+ flags |= SPLICE_F_NOWAIT;
+
read_len = ret;
sd->total_len = read_len;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9dec4861d09f..0deb87fb8055 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -21,7 +21,9 @@
#define SPLICE_F_MORE (0x04) /* expect more data */
#define SPLICE_F_GIFT (0x08) /* pages passed in are a gift */
-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+#define SPLICE_F_NOWAIT (0x10) /* do not wait for data which is not immediately available */
+
+#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)
/*
* Passed to the actors
--
2.39.2
On Tue 04-06-24 11:24:31, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
>
> Usually, that is not a problem, because the socket write buffer gets
> filled quickly, and if the socket is non-blocking, the last
> direct_splice_actor() call will return -EAGAIN, causing
> splice_direct_to_actor() to break from the loop, and sendfile() will
> return a partial transfer.
>
> However, if the network happens to be faster than the hard disk, and
> the socket buffer keeps getting drained between two
> generic_file_read_iter() calls, the sendfile() system call can keep
> running for a long time, blocking for disk I/O over and over.
>
> That is undesirable, because it can block the calling process for too
> long. I discovered a problem where nginx would block for so long that
> it would drop the HTTP connection because the kernel had just
> transferred 2 GiB in one call, and the HTTP socket was not writable
> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
>
> sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
> sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
> epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
> gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
> sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
> writev(4, [], 0) = 0 <0.000439>
> epoll_wait(9, [], 512, 60000) = 0 <60.060430>
> gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
> write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
> close(12) = 0 <0.000063>
> close(4) = 0 <0.000091>
>
> In newer nginx versions (since 1.21.4), this problem was worked around
> by defaulting "sendfile_max_chunk" to 2 MiB:
>
> https://github.com/nginx/nginx/commit/5636e7f7b4
Well, I can see your pain but after all the kernel does exactly what
userspace has asked for?
> Instead of asking userspace to provide an artificial upper limit, I'd
> like the kernel to block for disk I/O at most once, and then pass back
> control to userspace.
>
> There is prior art for this kind of behavior in filemap_read():
>
> /*
> * If we've already successfully copied some data, then we
> * can no longer safely return -EIOCBQUEUED. Hence mark
> * an async read NOWAIT at that point.
> */
> if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
> iocb->ki_flags |= IOCB_NOWAIT;
Do note the IOCB_WAITQ test - this means that the request is coming from
io_uring and that has a retry logic to handle short reads. I'm really
nervous about changing sendfile behavior to unconditionally returning short
writes. In particular see this in do_sendfile():
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
We have been through these discussions in the past and so far we have
always decided the chances for userspace breakage are too big. After all
there's no substantial difference between userspace issuing a 2GB read(2) and
2GB sendfile(2). In both cases you are going to be blocked for a long
time and there are too many userspace applications that depend on this
behavior...
So I don't think we want to change the current behavior. We could create a
new interface to provide the semantics you want (like a flag to sendfile(2)
- which would mean a new syscall) but I'm wondering whether these days
io_uring isn't a better answer if you want to more closely control IO
behavior of your application and are willing to change used interface.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Jun 4, 2024 at 12:41 PM Jan Kara <[email protected]> wrote:
> Well, I can see your pain but after all the kernel does exactly what
> userspace has asked for?
That is a valid point of view; indeed the kernel's behavior is correct
according to the specification, but that was not my point.
This is about an exotic problem that occurs only in very rare
circumstances (depending on hard disk speed, network speed and
timing), but when it occurs, it blocks the calling process for a very
long time, which can then cause problems more serious than user
unhappiness (e.g. expiring timeouts). (As I said, nginx had to work
around this problem.)
I'd like to optimize this special case, and adjust the kernel to
always behave like the common case.
> After all there's no substantial difference between userspace issuing a 2GB read(2) and 2GB sendfile(2).
I understand your fear of breaking userspace, but this doesn't apply
here, because yes, there is indeed a substantial difference: in the
normal case, sendfile() stops when the destination socket buffer is
full. That is the normal mode of operation, which all applications
must be prepared for, because short sendfile() calls happen all the
time, that's the common case.
My patch is ONLY about fixing that exotic special case where the
socket buffer is drained over and over while sendfile() still runs.
> there are too many userspace applications that depend on this behavior...
True for read() - but which application depends on this very special
behavior that only occurs in very rare exceptional cases? I think we
have a slight misunderstanding about the circumstances of the problem.
On Tue 04-06-24 13:14:05, Max Kellermann wrote:
> On Tue, Jun 4, 2024 at 12:41 PM Jan Kara <[email protected]> wrote:
> > Well, I can see your pain but after all the kernel does exactly what
> > userspace has asked for?
>
> That is a valid point of view; indeed the kernel's behavior is correct
> according to the specification, but that was not my point.
>
> This is about an exotic problem that occurs only in very rare
> circumstances (depending on hard disk speed, network speed and
> timing), but when it occurs, it blocks the calling process for a very
> long time, which can then cause problems more serious than user
> unhappiness (e.g. expiring timeouts). (As I said, nginx had to work
> around this problem.)
>
> I'd like to optimize this special case, and adjust the kernel to
> always behave like the common case.
>
> > After all there's no substantial difference between userspace issuing a 2GB read(2) and 2GB sendfile(2).
>
> I understand your fear of breaking userspace, but this doesn't apply
> here, because yes, there is indeed a substantial difference: in the
> normal case, sendfile() stops when the destination socket buffer is
> full. That is the normal mode of operation, which all applications
> must be prepared for, because short sendfile() calls happen all the
> time, that's the common case.
>
> My patch is ONLY about fixing that exotic special case where the
> socket buffer is drained over and over while sendfile() still runs.
OK, so that was not clear to me (and this may well be just my ignorance of
networking details). Do you say that your patch changes the behavior only
for this cornercase? Even if the socket fd is blocking? AFAIU with your
patch we'd return short write in that case as well (roughly 64k AFAICT
because that's the amount the internal splice pipe will take) but currently
we block waiting for more space in the socket bufs?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Jun 4, 2024 at 3:27 PM Jan Kara <[email protected]> wrote:
> OK, so that was not clear to me (and this may well be just my ignorance of
> networking details). Do you say that your patch changes the behavior only
> for this cornercase? Even if the socket fd is blocking? AFAIU with your
> patch we'd return short write in that case as well (roughly 64k AFAICT
> because that's the amount the internal splice pipe will take) but currently
> we block waiting for more space in the socket bufs?
My patch changes only the file-read side, not the socket-write side.
It adds IOCB_NOWAIT for reading from the file, just like
filemap_read() does. Therefore, it does not matter whether the socket
is non-blocking.
But thanks for the reply - this was very helpful input for me because
I have to admit that part of my explanation was wrong:
I misunderstood how sending to a blocking socket works. I thought that
send() and sendfile() would return after sending at least one byte
(only recv() works that way), but in fact both block until everything
has been submitted. That is a rather embarrassing misunderstanding of
socket basics, but, uh, it just shows I've never really used blocking
sockets!
That means my patch can indeed change the behavior of sendfile() in a
way that might surprise (badly written) applications and should NOT be
merged as-is.
Your concerns were correct and thanks again!
That leaves me wondering how to solve this. Of course, io_uring is the
proper solution, but that part of my software isn't ready for io_uring
yet.
I could change this to only use IOCB_NOWAIT if the destination is
non-blocking, but something about this sounds wrong - it changes the
read side just because the write side is non-blocking.
We can't change the behavior out of fear of breaking applications; but
can we have a per-file flag so applications can opt into partial
reads/writes? This would be useful for all I/O on regular files (and
sockets and everything else). There would not be any guarantees, just
allowing the kernel to use relaxed semantics for those who can deal
with partial I/O.
Maybe I'm overthinking things and I should just fast-track full
io_uring support in my code...
Max
On Tue 04-06-24 21:24:14, Max Kellermann wrote:
> On Tue, Jun 4, 2024 at 3:27 PM Jan Kara <[email protected]> wrote:
> > OK, so that was not clear to me (and this may well be just my ignorance of
> > networking details). Do you say that your patch changes the behavior only
> > for this cornercase? Even if the socket fd is blocking? AFAIU with your
> > patch we'd return short write in that case as well (roughly 64k AFAICT
> > because that's the amount the internal splice pipe will take) but currently
> > we block waiting for more space in the socket bufs?
>
> My patch changes only the file-read side, not the socket-write side.
> It adds IOCB_NOWAIT for reading from the file, just like
> filemap_read() does. Therefore, it does not matter whether the socket
> is non-blocking.
>
> But thanks for the reply - this was very helpful input for me because
> I have to admit that part of my explanation was wrong:
> I misunderstood how sending to a blocking socket works. I thought that
> send() and sendfile() would return after sending at least one byte
> (only recv() works that way), but in fact both block until everything
> has been submitted.
Yeah, this was exactly what I was trying to point at...
> I could change this to only use IOCB_NOWAIT if the destination is
> non-blocking, but something about this sounds wrong - it changes the
> read side just because the write side is non-blocking.
> We can't change the behavior out of fear of breaking applications; but
> can we have a per-file flag so applications can opt into partial
> reads/writes? This would be useful for all I/O on regular files (and
> sockets and everything else). There would not be any guarantees, just
> allowing the kernel to use relaxed semantics for those who can deal
> with partial I/O.
> Maybe I'm overthinking things and I should just fast-track full
> io_uring support in my code...
Adding open flags is messy (because they are different on different
architectures AFAIR 8-|) and I'm not sure we have much left. In principle
it is possible and I agree it seems useful. But I'm not sure how widespread
use it would get as, as you write above, these days the time may be better
spent on converting code with such needs to io_uring...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR