2020-02-14 16:02:12

by Sasha Levin

[permalink] [raw]
Subject: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

From: Linus Torvalds <[email protected]>

[ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]

This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

#include <unistd.h>

int main(int argc, char **argv)
{
int fd[2], counters[2];

pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));

/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();

do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches

14.621273891 seconds time elapsed

0.018243000 seconds user
3.611468000 seconds sys

before this commit.

After this commit, I get

5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches

1.702148350 seconds time elapsed

0.004868000 seconds user
0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").

But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]

Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.

All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"

Tested-by: Josh Triplett <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
fs/coredump.c | 4 +--
fs/pipe.c | 67 +++++++++++++++++++++++++--------------
fs/splice.c | 8 ++---
include/linux/pipe_fs_i.h | 2 +-
4 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index b1ea7dfbd1494..f8296a82d01df 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file)
pipe_lock(pipe);
pipe->readers++;
pipe->writers--;
- wake_up_interruptible_sync(&pipe->wait);
+ wake_up_interruptible_sync(&pipe->rd_wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
pipe_unlock(pipe);

@@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file)
* We actually want wait_event_freezable() but then we need
* to clear TIF_SIGPENDING and improve dump_interrupted().
*/
- wait_event_interruptible(pipe->wait, pipe->readers == 1);
+ wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);

pipe_lock(pipe);
pipe->readers--;
diff --git a/fs/pipe.c b/fs/pipe.c
index 57502c3c0fba1..5a34d6c22d4ce 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
/* Drop the inode semaphore and wait for a pipe event, atomically */
void pipe_wait(struct pipe_inode_info *pipe)
{
- DEFINE_WAIT(wait);
+ DEFINE_WAIT(rdwait);
+ DEFINE_WAIT(wrwait);

/*
* Pipes are system-local resources, so sleeping on them
* is considered a noninteractive wait:
*/
- prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
+ prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
+ prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
pipe_unlock(pipe);
schedule();
- finish_wait(&pipe->wait, &wait);
+ finish_wait(&pipe->rd_wait, &rdwait);
+ finish_wait(&pipe->wr_wait, &wrwait);
pipe_lock(pipe);
}

@@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
size_t total_len = iov_iter_count(to);
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
- bool was_full;
+ bool was_full, wake_next_reader = false;
ssize_t ret;

/* Null read succeeds. */
@@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)

if (!buf->len) {
pipe_buf_release(pipe, buf);
- spin_lock_irq(&pipe->wait.lock);
+ spin_lock_irq(&pipe->rd_wait.lock);
tail++;
pipe->tail = tail;
- spin_unlock_irq(&pipe->wait.lock);
+ spin_unlock_irq(&pipe->rd_wait.lock);
}
total_len -= chars;
if (!total_len)
@@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* no data.
*/
if (unlikely(was_full)) {
- wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+ wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}

@@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
* since we've done any required wakeups and there's no need
* to mark anything accessed. And we've dropped the lock.
*/
- if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0)
+ if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
return -ERESTARTSYS;

__pipe_lock(pipe);
was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+ wake_next_reader = true;
}
+ if (pipe_empty(pipe->head, pipe->tail))
+ wake_next_reader = false;
__pipe_unlock(pipe);

if (was_full) {
- wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+ wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
+ if (wake_next_reader)
+ wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
if (ret > 0)
file_accessed(filp);
return ret;
@@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
size_t total_len = iov_iter_count(from);
ssize_t chars;
bool was_empty = false;
+ bool wake_next_writer = false;

/* Null write succeeds. */
if (unlikely(total_len == 0))
@@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* it, either the reader will consume it or it'll still
* be there for the next write.
*/
- spin_lock_irq(&pipe->wait.lock);
+ spin_lock_irq(&pipe->rd_wait.lock);

head = pipe->head;
if (pipe_full(head, pipe->tail, pipe->max_usage)) {
- spin_unlock_irq(&pipe->wait.lock);
+ spin_unlock_irq(&pipe->rd_wait.lock);
continue;
}

pipe->head = head + 1;
- spin_unlock_irq(&pipe->wait.lock);
+ spin_unlock_irq(&pipe->rd_wait.lock);

/* Insert it into the buffer array */
buf = &pipe->bufs[head & mask];
@@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
*/
__pipe_unlock(pipe);
if (was_empty) {
- wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
+ wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
- wait_event_interruptible(pipe->wait, pipe_writable(pipe));
+ wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
__pipe_lock(pipe);
was_empty = pipe_empty(pipe->head, pipe->tail);
+ wake_next_writer = true;
}
out:
+ if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+ wake_next_writer = false;
__pipe_unlock(pipe);

/*
@@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
* wake up pending jobs
*/
if (was_empty) {
- wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
+ wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
+ if (wake_next_writer)
+ wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
int err = file_update_time(filp);
if (err)
@@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait)
unsigned int head, tail;

/*
- * Reading only -- no need for acquiring the semaphore.
+ * Reading pipe state only -- no need for acquiring the semaphore.
*
* But because this is racy, the code has to add the
* entry to the poll table _first_ ..
*/
- poll_wait(filp, &pipe->wait, wait);
+ if (filp->f_mode & FMODE_READ)
+ poll_wait(filp, &pipe->rd_wait, wait);
+ if (filp->f_mode & FMODE_WRITE)
+ poll_wait(filp, &pipe->wr_wait, wait);

/*
* .. and only then can you do the racy tests. That way,
@@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file)
pipe->writers--;

if (pipe->readers || pipe->writers) {
- wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
+ wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
+ wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void)
GFP_KERNEL_ACCOUNT);

if (pipe->bufs) {
- init_waitqueue_head(&pipe->wait);
+ init_waitqueue_head(&pipe->rd_wait);
+ init_waitqueue_head(&pipe->wr_wait);
pipe->r_counter = pipe->w_counter = 1;
pipe->max_usage = pipe_bufs;
pipe->ring_size = pipe_bufs;
@@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)

static void wake_up_partner(struct pipe_inode_info *pipe)
{
- wake_up_interruptible(&pipe->wait);
+ wake_up_interruptible(&pipe->rd_wait);
+ wake_up_interruptible(&pipe->wr_wait);
}

static int fifo_open(struct inode *inode, struct file *filp)
@@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp)

err_rd:
if (!--pipe->readers)
- wake_up_interruptible(&pipe->wait);
+ wake_up_interruptible(&pipe->wr_wait);
ret = -ERESTARTSYS;
goto err;

err_wr:
if (!--pipe->writers)
- wake_up_interruptible(&pipe->wait);
+ wake_up_interruptible(&pipe->rd_wait);
ret = -ERESTARTSYS;
goto err;

@@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
pipe->max_usage = nr_slots;
pipe->tail = tail;
pipe->head = head;
- wake_up_interruptible_all(&pipe->wait);
+ wake_up_interruptible_all(&pipe->rd_wait);
+ wake_up_interruptible_all(&pipe->wr_wait);
return pipe->max_usage * PAGE_SIZE;

out_revert_acct:
diff --git a/fs/splice.c b/fs/splice.c
index 3009652a41c85..d671936d0aad6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = {
static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
{
smp_mb();
- if (waitqueue_active(&pipe->wait))
- wake_up_interruptible(&pipe->wait);
+ if (waitqueue_active(&pipe->rd_wait))
+ wake_up_interruptible(&pipe->rd_wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}

@@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
{
smp_mb();
- if (waitqueue_active(&pipe->wait))
- wake_up_interruptible(&pipe->wait);
+ if (waitqueue_active(&pipe->wr_wait))
+ wake_up_interruptible(&pipe->wr_wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}

diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index dbcfa68923842..d5765039652a5 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -47,7 +47,7 @@ struct pipe_buffer {
**/
struct pipe_inode_info {
struct mutex mutex;
- wait_queue_head_t wait;
+ wait_queue_head_t rd_wait, wr_wait;
unsigned int head;
unsigned int tail;
unsigned int max_usage;
--
2.20.1


2020-02-18 09:53:06

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

Hi Linus,

This patch breaks one of CRIU tests. Here is a small reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
int p[2];
pid_t p1, p2;
int status;

if (pipe(p) == -1)
return 1;

p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);

return 0;
}

Here are two readers which are waiting for data but only one of them
will be woken up after closing the last writer.

The quick fix looks like this:

diff --git a/fs/pipe.c b/fs/pipe.c
index 5a34d6c22d4c..deaf67239a18 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -412,7 +412,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
- if (wake_next_reader)
+ if (!pipe->writers || wake_next_reader)
wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
if (ret > 0)
file_accessed(filp);

I've checked that it fixes the issue, but It is too late today to read
this code carefully, so I could skip something.

Thanks,
Andrei

On Fri, Feb 14, 2020 at 8:03 AM Sasha Levin <[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
>
> [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]
>
> This makes the pipe code use separate wait-queues and exclusive waiting
> for readers and writers, avoiding a nasty thundering herd problem when
> there are lots of readers waiting for data on a pipe (or, less commonly,
> lots of writers waiting for a pipe to have space).
>
> While this isn't a common occurrence in the traditional "use a pipe as a
> data transport" case, where you typically only have a single reader and
> a single writer process, there is one common special case: using a pipe
> as a source of "locking tokens" rather than for data communication.
>
> In particular, the GNU make jobserver code ends up using a pipe as a way
> to limit parallelism, where each job consumes a token by reading a byte
> from the jobserver pipe, and releases the token by writing a byte back
> to the pipe.
>
> This pattern is fairly traditional on Unix, and works very well, but
> will waste a lot of time waking up a lot of processes when only a single
> reader needs to be woken up when a writer releases a new token.
>
> A simplified test-case of just this pipe interaction is to create 64
> processes, and then pass a single token around between them (this
> test-case also intentionally passes another token that gets ignored to
> test the "wake up next" logic too, in case anybody wonders about it):
>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
> int fd[2], counters[2];
>
> pipe(fd);
> counters[0] = 0;
> counters[1] = -1;
> write(fd[1], counters, sizeof(counters));
>
> /* 64 processes */
> fork(); fork(); fork(); fork(); fork(); fork();
>
> do {
> int i;
> read(fd[0], &i, sizeof(i));
> if (i < 0)
> continue;
> counters[0] = i+1;
> write(fd[1], counters, (1+(i & 1)) *sizeof(int));
> } while (counters[0] < 1000000);
> return 0;
> }
>
> and in a perfect world, passing that token around should only cause one
> context switch per transfer, when the writer of a token causes a
> directed wakeup of just a single reader.
>
> But with the "writer wakes all readers" model we traditionally had, on
> my test box the above case causes more than an order of magnitude more
> scheduling: instead of the expected ~1M context switches, "perf stat"
> shows
>
> 231,852.37 msec task-clock # 15.857 CPUs utilized
> 11,250,961 context-switches # 0.049 M/sec
> 616,304 cpu-migrations # 0.003 M/sec
> 1,648 page-faults # 0.007 K/sec
> 1,097,903,998,514 cycles # 4.735 GHz
> 120,781,778,352 instructions # 0.11 insn per cycle
> 27,997,056,043 branches # 120.754 M/sec
> 283,581,233 branch-misses # 1.01% of all branches
>
> 14.621273891 seconds time elapsed
>
> 0.018243000 seconds user
> 3.611468000 seconds sys
>
> before this commit.
>
> After this commit, I get
>
> 5,229.55 msec task-clock # 3.072 CPUs utilized
> 1,212,233 context-switches # 0.232 M/sec
> 103,951 cpu-migrations # 0.020 M/sec
> 1,328 page-faults # 0.254 K/sec
> 21,307,456,166 cycles # 4.074 GHz
> 12,947,819,999 instructions # 0.61 insn per cycle
> 2,881,985,678 branches # 551.096 M/sec
> 64,267,015 branch-misses # 2.23% of all branches
>
> 1.702148350 seconds time elapsed
>
> 0.004868000 seconds user
> 0.110786000 seconds sys
>
> instead. Much better.
>
> [ Note! This kernel improvement seems to be very good at triggering a
> race condition in the make jobserver (in GNU make 4.2.1) for me. It's
> a long known bug that was fixed back in June 2017 by GNU make commit
> b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
> avoid hangs.").
>
> But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
> so a number of distributions may still have the buggy version. Some
> have backported the fix to their 4.2.1 release, though, and even
> without the fix it's quite timing-dependent whether the bug actually
> is hit. ]
>
> Josh Triplett says:
> "I've been hammering on your pipe fix patch (switching to exclusive
> wait queues) for a month or so, on several different systems, and I've
> run into no issues with it. The patch *substantially* improves
> parallel build times on large (~100 CPU) systems, both with parallel
> make and with other things that use make's pipe-based jobserver.
>
> All current distributions (including stable and long-term stable
> distributions) have versions of GNU make that no longer have the
> jobserver bug"
>
> Tested-by: Josh Triplett <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Sasha Levin <[email protected]>
> ---
> fs/coredump.c | 4 +--
> fs/pipe.c | 67 +++++++++++++++++++++++++--------------
> fs/splice.c | 8 ++---
> include/linux/pipe_fs_i.h | 2 +-
> 4 files changed, 51 insertions(+), 30 deletions(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index b1ea7dfbd1494..f8296a82d01df 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -517,7 +517,7 @@ static void wait_for_dump_helpers(struct file *file)
> pipe_lock(pipe);
> pipe->readers++;
> pipe->writers--;
> - wake_up_interruptible_sync(&pipe->wait);
> + wake_up_interruptible_sync(&pipe->rd_wait);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> pipe_unlock(pipe);
>
> @@ -525,7 +525,7 @@ static void wait_for_dump_helpers(struct file *file)
> * We actually want wait_event_freezable() but then we need
> * to clear TIF_SIGPENDING and improve dump_interrupted().
> */
> - wait_event_interruptible(pipe->wait, pipe->readers == 1);
> + wait_event_interruptible(pipe->rd_wait, pipe->readers == 1);
>
> pipe_lock(pipe);
> pipe->readers--;
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 57502c3c0fba1..5a34d6c22d4ce 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -108,16 +108,19 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
> /* Drop the inode semaphore and wait for a pipe event, atomically */
> void pipe_wait(struct pipe_inode_info *pipe)
> {
> - DEFINE_WAIT(wait);
> + DEFINE_WAIT(rdwait);
> + DEFINE_WAIT(wrwait);
>
> /*
> * Pipes are system-local resources, so sleeping on them
> * is considered a noninteractive wait:
> */
> - prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
> + prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
> + prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
> pipe_unlock(pipe);
> schedule();
> - finish_wait(&pipe->wait, &wait);
> + finish_wait(&pipe->rd_wait, &rdwait);
> + finish_wait(&pipe->wr_wait, &wrwait);
> pipe_lock(pipe);
> }
>
> @@ -286,7 +289,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> size_t total_len = iov_iter_count(to);
> struct file *filp = iocb->ki_filp;
> struct pipe_inode_info *pipe = filp->private_data;
> - bool was_full;
> + bool was_full, wake_next_reader = false;
> ssize_t ret;
>
> /* Null read succeeds. */
> @@ -344,10 +347,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>
> if (!buf->len) {
> pipe_buf_release(pipe, buf);
> - spin_lock_irq(&pipe->wait.lock);
> + spin_lock_irq(&pipe->rd_wait.lock);
> tail++;
> pipe->tail = tail;
> - spin_unlock_irq(&pipe->wait.lock);
> + spin_unlock_irq(&pipe->rd_wait.lock);
> }
> total_len -= chars;
> if (!total_len)
> @@ -384,7 +387,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> * no data.
> */
> if (unlikely(was_full)) {
> - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> }
>
> @@ -394,18 +397,23 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> * since we've done any required wakeups and there's no need
> * to mark anything accessed. And we've dropped the lock.
> */
> - if (wait_event_interruptible(pipe->wait, pipe_readable(pipe)) < 0)
> + if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
> return -ERESTARTSYS;
>
> __pipe_lock(pipe);
> was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
> + wake_next_reader = true;
> }
> + if (pipe_empty(pipe->head, pipe->tail))
> + wake_next_reader = false;
> __pipe_unlock(pipe);
>
> if (was_full) {
> - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
> + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> }
> + if (wake_next_reader)
> + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
> if (ret > 0)
> file_accessed(filp);
> return ret;
> @@ -437,6 +445,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
> size_t total_len = iov_iter_count(from);
> ssize_t chars;
> bool was_empty = false;
> + bool wake_next_writer = false;
>
> /* Null write succeeds. */
> if (unlikely(total_len == 0))
> @@ -515,16 +524,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
> * it, either the reader will consume it or it'll still
> * be there for the next write.
> */
> - spin_lock_irq(&pipe->wait.lock);
> + spin_lock_irq(&pipe->rd_wait.lock);
>
> head = pipe->head;
> if (pipe_full(head, pipe->tail, pipe->max_usage)) {
> - spin_unlock_irq(&pipe->wait.lock);
> + spin_unlock_irq(&pipe->rd_wait.lock);
> continue;
> }
>
> pipe->head = head + 1;
> - spin_unlock_irq(&pipe->wait.lock);
> + spin_unlock_irq(&pipe->rd_wait.lock);
>
> /* Insert it into the buffer array */
> buf = &pipe->bufs[head & mask];
> @@ -576,14 +585,17 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
> */
> __pipe_unlock(pipe);
> if (was_empty) {
> - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
> + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> }
> - wait_event_interruptible(pipe->wait, pipe_writable(pipe));
> + wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
> __pipe_lock(pipe);
> was_empty = pipe_empty(pipe->head, pipe->tail);
> + wake_next_writer = true;
> }
> out:
> + if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> + wake_next_writer = false;
> __pipe_unlock(pipe);
>
> /*
> @@ -596,9 +608,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
> * wake up pending jobs
> */
> if (was_empty) {
> - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
> + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> }
> + if (wake_next_writer)
> + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
> int err = file_update_time(filp);
> if (err)
> @@ -642,12 +656,15 @@ pipe_poll(struct file *filp, poll_table *wait)
> unsigned int head, tail;
>
> /*
> - * Reading only -- no need for acquiring the semaphore.
> + * Reading pipe state only -- no need for acquiring the semaphore.
> *
> * But because this is racy, the code has to add the
> * entry to the poll table _first_ ..
> */
> - poll_wait(filp, &pipe->wait, wait);
> + if (filp->f_mode & FMODE_READ)
> + poll_wait(filp, &pipe->rd_wait, wait);
> + if (filp->f_mode & FMODE_WRITE)
> + poll_wait(filp, &pipe->wr_wait, wait);
>
> /*
> * .. and only then can you do the racy tests. That way,
> @@ -706,7 +723,8 @@ pipe_release(struct inode *inode, struct file *file)
> pipe->writers--;
>
> if (pipe->readers || pipe->writers) {
> - wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
> + wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
> + wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> }
> @@ -789,7 +807,8 @@ struct pipe_inode_info *alloc_pipe_info(void)
> GFP_KERNEL_ACCOUNT);
>
> if (pipe->bufs) {
> - init_waitqueue_head(&pipe->wait);
> + init_waitqueue_head(&pipe->rd_wait);
> + init_waitqueue_head(&pipe->wr_wait);
> pipe->r_counter = pipe->w_counter = 1;
> pipe->max_usage = pipe_bufs;
> pipe->ring_size = pipe_bufs;
> @@ -1007,7 +1026,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
>
> static void wake_up_partner(struct pipe_inode_info *pipe)
> {
> - wake_up_interruptible(&pipe->wait);
> + wake_up_interruptible(&pipe->rd_wait);
> + wake_up_interruptible(&pipe->wr_wait);
> }
>
> static int fifo_open(struct inode *inode, struct file *filp)
> @@ -1118,13 +1138,13 @@ static int fifo_open(struct inode *inode, struct file *filp)
>
> err_rd:
> if (!--pipe->readers)
> - wake_up_interruptible(&pipe->wait);
> + wake_up_interruptible(&pipe->wr_wait);
> ret = -ERESTARTSYS;
> goto err;
>
> err_wr:
> if (!--pipe->writers)
> - wake_up_interruptible(&pipe->wait);
> + wake_up_interruptible(&pipe->rd_wait);
> ret = -ERESTARTSYS;
> goto err;
>
> @@ -1251,7 +1271,8 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
> pipe->max_usage = nr_slots;
> pipe->tail = tail;
> pipe->head = head;
> - wake_up_interruptible_all(&pipe->wait);
> + wake_up_interruptible_all(&pipe->rd_wait);
> + wake_up_interruptible_all(&pipe->wr_wait);
> return pipe->max_usage * PAGE_SIZE;
>
> out_revert_acct:
> diff --git a/fs/splice.c b/fs/splice.c
> index 3009652a41c85..d671936d0aad6 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -165,8 +165,8 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = {
> static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
> {
> smp_mb();
> - if (waitqueue_active(&pipe->wait))
> - wake_up_interruptible(&pipe->wait);
> + if (waitqueue_active(&pipe->rd_wait))
> + wake_up_interruptible(&pipe->rd_wait);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> }
>
> @@ -462,8 +462,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
> static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
> {
> smp_mb();
> - if (waitqueue_active(&pipe->wait))
> - wake_up_interruptible(&pipe->wait);
> + if (waitqueue_active(&pipe->wr_wait))
> + wake_up_interruptible(&pipe->wr_wait);
> kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> }
>
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index dbcfa68923842..d5765039652a5 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -47,7 +47,7 @@ struct pipe_buffer {
> **/
> struct pipe_inode_info {
> struct mutex mutex;
> - wait_queue_head_t wait;
> + wait_queue_head_t rd_wait, wr_wait;
> unsigned int head;
> unsigned int tail;
> unsigned int max_usage;
> --
> 2.20.1
>

2020-02-18 17:37:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 1:51 AM Andrei Vagin <[email protected]> wrote:
>
> This patch breaks one of CRIU tests. Here is a small reproducer:

Good catch.

> The quick fix looks like this:

That one works, but is not really right.

The things that change the number of readers or writers should simply
use "wake_up_all()".

I thought we did that already, but no - there _was_ one place where we
did it, but that was for the pipe buffer size case, and in that case
it's actually pointless. That case acts just like a "new space or data
was added"

So I think the right fix is the attached patch. Since you had such a
lovely test-case, let me go test it too ;)

Linus


Attachments:
patch.diff (1.20 kB)

2020-02-18 17:56:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 9:36 AM Linus Torvalds
<[email protected]> wrote:
>
> The things that change the number of readers or writers should simply
> use "wake_up_all()".
>
> So I think the right fix is the attached patch. Since you had such a
> lovely test-case, let me go test it too ;)

Good that I did. I missed the _real_ case of this - pipe_release().
Because that used a different wakeup function.

In fact, that case uses wake_up_interruptible_sync_poll(), which
doesn't have the "all" version.

But it doesn't actually need that fancy thing, since it's only meant
for "let's avoid waking up things that don't need these poll keys",
and the whole point is that now we're closing the pipe so we should
wake up everybody.

And in fact the test for "are there readers or writers" was
nonsensical. We shouldn't wake up readers just because they still
exist. We should wake up readers only if they exist, _and_ there are
no writers left (and vice versa).

Anyway, new patch attached. This hasn't been tested either, but I'll
let you know if it's broken too ;)

Linus


Attachments:
patch.diff (1.87 kB)

2020-02-18 18:18:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 9:54 AM Linus Torvalds
<[email protected]> wrote:
>
> Anyway, new patch attached. This hasn't been tested either, but I'll
> let you know if it's broken too ;)

That one looks good. Some small cosmetic edits later, and with a
commit log it looks like the appended.

If you're testing this and it works for your full CRIU test case too,
I'll add your tested-by if I get it before I end up pushing things out
later today,

Linus


Attachments:
0001-pipe-make-sure-to-wake-up-everybody-when-the-last-re.patch (4.66 kB)

2020-02-18 18:22:29

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 10:17:30AM -0800, Linus Torvalds wrote:
> @@ -722,9 +722,10 @@ pipe_release(struct inode *inode, struct file *file)
> if (file->f_mode & FMODE_WRITE)
> pipe->writers--;
>
> - if (pipe->readers || pipe->writers) {
> - wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLHUP);
> - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM | EPOLLERR | EPOLLHUP);
> + /* Was that the last reader or writer, but not the other side? */
> + if (!pipe->readers != !pipe->writers) {
> + wake_up_interruptible_all(&pipe->rd_wait);
> + wake_up_interruptible_all(&pipe->wr_wait);
> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> }
> @@ -1026,8 +1027,8 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
>
> static void wake_up_partner(struct pipe_inode_info *pipe)
> {
> - wake_up_interruptible(&pipe->rd_wait);
> - wake_up_interruptible(&pipe->wr_wait);
> + wake_up_interruptible_all(&pipe->rd_wait);
> + wake_up_interruptible_all(&pipe->wr_wait);
> }

You don't want to move wake_up_partner() up and call it from pipe_release()?

2020-02-18 18:30:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <[email protected]> wrote:
>
> You don't want to move wake_up_partner() up and call it from pipe_release()?

I was actually thinking of going the other way - two of three users of
wake_up_partner() are redundantly waking up the wrong side, and the
third user is pointlessly written too.

So I was _thinking_ of a patch like the appended (which is on top of
the previous patch), but ended up not doing it. Until you brought it
up.

But I won't bother committing this, since it shouldn't really matter.

Linus


Attachments:
patch.diff (1.62 kB)

2020-02-18 18:54:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Fri, Feb 14, 2020 at 8:00 AM Sasha Levin <[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
>
> [ Upstream commit 0ddad21d3e99c743a3aa473121dc5561679e26bb ]
>
> This makes the pipe code use separate wait-queues and exclusive waiting
> for readers and writers, [..]

Oh, and since I didn't react initially, let me react now that Andrei
found a bug here: why was this patch auto-selected for 5.5 stable in
the first place?

It wasn't really a fix, and there's no Fixes: tag or stable tag in
there. Yeah, there's a reference to an old commit, but that one isn't
even a kernel commit.

Yeah, the performance improvements are quite nice if you hit the case
this matters for, but still..

Linus

2020-02-18 22:35:06

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 10:28:23AM -0800, Linus Torvalds wrote:
> On Tue, Feb 18, 2020 at 10:20 AM Matthew Wilcox <[email protected]> wrote:
> >
> > You don't want to move wake_up_partner() up and call it from pipe_release()?
>
> I was actually thinking of going the other way - two of three users of
> wake_up_partner() are redundantly waking up the wrong side, and the
> third user is pointlessly written too.
>
> So I was _thinking_ of a patch like the appended (which is on top of
> the previous patch), but ended up not doing it. Until you brought it
> up.
>
> But I won't bother committing this, since it shouldn't really matter.

I run CRIU tests on the kernel with both these patches. Everything work
as expected. Thank you for the fix.

>
> Linus

> fs/pipe.c | 18 ++++++------------
> 1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 2144507447c5..79ba61430f9c 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1025,12 +1025,6 @@ static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
> return cur == *cnt ? -ERESTARTSYS : 0;
> }
>
> -static void wake_up_partner(struct pipe_inode_info *pipe)
> -{
> - wake_up_interruptible_all(&pipe->rd_wait);
> - wake_up_interruptible_all(&pipe->wr_wait);
> -}
> -
> static int fifo_open(struct inode *inode, struct file *filp)
> {
> struct pipe_inode_info *pipe;
> @@ -1078,7 +1072,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
> */
> pipe->r_counter++;
> if (pipe->readers++ == 0)
> - wake_up_partner(pipe);
> + wake_up_interruptible_all(&pipe->wr_wait);
>
> if (!is_pipe && !pipe->writers) {
> if ((filp->f_flags & O_NONBLOCK)) {
> @@ -1104,7 +1098,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
>
> pipe->w_counter++;
> if (!pipe->writers++)
> - wake_up_partner(pipe);
> + wake_up_interruptible_all(&pipe->rd_wait);
>
> if (!is_pipe && !pipe->readers) {
> if (wait_for_partner(pipe, &pipe->r_counter))
> @@ -1120,12 +1114,12 @@ static int fifo_open(struct inode *inode, struct file *filp)
> * the process can at least talk to itself.
> */
>
> - pipe->readers++;
> - pipe->writers++;
> + if (pipe->readers++ == 0)
> + wake_up_interruptible_all(&pipe->wr_wait);
> + if (pipe->writers++ == 0)
> + wake_up_interruptible_all(&pipe->rd_wait);
> pipe->r_counter++;
> pipe->w_counter++;
> - if (pipe->readers == 1 || pipe->writers == 1)
> - wake_up_partner(pipe);
> break;
>
> default:

2020-02-18 23:04:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <[email protected]> wrote:
>
> I run CRIU tests on the kernel with both these patches. Everything work
> as expected.

Thanks. I've added your tested-by and pushed out the fix.

Linus

2020-03-05 18:20:41

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

Hello Linus,

After this change, one more criu test became flaky. This is due to one
of corner cases, so I am not sure that we need to fix something in the
kernel. I have fixed this issue in the test. I am not sure that this
will affect any real applications.

Here is the reproducer:

#include <unistd.h>
#include <stdio.h>

int main()
{
char buf[1<<20];
int pid, p[2], ret;

if (pipe(p) < 0)
return 1;
pid = fork();
if (pid == 0) {
close(p[1]);

ret = read(p[0], buf, sizeof(buf));
if (ret < 0)
return 1;
printf("read -> %d\n", ret);
return 0;
}
close(p[0]);
ret = write(p[1], buf, sizeof(buf));
if (ret < 0)
return 1;
printf("write -> %d\n", ret);
return 0;
}

Before this change:
[avagin@laptop fifo]$ uname -a
Linux laptop 5.3.7-200.fc30.x86_64 #1 SMP Fri Oct 18 20:13:59 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux
[avagin@laptop fifo]$ strace -e read,write,pipe -f ./pipe_bigbuf
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"...,
832) = 832
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"...,
68) = 68
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0gZ\316<\240z\v\206=\360\37F\32{\t\204"...,
68) = 68
pipe([3, 4]) = 0
strace: Process 622350 attached
[pid 622349] write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576 <unfinished ...>
[pid 622350] read(3, <unfinished ...>
[pid 622349] <... write resumed> ) = 1048576
[pid 622350] <... read resumed>
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576) = 1048576
[pid 622349] write(1, "write -> 1048576\n", 17write -> 1048576
) = 17
[pid 622350] write(1, "read -> 1048576\n", 16read -> 1048576
) = 16
[pid 622349] +++ exited with 0 +++
+++ exited with 0 +++

After this change:
[root@fc24 ~]# strace -e read,write,pipe -f ./pipe_bigbuf
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260r\2\0\0\0\0\0"...,
832) = 832
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"...,
68) = 68
read(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"...,
784) = 784
read(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0",
32) = 32
read(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0L\355\265_\4c\17r@ix\305q\26W\242"...,
68) = 68
pipe([3, 4]) = 0
strace: Process 4946 attached
[pid 4945] write(4,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576 <unfinished ...>
[pid 4946] read(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576) = 65536
[pid 4946] write(1, "read -> 65536\n", 14read -> 65536
) = 14
[pid 4945] <... write resumed>) = 131072
[pid 4946] +++ exited with 0 +++
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=4945, si_uid=0} ---
+++ killed by SIGPIPE +++

On Tue, Feb 18, 2020 at 3:03 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Feb 18, 2020 at 2:33 PM Andrei Vagin <[email protected]> wrote:
> >
> > I run CRIU tests on the kernel with both these patches. Everything work
> > as expected.
>
> Thanks. I've added your tested-by and pushed out the fix.
>
> Linus

2020-03-05 18:41:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <[email protected]> wrote:
>
> After this change, one more criu test became flaky. This is due to one
> of corner cases, so I am not sure that we need to fix something in the
> kernel. I have fixed this issue in the test. I am not sure that this
> will affect any real applications.

It's an interesting test-case, but it's really not doing anything you
should rely on.

The code basically expects a pipe write() to be "atomic" for something
bigger than PIPE_BUF. We've never really guaranteed that (and POSIX
doesn't), but we had this special case where readers would continue to
read as long as there was an active writer, which kind of approximated
that for some cases (and your test-case in particular).

A reader that wants to read everything should do multiple read() calls
until it gets an EOF (or gets the expected buffer size). A regular
read() on a pipe can simply always return a partial buffer (it will
always do so in the case of signals, but what you're seeing is that it
will also now do it if the kernel buffers emptied even when there was
a writer that was ready to fill them again).

And I suspect it wasn't actually the commit you point to that changes
the behavior, I think it's actually the "pipe: remove
'waiting_writers' merging logic" that changed behavior for your test

But because it's then timing-dependent on whether the reader gets all
the data or not, it might bisect to any commit after that point.
Particularly since some of the other commits change timing too..

We could re-introduce the "continue reading while there are active
writers" logic, but if this is the only test that triggers that, I'd
prefer to wait until some real user notices...

But if CRIU itself depends on this behavior (rather than just a test),
then I guess we need to.

So is it just a test-case, or does CRIU itself depend on that "reads
get full buffers"? As mentioned, that really _is_ fundamentally broken
if there is any chance of signals..

Linus

2020-03-05 19:54:49

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH AUTOSEL 5.5 542/542] pipe: use exclusive waits when reading or writing

On Thu, Mar 5, 2020 at 10:41 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Mar 5, 2020 at 12:20 PM Andrei Vagin <[email protected]> wrote:
> >
> > After this change, one more criu test became flaky. This is due to one
> > of corner cases, so I am not sure that we need to fix something in the
> > kernel. I have fixed this issue in the test. I am not sure that this
> > will affect any real applications.
>
> It's an interesting test-case, but it's really not doing anything you
> should rely on.

I'm agree with this.

> But if CRIU itself depends on this behavior (rather than just a test),
> then I guess we need to.
>
> So is it just a test-case, or does CRIU itself depend on that "reads
> get full buffers"? As mentioned, that really _is_ fundamentally broken
> if there is any chance of signals..

No, it doesn't. I'm agree that we can wait an report from a real app.

Thanks!
Andrei