2019-12-05 22:31:26

by David Howells

[permalink] [raw]
Subject: [PATCH 0/2] pipe: Fixes [ver #2]


Hi Linus, Eric,

Here are a couple of patches to fix bugs syzbot found in the pipe changes:

(1) An assertion check will sometimes trip when polling a pipe because the
ring size and indices used are approximate and may be being changed
simultaneously.

An equivalent approximate calculation was done previously, but without
the assertion check, so I've just dropped the check. To make it
accurate, the pipe mutex would need to be taken or the spin lock could
be used - but usage of the spinlock would need to be rolled out into
splice, iov_iter and other places for that.

(2) The index mask and the max_usage values cannot be cached across
pipe_wait() as F_SETPIPE_SZ could have been called during the wait.
This can cause pipe_write() to break.

David
---
David Howells (2):
pipe: Remove assertion from pipe_poll()
pipe: Fix missing mask update after pipe_wait()


fs/pipe.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)


2019-12-05 22:31:36

by David Howells

[permalink] [raw]
Subject: [PATCH 1/2] pipe: Remove assertion from pipe_poll() [ver #2]

An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size. However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ to
intervene and muck up the calculation, thereby causing the oops.

Fix this by simply removing the assertion and accepting that the
calculation might be approximate.

Note that the previous code also had a similar issue, though there was no
assertion check, since the occupancy counter and the ring size were not
read with a lock held, so it's possible that the poll check might have
malfunctioned then too.

Also wake up all the waiters so that they can reissue their checks if there
was a competing read or write.

Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: [email protected]
Signed-off-by: David Howells <[email protected]>
cc: Eric Biggers <[email protected]>
---

fs/pipe.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 648ce440ca85..da782ee251d2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -579,8 +579,6 @@ pipe_poll(struct file *filp, poll_table *wait)

poll_wait(filp, &pipe->wait, wait);

- BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
-
/* Reading only -- no need for acquiring the semaphore. */
mask = 0;
if (filp->f_mode & FMODE_READ) {
@@ -1176,6 +1174,7 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
pipe->max_usage = nr_slots;
pipe->tail = tail;
pipe->head = head;
+ wake_up_interruptible_all(&pipe->wait);
return pipe->max_usage * PAGE_SIZE;

out_revert_acct:

2019-12-05 22:32:47

by David Howells

[permalink] [raw]
Subject: [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() [ver #2]

Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter function
drops the pipe lock, thereby allowing F_SETPIPE_SZ change them. Without
this, pipe_write() may subsequently miscalculate the array indices and pipe
fullness, leading to an oops like the following:

BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
...
CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
...
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x113/0x167 lib/dump_stack.c:113
print_address_description.constprop.8.cold.10+0x9/0x31d mm/kasan/report.c:374
__kasan_report.cold.11+0x1b/0x3a mm/kasan/report.c:506
kasan_report+0x12/0x20 mm/kasan/common.c:634
__asan_report_store8_noabort+0x17/0x20 mm/kasan/generic_report.c:137
pipe_write+0xc25/0xe10 fs/pipe.c:481
call_write_iter include/linux/fs.h:1895 [inline]
new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
__vfs_write+0x94/0x110 fs/read_write.c:496
vfs_write+0x18a/0x520 fs/read_write.c:558
ksys_write+0x105/0x220 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x6e/0xb0 fs/read_write.c:620
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is not a problem for pipe_read() as the mask is recalculated on each
pass of the loop, after pipe_wait() has been called.

Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: [email protected]
Signed-off-by: David Howells <[email protected]>
cc: Eric Biggers <[email protected]>
---

fs/pipe.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index da782ee251d2..8061b093140d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -389,7 +389,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
- unsigned int head, max_usage, mask;
+ unsigned int head;
ssize_t ret = 0;
int do_wakeup = 0;
size_t total_len = iov_iter_count(from);
@@ -408,13 +408,12 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
}

head = pipe->head;
- max_usage = pipe->max_usage;
- mask = pipe->ring_size - 1;

/* We try to merge small writes */
chars = total_len & (PAGE_SIZE-1); /* size of the last buffer */
if (!pipe_empty(head, pipe->tail) && chars != 0) {
- struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
+ struct pipe_buffer *buf =
+ &pipe->bufs[(head - 1) & (pipe->ring_size - 1)];
int offset = buf->offset + buf->len;

if (pipe_buf_can_merge(buf) && offset + chars <= PAGE_SIZE) {
@@ -443,8 +442,9 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
}

head = pipe->head;
- if (!pipe_full(head, pipe->tail, max_usage)) {
- struct pipe_buffer *buf = &pipe->bufs[head & mask];
+ if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
+ struct pipe_buffer *buf =
+ &pipe->bufs[head & (pipe->ring_size - 1)];
struct page *page = pipe->tmp_page;
int copied;

@@ -465,7 +465,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
spin_lock_irq(&pipe->wait.lock);

head = pipe->head;
- if (pipe_full(head, pipe->tail, max_usage)) {
+ if (pipe_full(head, pipe->tail, pipe->max_usage)) {
spin_unlock_irq(&pipe->wait.lock);
continue;
}
@@ -484,7 +484,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);

/* Insert it into the buffer array */
- buf = &pipe->bufs[head & mask];
+ buf = &pipe->bufs[head & (pipe->ring_size - 1)];
buf->page = page;
buf->ops = &anon_pipe_buf_ops;
buf->offset = 0;
@@ -510,7 +510,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
break;
}

- if (!pipe_full(head, pipe->tail, max_usage))
+ if (!pipe_full(head, pipe->tail, pipe->max_usage))
continue;

/* Wait for buffer space to become available. */

2019-12-06 00:00:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] pipe: Fix missing mask update after pipe_wait() [ver #2]

On Thu, Dec 5, 2019 at 2:30 PM David Howells <[email protected]> wrote:
>
> - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
> + struct pipe_buffer *buf =
> + &pipe->bufs[(head - 1) & (pipe->ring_size - 1)];

I changed the two occurrences of this to use a local temporary "mask"
variable, to avoid the long lines.

It's no longer _caching_ the value, but it makes the code more legible.

Linus

2019-12-06 13:57:06

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Thu, Dec 05, 2019 at 10:30:22PM +0000, David Howells wrote:
> David Howells (2):
> pipe: Remove assertion from pipe_poll()
> pipe: Fix missing mask update after pipe_wait()

For reference, I've retested current master (b0d4beaa5a4b7d), that
incldes the 2 pipe fixes, the test still hangs.

The stack now points to pipe_wait but otherwise seems to be the same:

[<0>] pipe_wait+0x72/0xc0
[<0>] pipe_write+0x217/0x4b0
[<0>] new_sync_write+0x11e/0x1c0
[<0>] vfs_write+0xc1/0x1d0
[<0>] kernel_write+0x2c/0x40
[<0>] send_cmd+0x78/0xf0 [btrfs]
[<0>] send_extent_data+0x4af/0x52a [btrfs]
[<0>] process_extent+0xe5d/0xeb4 [btrfs]
[<0>] changed_cb+0xcf5/0xd2f [btrfs]
[<0>] send_subvol+0x8af/0xc0d [btrfs]
[<0>] btrfs_ioctl_send+0xe2b/0xef0 [btrfs]
[<0>] _btrfs_ioctl_send+0x80/0x110 [btrfs]
[<0>] btrfs_ioctl+0x14b8/0x3120 [btrfs]
[<0>] do_vfs_ioctl+0xa1/0x750
[<0>] ksys_ioctl+0x70/0x80
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x56/0x240
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

(gdb) l *(pipe_wait+0x72)
0x622 is in pipe_wait (fs/pipe.c:120).
115 * is considered a noninteractive wait:
116 */
117 prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
118 pipe_unlock(pipe);
119 schedule();
120 finish_wait(&pipe->wait, &wait);
121 pipe_lock(pipe);
122 }
123
124 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,

2019-12-06 17:10:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 5:56 AM David Sterba <[email protected]> wrote:
>
> For reference, I've retested current master (b0d4beaa5a4b7d), that
> incldes the 2 pipe fixes, the test still hangs.

Yeah, they are the same fixes that you already tested, just slightly
reworded (and with a WARN_ON actually removed, but you never hit that
one).

Does the btrfs test that fails actually require a btrfs filesystem?

Linus

2019-12-06 17:43:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 9:09 AM Linus Torvalds
<[email protected]> wrote:
>
> Does the btrfs test that fails actually require a btrfs filesystem?

Nope.

DavidH, you can do this:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
cd btrfs-progs
./autogen.sh
make

and it looks like that

make TEST=016\* test-misc

should work - all you need is btrfs and loop built as modules.

Linus

2019-12-06 19:00:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Hmm. I think I just saw this same bug with a plain kernel compile.

My "make -j32" suddenly came to a crawl, and seems to be entirely
single-threaded.

And that's almost certainly because the way 'make' handles load
distribution is with a network of pipes that has a token passed to the
sub-makes.

So there's most definitely something wrong with the new pipe rework.
Well, I can't _guarantee_ the pipes are the cause of this, but it does
smell like it.

Linus

2019-12-06 20:32:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 5:56 AM David Sterba <[email protected]> wrote:
>
> For reference, I've retested current master (b0d4beaa5a4b7d), that
> incldes the 2 pipe fixes, the test still hangs.

I think I found it.

TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be
completely wrong. And might not fix it.

The first hunk is purely syntactic sugar - use the normal head/tail
order. The second/third hunk is what I think fixes the problem:
iter_file_splice_write() had the same buggy "let's cache
head/tail/mask" pattern as pipe_write() had.

You can't cache them over a 'pipe_wait()' that drops the pipe lock,
and there's one in splice_from_pipe_next().

Linus

--- snip snip --

diff --git a/fs/splice.c b/fs/splice.c
index f2400ce7d528..fa1f3773c8cd 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -495,7 +495,7 @@ static int splice_from_pipe_feed(struct
pipe_inode_info *pipe, struct splice_des
unsigned int mask = pipe->ring_size - 1;
int ret;

- while (!pipe_empty(tail, head)) {
+ while (!pipe_empty(head, tail)) {
struct pipe_buffer *buf = &pipe->bufs[tail & mask];

sd->len = buf->len;
@@ -711,9 +711,7 @@ iter_file_splice_write(struct pipe_inode_info
*pipe, struct file *out,
splice_from_pipe_begin(&sd);
while (sd.total_len) {
struct iov_iter from;
- unsigned int head = pipe->head;
- unsigned int tail = pipe->tail;
- unsigned int mask = pipe->ring_size - 1;
+ unsigned int head, tail, mask;
size_t left;
int n;

@@ -732,6 +730,10 @@ iter_file_splice_write(struct pipe_inode_info
*pipe, struct file *out,
}
}

+ head = pipe->head;
+ tail = pipe->tail;
+ mask = pipe->ring_size - 1;
+
/* build the vector */
left = sd.total_len;
for (n = 0; !pipe_empty(head, tail) && left && n <
nbufs; tail++, n++) {

2019-12-06 21:05:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 12:28 PM Linus Torvalds
<[email protected]> wrote:
>
> I think I found it.
>
> TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be
> completely wrong. And might not fix it.

Well, it compiles, and it's obviously correct, so I've committed it.

It doesn't fix my "kernel compiles go single-threaded" issue. Which is
not surprising - make doesn't use splice(), it just reads and writes
single characters (the main make server writes a "+" character for
each available job, and parallel sub-makes will read one for each job
they start and write one when done - or something very close to that).

I think that is related to the other pipe changes, though - there were
some wakeup changes in there too. The btrfs problem was bisected to
the original commit, which is what I think my patch fixes.

Linus

2019-12-06 21:28:40

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> wrote:

> I think I found it.
>
> TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be
> completely wrong. And might not fix it.
>
> The first hunk is purely syntactic sugar - use the normal head/tail
> order. The second/third hunk is what I think fixes the problem:
> iter_file_splice_write() had the same buggy "let's cache
> head/tail/mask" pattern as pipe_write() had.
>
> You can't cache them over a 'pipe_wait()' that drops the pipe lock,
> and there's one in splice_from_pipe_next().

That seems to fix it.

Tested-by: David Howells <[email protected]>

2019-12-07 03:51:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 1:04 PM Linus Torvalds
<[email protected]> wrote:
>
> It doesn't fix my "kernel compiles go single-threaded" issue.

I ended up just bisecting it.

The "make goes slow" problem bisects down to b667b8673443 ("pipe:
Advance tail pointer inside of wait spinlock in pipe_read()").

I assume it's the added "do_wakeup = 0" (not the spinlock) that ends
up having some subtle issue.

Very strange, but there it is.

Linus

2019-12-07 04:03:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds
<[email protected]> wrote:
>
> I assume it's the added "do_wakeup = 0" (not the spinlock) that ends
> up having some subtle issue.

Ahh, and then later that is removed, but when it is removed it also
remote the wakeup before the pipe_wait(). So whatever issue that
commit introduces ends up remaining.

I wonder if the extra wakeups ended up hiding some other bug. We do
extra wakeups on the write side too, with a "FIXME! Is this really
true?" comment..

Linus

2019-12-07 21:32:02

by Akemi Yagi

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On 12/6/19 10:59 AM, Linus Torvalds wrote:
> Hmm. I think I just saw this same bug with a plain kernel compile.
>
> My "make -j32" suddenly came to a crawl, and seems to be entirely
> single-threaded.
>
> And that's almost certainly because the way 'make' handles load
> distribution is with a network of pipes that has a token passed to the
> sub-makes.
>
> So there's most definitely something wrong with the new pipe rework.
> Well, I can't _guarantee_ the pipes are the cause of this, but it does
> smell like it.
>
> Linus
>

We encountered what seems to be the same problem when building a kernel
under RHEL 8.

As it turned out, this was due to a bug in make (make-4.2.1-9.el8) and
a patch is available. Details are found in this Red Hat bugzilla report:

https://bugzilla.redhat.com/show_bug.cgi?id=1774790

Akemi

2019-12-07 22:50:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds
<[email protected]> wrote:
>
> The "make goes slow" problem bisects down to b667b8673443 ("pipe:
> Advance tail pointer inside of wait spinlock in pipe_read()").

I'm not entirely sure that ends up being 100% true. It did bisect to
that, but the behavior wasn't entirely stable. There definitely is
some nasty timing trigger.

But I did finally figure out what seems to have been going on with at
least the biggest part of the build performance regression. It's seems
to be a nasty interaction with the scheduler and the GNU make
jobserver, and in particular the pipe wakeups really _really_ do seem
to want to be synchronous both for the readers and the writers.

When a writer wakes up a reader, we want the reader to react quickly
and vice versa. The most obvious case was for the GNU make jobserver,
where sub-makes would do a single-byte write to the jobserver pipe,
and we want to wake up the reader *immediatly*, because the reader is
actually a lot more important than the writer. The reader is what gets
the next job going, the writer just got done with the last one.

And when a reader empties a full pipe, it's because the writer is
generating data, and you want to just get the writer going again asap.

Anyway, I've spent way too much time looking at this and wondering
about odd performance patterns. It seems to be mostly back up to
normal.

I say "mostly", because I still see times of "not as many concurrent
compiles going as I'd expect". It might be a kbuild problem, it might
be an issue with GNU make (I've seen problems with the make jobserver
wanting many more tokens than expected before and the kernel makefiles
- it migth be about deep subdirectories etc), and it might be some
remaining pipe issue. But my allmodconfig builds aren't _enormously_
slower than they used to be.

But there's definitely some unhappy interaction with the jobserver. I
have 16 threads (8 cores with HT), and I generally use "make -j32" to
keep them busy because the jobserver isn't great. The pipe rework made
even that 2x slop not work all that well. Something held on to tokens
too long, and there was definitely some interaction with the pipe
wakeup code. Using "-j64" hid the problem, but it was a problem.

It might be the new scheduler balancing changes that are interacting
with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc,
because I hadn't realized just how important the sync wakeup seems to
be for pipe performance even at a big level.

I've pushed out my pipe changes. I really didn't want to do that kind
of stuff at the end of the merge window, but I spent a lot more time
than I wanted looking at this code, because I was getting to the point
where the alternative was to just revert it all.

DavidH, give these a look:

85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO
a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic
f467a6a66419 pipe: fix and clarify pipe read wakeup logic
1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic
ad910e36da4c pipe: fix poll/select race introduced by the pipe rework

the top two of which are purely "I'm fed up looking at this code, this
needs to go" kind of changes.

In particular, that last change is because I think the GNU jobserver
problem is partly a thundering herd issue: when a job token becomes
free (ie somebody does a one-byte write to an empty jobserver pipe),
it wakes up *everybody* who is waiting for a token. One of them will
get it, and the others will go to sleep again. And then it repeats all
over. I didn't fix it, but it _could_ be fixed with exclusive waits
for readers/writers, but that means more smarts than pipe_wait() can
do. And because the jobserver isn't great at keeping everybody happy,
I'm using a much bigger "make -jX" value than the number of CPU's I
have, which makes the herd bigger. And I suspect none of this helps
the scheduler pick the _right_ process to run, which just makes
scheduling an even bigger problem.

Linus

2019-12-08 16:46:54

by Akemi Yagi

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Sat, Dec 7, 2019 at 1:31 PM Akemi Yagi <[email protected]> wrote:
>
> On 12/6/19 10:59 AM, Linus Torvalds wrote:
> > Hmm. I think I just saw this same bug with a plain kernel compile.
> >
> > My "make -j32" suddenly came to a crawl, and seems to be entirely
> > single-threaded.
>
> We encountered what seems to be the same problem when building a kernel
> under RHEL 8.
>
> As it turned out, this was due to a bug in make (make-4.2.1-9.el8) and
> a patch is available. Details are found in this Red Hat bugzilla report:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1774790

I forgot to mention that the aforementioned bug in make was originally
reported for Fedora. They now have a patched version of make
(make-4.2.1-15.fc32) in Rawhide.

Akemi

2019-12-08 18:06:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Sun, Dec 8, 2019 at 8:46 AM Akemi Yagi <[email protected]> wrote:
>
> I forgot to mention that the aforementioned bug in make was originally
> reported for Fedora. They now have a patched version of make
> (make-4.2.1-15.fc32) in Rawhide.

Yes, as mentioned I do expect it's some jobserver interaction - I've
seen bad jobserver behavior before. But my 'make' hasn't been upgraded
since May, as far as I can tell from my logs, so the huge performance
regression was new.

So the huge slowdown is real, and was triggered by the kernel changes.
But yes, it's entirely possible - even likely - that it's due to a
jobserver bug that ends up being sensitive to scheduling.

We do treat even "user space is buggy and shouldn't have done that" as
a regression.

It's annoying. Timing-related issues are some of the worst to debug.
Particularly when (at least I) primarily debug by looking at the code
and trying to see what's wrong that way.

Linus

2019-12-09 03:14:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Sun, Dec 8, 2019 at 10:04 AM Linus Torvalds
<[email protected]> wrote:
>
> On Sun, Dec 8, 2019 at 8:46 AM Akemi Yagi <[email protected]> wrote:
> >
> > I forgot to mention that the aforementioned bug in make was originally
> > reported for Fedora. They now have a patched version of make
> > (make-4.2.1-15.fc32) in Rawhide.
>
> Yes, as mentioned I do expect it's some jobserver interaction - I've
> seen bad jobserver behavior before. But my 'make' hasn't been upgraded
> since May, as far as I can tell from my logs, so the huge performance
> regression was new.

Interestingly, when I'm trying to do the non-thundering-herd pipe
reads and writes, a side effect of that is that the read target is
"fair" (ie when there are multiple concurrent waiting read() calls,
it's always the first one that gets it).

And that seems to really mess up the jobserver bug, and triggers it
every single time.

Fairness is often bad for locking throughput, but this is on another
level entirely, and yeah, I see a lot of "defunct" shell and make
processes, so it does seem to be the make bug.

(Adding scheduler people to the participants list, because this patch
is the "avoid using the return value of
wait_event_interruptible_exclusive()" version because I think
wait_event_interruptible_exclusive() is itself buggy).

I built a new version of 'make' from source, and that one seems to
work fine with the attached patch. But the standard Fedora 'make'
binary completely hates this patch.

Sad. The patch does seem to be the RightThing(tm) to do, but this make
bug makes it inadvisable to apply it unless you want to play with
things.

Linus


Attachments:
patch.diff (9.39 kB)

2019-12-09 10:04:59

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Sat, 7 Dec 2019 at 23:48, Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > The "make goes slow" problem bisects down to b667b8673443 ("pipe:
> > Advance tail pointer inside of wait spinlock in pipe_read()").
>
> I'm not entirely sure that ends up being 100% true. It did bisect to
> that, but the behavior wasn't entirely stable. There definitely is
> some nasty timing trigger.
>
> But I did finally figure out what seems to have been going on with at
> least the biggest part of the build performance regression. It's seems
> to be a nasty interaction with the scheduler and the GNU make
> jobserver, and in particular the pipe wakeups really _really_ do seem
> to want to be synchronous both for the readers and the writers.
>
> When a writer wakes up a reader, we want the reader to react quickly
> and vice versa. The most obvious case was for the GNU make jobserver,
> where sub-makes would do a single-byte write to the jobserver pipe,
> and we want to wake up the reader *immediatly*, because the reader is
> actually a lot more important than the writer. The reader is what gets
> the next job going, the writer just got done with the last one.
>
> And when a reader empties a full pipe, it's because the writer is
> generating data, and you want to just get the writer going again asap.
>
> Anyway, I've spent way too much time looking at this and wondering
> about odd performance patterns. It seems to be mostly back up to
> normal.
>
> I say "mostly", because I still see times of "not as many concurrent
> compiles going as I'd expect". It might be a kbuild problem, it might
> be an issue with GNU make (I've seen problems with the make jobserver
> wanting many more tokens than expected before and the kernel makefiles
> - it migth be about deep subdirectories etc), and it might be some
> remaining pipe issue. But my allmodconfig builds aren't _enormously_
> slower than they used to be.
>
> But there's definitely some unhappy interaction with the jobserver. I
> have 16 threads (8 cores with HT), and I generally use "make -j32" to
> keep them busy because the jobserver isn't great. The pipe rework made
> even that 2x slop not work all that well. Something held on to tokens
> too long, and there was definitely some interaction with the pipe
> wakeup code. Using "-j64" hid the problem, but it was a problem.
>
> It might be the new scheduler balancing changes that are interacting
> with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc,
> because I hadn't realized just how important the sync wakeup seems to
> be for pipe performance even at a big level.

Which version of make should I use to reproduce the problem ?
My setup is not the same and my make is a bit old but I haven't been
able to reproduce the problem described above on my arm64 octa cores
system and v5.5-rc1.
All cores are busy with -j16. And even -j8 keeps the cores almost always busy

>
> I've pushed out my pipe changes. I really didn't want to do that kind
> of stuff at the end of the merge window, but I spent a lot more time
> than I wanted looking at this code, because I was getting to the point
> where the alternative was to just revert it all.
>
> DavidH, give these a look:
>
> 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO
> a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic
> f467a6a66419 pipe: fix and clarify pipe read wakeup logic
> 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic
> ad910e36da4c pipe: fix poll/select race introduced by the pipe rework
>
> the top two of which are purely "I'm fed up looking at this code, this
> needs to go" kind of changes.
>
> In particular, that last change is because I think the GNU jobserver
> problem is partly a thundering herd issue: when a job token becomes
> free (ie somebody does a one-byte write to an empty jobserver pipe),
> it wakes up *everybody* who is waiting for a token. One of them will
> get it, and the others will go to sleep again. And then it repeats all
> over. I didn't fix it, but it _could_ be fixed with exclusive waits
> for readers/writers, but that means more smarts than pipe_wait() can
> do. And because the jobserver isn't great at keeping everybody happy,
> I'm using a much bigger "make -jX" value than the number of CPU's I
> have, which makes the herd bigger. And I suspect none of this helps
> the scheduler pick the _right_ process to run, which just makes
> scheduling an even bigger problem.
>
> Linus

2019-12-09 14:56:13

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Fri, Dec 06, 2019 at 01:04:01PM -0800, Linus Torvalds wrote:
> On Fri, Dec 6, 2019 at 12:28 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think I found it.
> >
> > TOTALLY UNTESTED patch appended. It's whitespace-damaged and may be
> > completely wrong. And might not fix it.
>
> Well, it compiles, and it's obviously correct, so I've committed it.

Thanks for the fix, works for me too.

2019-12-09 17:50:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

[ Added DJ to the participants, since he seems to be the Fedora make
maintainer - DJ, any chance that this absolutely horrid 'make' buf can
be fixed in older versions too, not just rawhide? The bugfix is two
and a half years old by now, and the bug looks real and very serious ]

On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot
<[email protected]> wrote:
>
> Which version of make should I use to reproduce the problem ?

So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30.
I'm assuming it's fairly plain 4.2.1, but I didn't try to look into
the source rpm or anything like that.

The working one for me was just the top of -git from

https://git.savannah.gnu.org/git/make.git

which is 4.2.92 right now.

The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking
read with pselect to avoid hangs") as per Akemi. That is indeed after
4.2.1, and it looks real.

Before that commit the buggy jobserver code basically does

(1) use pselect() to wait for readable and see child deaths atomically
(2) use blocking read to get the token

and while (1) is atomic, if the child death happens between the two,
it goes into the blocking read and has SIGCHLD blocked, so it will try
to read the token from the token pipe, but it will never react to the
child death - and the child death is what is going to _release_ a
token.

So what seems to happen is that when the right timing triggers, you
end up with a lot of sub-makes waiting for a token, but they are also
all supposed to _release_ a token. So you don't have enough tokens to
go around. In the worst case, _everybody_ who has a token is also not
releasing it, and then you end up triggering the timeout code (after
one second), which will make things really go into a crawl.

And by a crawl I mean that worst-case you really end up with just one
job per second per sub-make. It will take _hours_ to compile the
kernel at that speed, when it normally finishes in 15 minutes on my
machine even when I do a from-scratch allmodconfig build.

It does seem to be a major bug in the jobserver code. In particular
with the trial fair and exclusive wakeup patch that I sent out in the
other thread, it seems to be _reliably_ much worse and triggers 100%
of the time for me.

It's possible that my trial patch is buggy, but everything else looks
fine, and with a fixed make the trial patch works for me.

I'll include the trial patch here too, I think I cc'd you on the other
thread too, but hey..

Anyway, it looks like the sync wakeup thing is more of a "get timing
right by luck" thing than anything else. Possibly it actually causes
the reverse order of reader wakeups more often (ie the most _recent_
reader is most likely to get woken up synchronously) and that may be
what really ends up masking the jobserver problem, since apparently
doing wakeups in the fair and proper order makes things much worse..

What a horrible pain that pipe rework ended up being. But I think
we're in better shape now than we used to be, it just had very
unfortunate timing issues and several real bugs.

But sadly, there's no way I can push that fair pipe wakeup thing as
long as this horribly buggy version of make is widespread.

Linus


Attachments:
patch.diff (9.39 kB)

2019-12-09 17:58:04

by Akemi Yagi

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Mon, Dec 9, 2019 at 9:49 AM Linus Torvalds
<[email protected]> wrote:
>
> [ Added DJ to the participants, since he seems to be the Fedora make
> maintainer - DJ, any chance that this absolutely horrid 'make' buf can
> be fixed in older versions too, not just rawhide? The bugfix is two
> and a half years old by now, and the bug looks real and very serious ]
(snip)
> But sadly, there's no way I can push that fair pipe wakeup thing as
> long as this horribly buggy version of make is widespread.
>
> Linus

In addition to the Fedora make-4.2.1-4.fc27 (1) mentioned by Linus,
RHEL 8 make-4.2.1-9.el8 (2) is affected. The patch applied to Fedora
make (3) has been confirmed to fix the issue in RHEL's make.

Those are the only real-world examples I know of. I have no idea how
widespread this thing is...

Akemi

(1) https://bugzilla.redhat.com/show_bug.cgi?id=1556839
(2) https://bugzilla.redhat.com/show_bug.cgi?id=1774790
(3) https://git.savannah.gnu.org/cgit/make.git/commit/?id=b552b05251980f693c729e251f93f5225b400714

2019-12-09 18:21:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Mon, Dec 9, 2019 at 9:57 AM Akemi Yagi <[email protected]> wrote:
>
> In addition to the Fedora make-4.2.1-4.fc27 (1) mentioned by Linus,
> RHEL 8 make-4.2.1-9.el8 (2) is affected. The patch applied to Fedora
> make (3) has been confirmed to fix the issue in RHEL's make.
>
> Those are the only real-world examples I know of. I have no idea how
> widespread this thing is...

Looks like opensuse and ubuntu are also on 4.2.1 according to

https://software.opensuse.org/package/make
https://packages.ubuntu.com/cosmic/make

so apparently the bug is almost universal with the big three sharing
this buggy version.

Linus

2019-12-09 18:25:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Mon, Dec 9, 2019 at 10:18 AM Linus Torvalds
<[email protected]> wrote:
>
> Looks like opensuse and ubuntu are also on 4.2.1 according to
>
> https://software.opensuse.org/package/make
> https://packages.ubuntu.com/cosmic/make
>
> so apparently the bug is almost universal with the big three sharing
> this buggy version.

And the reason seems to be is that it's considered the latest "release" version.

In the git tree, I see 4.2.92, but looking at

https://ftp.gnu.org/gnu/make/

it looks like 4.2.1 is the latest actual "release".

Oh well. I can't find a workaround for the bug, other than perhaps
using "make -LX" instead of "make -jX". Which is not the same thing at
all, of course.

Linus

2019-12-10 02:59:06

by DJ Delorie

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> writes:
> [ Added DJ to the participants, since he seems to be the Fedora make
> maintainer - DJ, any chance that this absolutely horrid 'make' buf can
> be fixed in older versions too, not just rawhide? The bugfix is two
> and a half years old by now, and the bug looks real and very serious ]

I've got builds ready for F30 and F31 but my local testing shows no
difference in kernel build times with/without. I was hoping someone
would test the rawhide build[*] and say "yup, that fixes it" in case
some other even older patch is also needed.

F29 just went EOL so I can't fix that one.

[*] https://koji.fedoraproject.org/koji/buildinfo?buildID=1420394

2019-12-10 14:39:30

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Mon, 9 Dec 2019 at 18:48, Linus Torvalds
<[email protected]> wrote:
>
> [ Added DJ to the participants, since he seems to be the Fedora make
> maintainer - DJ, any chance that this absolutely horrid 'make' buf can
> be fixed in older versions too, not just rawhide? The bugfix is two
> and a half years old by now, and the bug looks real and very serious ]
>
> On Mon, Dec 9, 2019 at 1:54 AM Vincent Guittot
> <[email protected]> wrote:
> >
> > Which version of make should I use to reproduce the problem ?
>
> So the problematic one is "make-4.2.1-13.fc30.x86_64" in Fedora 30.
> I'm assuming it's fairly plain 4.2.1, but I didn't try to look into
> the source rpm or anything like that.

I'm using Debian buster and the make package is version: 4.2.1-1.2 for
arm64. It doesn't have the commit you mentioned below but I don't see
the problem on my platform and all 8 cpus are used with -j 16 or even
-j 9

>
> The working one for me was just the top of -git from
>
> https://git.savannah.gnu.org/git/make.git
>
> which is 4.2.92 right now.
>
> The fix is presumably commit b552b05 ("[SV 51159] Use a non-blocking
> read with pselect to avoid hangs") as per Akemi. That is indeed after
> 4.2.1, and it looks real.
>
> Before that commit the buggy jobserver code basically does
>
> (1) use pselect() to wait for readable and see child deaths atomically
> (2) use blocking read to get the token
>
> and while (1) is atomic, if the child death happens between the two,
> it goes into the blocking read and has SIGCHLD blocked, so it will try
> to read the token from the token pipe, but it will never react to the
> child death - and the child death is what is going to _release_ a
> token.
>
> So what seems to happen is that when the right timing triggers, you

That can explain why I can't see the problem on my platform

> end up with a lot of sub-makes waiting for a token, but they are also
> all supposed to _release_ a token. So you don't have enough tokens to
> go around. In the worst case, _everybody_ who has a token is also not
> releasing it, and then you end up triggering the timeout code (after
> one second), which will make things really go into a crawl.
>
> And by a crawl I mean that worst-case you really end up with just one
> job per second per sub-make. It will take _hours_ to compile the
> kernel at that speed, when it normally finishes in 15 minutes on my
> machine even when I do a from-scratch allmodconfig build.
>
> It does seem to be a major bug in the jobserver code. In particular
> with the trial fair and exclusive wakeup patch that I sent out in the
> other thread, it seems to be _reliably_ much worse and triggers 100%
> of the time for me.
>
> It's possible that my trial patch is buggy, but everything else looks
> fine, and with a fixed make the trial patch works for me.
>
> I'll include the trial patch here too, I think I cc'd you on the other
> thread too, but hey..
>
> Anyway, it looks like the sync wakeup thing is more of a "get timing
> right by luck" thing than anything else. Possibly it actually causes
> the reverse order of reader wakeups more often (ie the most _recent_
> reader is most likely to get woken up synchronously) and that may be
> what really ends up masking the jobserver problem, since apparently
> doing wakeups in the fair and proper order makes things much worse..
>
> What a horrible pain that pipe rework ended up being. But I think
> we're in better shape now than we used to be, it just had very
> unfortunate timing issues and several real bugs.
>
> But sadly, there's no way I can push that fair pipe wakeup thing as
> long as this horribly buggy version of make is widespread.
>
> Linus

2019-12-10 17:41:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Tue, Dec 10, 2019 at 6:38 AM Vincent Guittot
<[email protected]> wrote:
>
> On Mon, 9 Dec 2019 at 18:48, Linus Torvalds
> <[email protected]> wrote:
> >
> > Before that commit the buggy jobserver code basically does
> >
> > (1) use pselect() to wait for readable and see child deaths atomically
> > (2) use blocking read to get the token
> >
> > and while (1) is atomic, if the child death happens between the two,
> > it goes into the blocking read and has SIGCHLD blocked, so it will try
> > to read the token from the token pipe, but it will never react to the
> > child death - and the child death is what is going to _release_ a
> > token.
> >
> > So what seems to happen is that when the right timing triggers, you
>
> That can explain why I can't see the problem on my platform

Note that the above is kind of simplified.

It actually needs a bit more to trigger..

To lose _one_ token, you need to have a sub-make basically hit this race:

- the pselect() needs to say that the pipe is readable, so there is
at least one token

- another sub-make comes along and steals the very last token

- but because pselect returned "readable" (no SIGCHLD yet), the
read() starts and now blocks because all the jobserver tokens are gone
again due to the other sub-make stealing the last one.

- before a new token comes in, the child exits, and now because the
sub-make is blocking for reads (and because the jobserver blocks
SIGCHILD in general outside of the pselect), it doesn't react, so it
won't release the token that the child holds.

but notice how any _other_ sub-make then releasing a token will get
things going again, so the _common_situation is that the jobserver bug
only causes a slight dip in concurrency.

Hitting it _once_ is trivial.

Losing several tokens at once is also not that hard: you don't need to
hit the race many times, it's enough to hit the exact same race just
once - just with several sub-makes at the same time.

And that "lose several tokens at once" isn't as unlikely as you'd
think: they are all doing the same thing, and they all saw the free
token with "pselect()", they all did a "read()". And since it's common
for the tokens to be gone, the common case is that _one_ of the
waiting sub-makes got the read, and the N other sub-makes did not, and
went into the blocking read(). And they all had children that were
about to finish, and finished before the next token got available.

So losing quite a few tokens is easy.

This has actually gone on for a long time, and I just never debugged
it. My solution has been "I have 16 threads (8 core with HT), but I'll
use -j32, and it is all good".

I bet you can see it too - the buggy jobserver just means that the
load isn't as high as you'd expect. Just run 'top' while the make is
going.

With the _fixed_ jobserver, if I do "make -j32", I will actually see a
load that is over 32 (I do nothing but kernel compiles and occasional
reboots during the merge window, so I have the kernel in my cache, so
there's no IO, I have a fast SSD so writeback doesn't cause any delays
either etc etc, and I have my browser and a few other things going).

With the buggy one, even before the pipe rework, I would see a load
that tended to fluctuate around 16.

Because due to the bug you have a few locked-up tokens at times, so
instead of getting a load of 32 when you use "make -j32", you get a
load of maybe 20. Or maybe less.

The pipe re-work made it much easier to trigger the "almost all the
tokens are gone" for some reason. And the "fair readers()" patch I
have seems to make it _really_ easy to trigger the case where
absolutely all the tokens were gone and it goes into a single-thread
mode. I'm not sure I really ever saw the 1-second timeout trigger, but
it was slow.

But it _is_ dependent on timing, so somebody else with a different
load - or a different machine - might not see it to nearly the same
degree.

I bet you see the load value difference, though, even if you don't
necessarily see enough idle CPU time to see much of a difference in
compile times. After all, once you have all CPU's busy, it doesn't
matter if you have a load of 16 or a load of 32 - the higher load
won't make the compile go any faster.

Linus

2019-12-11 18:11:23

by DJ Delorie

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> writes:
> [ Added DJ to the participants, since he seems to be the Fedora make
> maintainer - DJ, any chance that this absolutely horrid 'make' buf can
> be fixed in older versions too, not just rawhide? The bugfix is two
> and a half years old by now, and the bug looks real and very serious ]

Builds for F30 and F31 are in bodhi, waiting on testing and karma...
https://bodhi.fedoraproject.org/updates/FEDORA-2019-bd81ed62bf
https://bodhi.fedoraproject.org/updates/FEDORA-2019-a056aa61d4

F29 is already EOL.

2019-12-11 19:00:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 11, 2019 at 10:09 AM DJ Delorie <[email protected]> wrote:
>
> Builds for F30 and F31 are in bodhi, waiting on testing and karma...

Thanks DJ,

Linus

2019-12-11 20:56:37

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> wrote:

> DavidH, give these a look:
>
> 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO
> a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic
> f467a6a66419 pipe: fix and clarify pipe read wakeup logic
> 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic
> ad910e36da4c pipe: fix poll/select race introduced by the pipe rework
>
> the top two of which are purely "I'm fed up looking at this code, this
> needs to go" kind of changes.

They look reasonable.

Is it worth reverting:

commit f94df9890e98f2090c6a8d70c795134863b70201
Add wake_up_interruptible_sync_poll_locked()

since you changed the code that was calling that new function and so it's no
longer called?

David

2019-12-12 01:31:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 11, 2019 at 12:55 PM David Howells <[email protected]> wrote:
>
> Is it worth reverting:
>
> commit f94df9890e98f2090c6a8d70c795134863b70201
> Add wake_up_interruptible_sync_poll_locked()
>
> since you changed the code that was calling that new function and so it's no
> longer called?

You are sure you won't want that for the notification queue cases? I
guess they'll never want to "sync" part..

Looking at the regular pipe read/write code, maybe we'll want to try
it again - do the wakeup while we already have the spinlock, rather
than later. But I have this suspicion that that might just then push
things into mutex contention, so who knows..

Regardless, it's not going to happen for 5.5, so I guess we could
revert it and if we ever end up trying it again we can resurrect it.

Linus

2019-12-12 07:35:37

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> wrote:

> You are sure you won't want that for the notification queue cases? I
> guess they'll never want to "sync" part..

Actually, that's a good point, I do call it in post_one_notification() since I
have to hold the lock anyway. But, as you say, I'm not sure whether I need
the sync variant.

David

2019-12-12 10:19:18

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On 08/12/2019 01.47, Linus Torvalds wrote:
> On Fri, Dec 6, 2019 at 7:50 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> The "make goes slow" problem bisects down to b667b8673443 ("pipe:
>> Advance tail pointer inside of wait spinlock in pipe_read()").
>
> I'm not entirely sure that ends up being 100% true. It did bisect to
> that, but the behavior wasn't entirely stable. There definitely is
> some nasty timing trigger.
>
> But I did finally figure out what seems to have been going on with at
> least the biggest part of the build performance regression. It's seems
> to be a nasty interaction with the scheduler and the GNU make
> jobserver, and in particular the pipe wakeups really _really_ do seem
> to want to be synchronous both for the readers and the writers.
>
> When a writer wakes up a reader, we want the reader to react quickly
> and vice versa. The most obvious case was for the GNU make jobserver,
> where sub-makes would do a single-byte write to the jobserver pipe,
> and we want to wake up the reader *immediatly*, because the reader is
> actually a lot more important than the writer. The reader is what gets
> the next job going, the writer just got done with the last one.
>
> And when a reader empties a full pipe, it's because the writer is
> generating data, and you want to just get the writer going again asap.
>
> Anyway, I've spent way too much time looking at this and wondering
> about odd performance patterns. It seems to be mostly back up to
> normal.
>
> I say "mostly", because I still see times of "not as many concurrent
> compiles going as I'd expect". It might be a kbuild problem, it might
> be an issue with GNU make (I've seen problems with the make jobserver
> wanting many more tokens than expected before and the kernel makefiles
> - it migth be about deep subdirectories etc), and it might be some
> remaining pipe issue. But my allmodconfig builds aren't _enormously_
> slower than they used to be.
>
> But there's definitely some unhappy interaction with the jobserver. I
> have 16 threads (8 cores with HT), and I generally use "make -j32" to
> keep them busy because the jobserver isn't great. The pipe rework made
> even that 2x slop not work all that well. Something held on to tokens
> too long, and there was definitely some interaction with the pipe
> wakeup code. Using "-j64" hid the problem, but it was a problem.
>
> It might be the new scheduler balancing changes that are interacting
> with the pipe thing. I'm adding PeterZ, Ingo and Vincent to the cc,
> because I hadn't realized just how important the sync wakeup seems to
> be for pipe performance even at a big level.
>
> I've pushed out my pipe changes. I really didn't want to do that kind
> of stuff at the end of the merge window, but I spent a lot more time
> than I wanted looking at this code, because I was getting to the point
> where the alternative was to just revert it all.
>
> DavidH, give these a look:
>
b> 85190d15f4ea pipe: don't use 'pipe_wait() for basic pipe IO
> a28c8b9db8a1 pipe: remove 'waiting_writers' merging logic
> f467a6a66419 pipe: fix and clarify pipe read wakeup logic
> 1b6b26ae7053 pipe: fix and clarify pipe write wakeup logic
> ad910e36da4c pipe: fix poll/select race introduced by the pipe rework

commit f467a6a66419 pipe: fix and clarify pipe read wakeup logic
killed "wake writer when buffer becomes half empty" part added by
commit cefa80ced57a ("pipe: Increase the writer-wakeup threshold to reduce context-switch count").

I suppose that was unintentional. Jobserver juggles with few bytes and
should never reach half/full buffer thresholds.

Also reader should wake writer with sync wakeup only if buffer is empty.
Otherwise sync wakeup adds couple unneeded context switches.

>
> the top two of which are purely "I'm fed up looking at this code, this
> needs to go" kind of changes.
>
> In particular, that last change is because I think the GNU jobserver
> problem is partly a thundering herd issue: when a job token becomes
> free (ie somebody does a one-byte write to an empty jobserver pipe),
> it wakes up *everybody* who is waiting for a token. One of them will
> get it, and the others will go to sleep again. And then it repeats all
> over. I didn't fix it, but it _could_ be fixed with exclusive waits
> for readers/writers, but that means more smarts than pipe_wait() can
> do. And because the jobserver isn't great at keeping everybody happy,
> I'm using a much bigger "make -jX" value than the number of CPU's I
> have, which makes the herd bigger. And I suspect none of this helps
> the scheduler pick the _right_ process to run, which just makes
> scheduling an even bigger problem.
>
> Linus
>

2019-12-18 22:52:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Thu, Dec 12, 2019 at 2:18 AM Konstantin Khlebnikov
<[email protected]> wrote:
>
> commit f467a6a66419 pipe: fix and clarify pipe read wakeup logic
> killed "wake writer when buffer becomes half empty" part added by
> commit cefa80ced57a ("pipe: Increase the writer-wakeup threshold to reduce context-switch count").
>
> I suppose that was unintentional. Jobserver juggles with few bytes and
> should never reach half/full buffer thresholds.

It wasn'tunintentional - the rest of the cleanups should mean that we
never wake things up unnecessarily - we now only wake up if it _used_
to be 100% full, and we only do it once per read.

And it did it without the watermark, which I was worried about might
break something that "knew" the size of the pipe.

But performance testing would be good. Both for the "no unnecessary
wakeups" case, but also for the thundering herd issue.

To answer Josh's email in this same thread:

On Wed, Dec 18, 2019 at 12:59 PM Josh Triplett <[email protected]> wrote:
>
> Debian and Ubuntu have make 4.2.1-1.2, which includes "[SV 51159] Use a
> non-blocking read with pselect to avoid hangs." and various other fixes.
> https://metadata.ftp-master.debian.org/changelogs/main/m/make-dfsg/make-dfsg_4.2.1-1.2_changelog
> So, both Debian and Ubuntu should be fine with the pipe improvements.
> (I'm testing that now.)
>
> Is the version of your non-thundering-herd pipe wakeup patch attached to
> https://lore.kernel.org/lkml/CAHk-=wicgTacrHUJmSBbW9MYAdMPdrXzULPNqQ3G7+HkLeNf1Q@mail.gmail.com/
> still the best version to test performance with?

That's my latest version, but you'll have to tweak it a tiny bit
because of d1c6a2aa02af ("pipe: simplify signal handling in
pipe_read() and add comments") which I did after that patch.

The easiest way to resolve it is likely to revert that d1c6a2aa02af,
then apply the non-thundering-herd patch and then apply d1c6a2aa02af
again by hand - it's fairly straightforward (and you can return
-ERESTARTSYS directly if wait_event_interruptible_exclusive() fails,
because of all the same reasons why it coul dhappen without the
thundering-herd patch.

I can look at re-creating that patch if you find it to be too annoying.

Linus

2019-12-19 00:04:23

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 18, 2019 at 02:51:27PM -0800, Linus Torvalds wrote:
> On Wed, Dec 18, 2019 at 12:59 PM Josh Triplett <[email protected]> wrote:
> > Debian and Ubuntu have make 4.2.1-1.2, which includes "[SV 51159] Use a
> > non-blocking read with pselect to avoid hangs." and various other fixes.
> > https://metadata.ftp-master.debian.org/changelogs/main/m/make-dfsg/make-dfsg_4.2.1-1.2_changelog
> > So, both Debian and Ubuntu should be fine with the pipe improvements.
> > (I'm testing that now.)

I've now tested this, and 4.2.1-1.2 doesn't seem to have the same
problem that affects upstream 4.2.1. With Debian and Ubuntu already
fixed, and Fedora now in the process of getting fixed, that should cover
many users who would have the buggy make.

> > Is the version of your non-thundering-herd pipe wakeup patch attached to
> > https://lore.kernel.org/lkml/CAHk-=wicgTacrHUJmSBbW9MYAdMPdrXzULPNqQ3G7+HkLeNf1Q@mail.gmail.com/
> > still the best version to test performance with?
>
> That's my latest version, but you'll have to tweak it a tiny bit
> because of d1c6a2aa02af ("pipe: simplify signal handling in
> pipe_read() and add comments") which I did after that patch.

That's what I encountered, and I ended up manually fixing it up,
resulting in the attached patch. Does that look reasonable?

> The easiest way to resolve it is likely to revert that d1c6a2aa02af,
> then apply the non-thundering-herd patch and then apply d1c6a2aa02af
> again by hand - it's fairly straightforward (and you can return
> -ERESTARTSYS directly if wait_event_interruptible_exclusive() fails,
> because of all the same reasons why it coul dhappen without the
> thundering-herd patch.

I can try adding that additional return -ERESTARTSYS. Does that seem
likely to have a noticeable additional performance impact?

- Josh Triplett


Attachments:
(No filename) (1.81 kB)
linux-pipe-fix.patch (9.66 kB)
Download all attachments

2019-12-19 00:16:06

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 18, 2019 at 04:03:18PM -0800, Josh Triplett wrote:
> On Wed, Dec 18, 2019 at 02:51:27PM -0800, Linus Torvalds wrote:
> > That's my latest version, but you'll have to tweak it a tiny bit
> > because of d1c6a2aa02af ("pipe: simplify signal handling in
> > pipe_read() and add comments") which I did after that patch.
>
> That's what I encountered, and I ended up manually fixing it up,
> resulting in the attached patch. Does that look reasonable?

Er, wrong file. That's the original patch; the attached patch is the
right one.


Attachments:
(No filename) (551.00 B)
linux-pipe-fix.patch (9.67 kB)
Download all attachments

2019-12-19 00:53:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 18, 2019 at 4:14 PM Josh Triplett <[email protected]> wrote:
>
> Er, wrong file. That's the original patch; the attached patch is the
> right one.

This looks correct to me.

If I were to actually commit it, the "split into two waitqueues" would
be a separate patch from the "use wait_event_interruptible_exclusive()
and add "wake_next_reader/writer logic", but for testing purposes the
unified patch was simpler, and your forward port looks good to me.

I ran the original patch for a couple of days, and didn't see any
other issues than the 'make' thing in F30. It was all good with my
self-build make.

But that "ran for a couple of days" wasn't all that stress-full. I did
do the "verify that the thundering herd is gone" test - including that
silly test-case here again:

#include <unistd.h>

int main(int argc, char **argv)
{
int fd[2], counters[2];

pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));

/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();

do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}

where you can tweak the numbers - add another fork() or two to create
even more pipe waiters, and maybe change the final count exit value to
match whatever hw performance you have.

Linus

2019-12-19 00:57:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 18, 2019 at 4:51 PM Linus Torvalds
<[email protected]> wrote:
>
> This looks correct to me.

Note the "looks". I checked that it applied on current -git, and I
read through the patch, and checked that it almost matched my
original, but I didn't _test_ it in any way. Caveat emptor.

Linus

2019-12-19 07:57:33

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

Linus Torvalds <[email protected]> wrote:

> If I were to actually commit it, the "split into two waitqueues" would
> be a separate patch from the "use wait_event_interruptible_exclusive()
> and add "wake_next_reader/writer logic", but for testing purposes the
> unified patch was simpler, and your forward port looks good to me.

I looked at splitting the waitqueue in to two, but it makes poll tricky.

David

2019-12-19 16:38:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] pipe: Fixes [ver #2]

On Wed, Dec 18, 2019 at 11:56 PM David Howells <[email protected]> wrote:
>
> I looked at splitting the waitqueue in to two, but it makes poll tricky.

No, it's actually trivial for poll.

The thing is, poll can happily just add two wait-queues to the
poll_table. In my conversion, I just did

- poll_wait(filp, &pipe->wait, wait);
+ if (filp->f_mode & FMODE_READ)
+ poll_wait(filp, &pipe->rd_wait, wait);
+ if (filp->f_mode & FMODE_WRITE)
+ poll_wait(filp, &pipe->wr_wait, wait);

which changes the unconditional "add one" to two conditional adds.
They _could_ have been unconditional too, but why add unnecessary
wakeups? So it only really does it twice on named pipes (if opened for
reading and writing).

It's _unusual_ to add two wait-queues for a single poll, but it's not
wrong. The tty layer has always done it - exactly because it has
separate wait queues for reading and writing. Some other drivers do
too. Sometimes there's a separate wait queue for errors, sometimes
there are multiple wait-queues because there are events from the
"subsystem" and there are other events from the "device". I think
sound does the latter, for example.

And no, I don't particularly like the FMODE_READ/WRITE testing above -
it would be better to pass in the polling mask and ask "are we waiting
for polling for reading or writing?" instead of asking whether the
file descriptor was opened for read or write, but hey, it is what it
is.

Sadly, "poll()" doesn't really get passed the bitmask of what is being
waited on (it's there in the poll_tabvle "_key" field, but I don't
want to have the pipe code look into those kinds of details.

So the named pipe case could be improved, but it's not like anybody
really cares. Nobody uses named pipes any more (and few people ever
did). So I didn't worry about it.

Linus