2016-10-04 04:00:59

by Linus Torvalds

[permalink] [raw]
Subject: BUG_ON() in workingset_node_shadows_dec() triggers

I'm really sorry I applied that last series from Andrew just before
doing the 4.8 release, because they cause problems, and now it is in
4.8 (and that buggy crap is marked for stable too).

In particular, I just got this

kernel BUG at ./include/linux/swap.h:276

and the end result was a dead kernel.

The bug that commit 22f2ac51b6d64 ("mm: workingset: fix crash in
shadow node shrinker caused by replace_page_cache_page()") purports to
have fixed has apparently been there since 3.15, but the fix is
clearly worse than the bug it tried to fix, since that original bug
has never killed my machine!

I should have reacted to the damn added BUG_ON() lines. I suspect I
will have to finally just remove the idiotic BUG_ON() concept once and
for all, because there is NO F*CKING EXCUSE to knowingly kill the
kernel.

Why the hell was that not a *warning*?

Yes, I'm grumpy. This went in very late in the release candidates, and
I had higher expectations of things coming in through Andrew. Adding
random BUG_ON()'s to code that clearly hasn't had sufficient testing
is *not* acceptable, and it's definitely not acceptable to send that
to me after rc8 unless it has gotten a *lot* of testing, which it
clearly must not have had. Adding stable to the cc too to warn about
this.

The full report is

kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis
industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss
nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48
83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f>
0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100

and I hope somebody can see what is going wrong in there. The reason
the machine *dies* from that thing is that we end up then immediately
having a

BUG: unable to handle kernel paging request at ffffffffb70bdaa8
IP: blk_flush_plug_list+0x8b/0x250
Call Trace:
schedule+0x61/0x80
do_exit+0x8c8/0xae0
rewind_stack_do_exit+0x17/0x20

and then a

Fixing recursive fault but reboot is needed!

and the machine will never recover.

People who add random assert statements that kill machines should damn
well not be let near the VM layer.

Johannes? Please make this your first priority. And in the meantime I
will make that VM_BUG_ON() be a VM_WARN_ON_ONCE().

And dammit, if anybody else feels that they had done "debugging
messages with BUG_ON()", I would suggest you

(a) rethink your approach to programming

(b) send me patches to remove the crap entirely, or make them real
*DEBUGGING* messages, not "kill the whole machine" messages.

I've ranted against people using BUG_ON() for debugging in the past.
Why the f*ck does this still happen? And Andrew - please stop taking
those kinds of patches! Lookie here:

https://lwn.net/Articles/13183/

so excuse me for being upset that people still do this shit almost 15
years later.

Linus


2016-10-04 04:07:32

by Andrew Morton

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Mon, 3 Oct 2016 21:00:55 -0700 Linus Torvalds <[email protected]> wrote:

> In particular, I just got this
>
> kernel BUG at ./include/linux/swap.h:276

Well, it's a VM_BUG_ON and few people run with CONFIG_DEBUG_VM.

But a) something's clearly wrong and b) points taken.

2016-10-04 04:12:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Mon, Oct 3, 2016 at 9:07 PM, Andrew Morton <[email protected]> wrote:
>
> Well, it's a VM_BUG_ON and few people run with CONFIG_DEBUG_VM.

Ehh. If by "few people" you mean "pretty much everybody", you'd be
right, but your choice of wording would be somewhat misleading,
wouldn't you say?

Hint: here's a line from the standard Fedora kernel config:

CONFIG_DEBUG_VM=y

so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
just allows some people to build smaller kernels, but apparently
distro people would rather have debugging than save a few kB of RAM.

The VM debvugging code has VM_WARN_ON() and VM_WARN_ON_ONCE() for
people who want to get a "oops, my assumptions were wrong"

Killing machines because somebody made an assumption that was wrong is not ok.

Killing the machine is ok if we have a situation where there literally
is no other choice.

Linus

2016-10-04 07:03:30

by Raymond Jennings

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Mon, Oct 3, 2016 at 9:12 PM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Oct 3, 2016 at 9:07 PM, Andrew Morton
> <[email protected]> wrote:
>>
>> Well, it's a VM_BUG_ON and few people run with CONFIG_DEBUG_VM.
>
> Ehh. If by "few people" you mean "pretty much everybody", you'd be
> right, but your choice of wording would be somewhat misleading,
> wouldn't you say?
>
> Hint: here's a line from the standard Fedora kernel config:
>
> CONFIG_DEBUG_VM=y
>
> so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
> just allows some people to build smaller kernels, but apparently
> distro people would rather have debugging than save a few kB of RAM.
>
> The VM debvugging code has VM_WARN_ON() and VM_WARN_ON_ONCE() for
> people who want to get a "oops, my assumptions were wrong"
>
> Killing machines because somebody made an assumption that was wrong
> is not ok.
>
> Killing the machine is ok if we have a situation where there literally
> is no other choice.

For the curious:

This would include situations like

1. The kernel is confused and further processing would result in
undefined behavior (like bluesmoke detecting PCC for example)

2. Security hazards where we'd leak stuff if we don't shut down.

?

> Linus

2016-10-04 08:16:34

by Greg KH

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Mon, Oct 03, 2016 at 09:00:55PM -0700, Linus Torvalds wrote:
> I'm really sorry I applied that last series from Andrew just before
> doing the 4.8 release, because they cause problems, and now it is in
> 4.8 (and that buggy crap is marked for stable too).

I'll hold off in putting it into stable until this is fixed in your
tree...

thanks,

greg k-h

2016-10-04 09:32:27

by Johannes Weiner

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

Hi Linus,

On Mon, Oct 03, 2016 at 09:00:55PM -0700, Linus Torvalds wrote:
> Johannes? Please make this your first priority. And in the meantime I
> will make that VM_BUG_ON() be a VM_WARN_ON_ONCE().

I'm sorry about the trouble. I'll look into where the underflow comes
from that triggers this new check, and thanks for fixing it up in the
meantime.

> And dammit, if anybody else feels that they had done "debugging
> messages with BUG_ON()", I would suggest you
>
> (a) rethink your approach to programming
>
> (b) send me patches to remove the crap entirely, or make them real
> *DEBUGGING* messages, not "kill the whole machine" messages.

Yeah, it's what CONFIG_DEBUG_VM effectively is :( There are a few
instances where it adds additional output to dmesg or a stat file, but
the vast majority of uses is VM_BUG_ON(), VM_BUG_ON_PAGE() etc., which
all mean "if anything is unexpected, stop in your tracks and drop me
into a kdump kernel." People do think of it as a development tool
rather than verbose diagnostics in production kernels, so I assume
none of the conditions protected by that config option are curated for
whether the machine could or should continue to limp along after they
trigger, or at least much less so than the bare BUG vs WARN instances.

But I agree that most, if not all of these checks could warn under a
ratelimit and be more useful for bleeding edge situations like the
fedora kernel than the binary thinking of development kernels and
production kernels as separate things with no overlap. And kernel
people can set panic_on_warn if they want quick death and/or kdump.

In the workingset code, if we detect radix tree nodes in a state in
which they shouldn't be on the shadow node LRU, we could simply warn,
abort the reclaim process and leave them off the LRU. Something like
the below patch. I hate what it does to the codeflow, though, I don't
want half of the code to be clutter that handles shouldnt-bes. But it
also feels wrong to warn about a corrupted node and then keep working
on it against who knows what state. I wonder if that's the fundamental
dilemma that keeps us sniffing that assert() glue...

diff --git a/mm/workingset.c b/mm/workingset.c
index 617475f529f4..5f07db171c03 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,23 +418,28 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
* no pages, so we expect to be able to remove them all and
* delete and free the empty node afterwards.
*/
- BUG_ON(!workingset_node_shadows(node));
- BUG_ON(workingset_node_pages(node));
+ if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+ goto out_invalid;
+ if (WARN_ON_ONCE(workingset_node_pages(node)))
+ goto out_invalid;

for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
if (node->slots[i]) {
- BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+ if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
+ goto out_invalid;
node->slots[i] = NULL;
workingset_node_shadows_dec(node);
- BUG_ON(!mapping->nrexceptional);
+ if (WARN_ON_ONCE(!mapping->nrexceptional))
+ goto out_invalid;
mapping->nrexceptional--;
}
}
- BUG_ON(workingset_node_shadows(node));
+ if (WARN_ON_ONCE(workingset_node_shadows(node)))
+ goto out_invalid;
inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
- if (!__radix_tree_delete_node(&mapping->page_tree, node))
- BUG();
+ __radix_tree_delete_node(&mapping->page_tree, node);

+out_invalid:
spin_unlock(&mapping->tree_lock);
ret = LRU_REMOVED_RETRY;
out:

2016-10-04 16:03:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 4, 2016 at 12:03 AM, Raymond Jennings <[email protected]> wrote:
> On Mon, Oct 3, 2016 at 9:12 PM, Linus Torvalds <[email protected]> wrote:
>>
>>
>> Killing the machine is ok if we have a situation where there literally
>> is no other choice.
>
> For the curious:
>
> This would include situations like
>
> 1. The kernel is confused and further processing would result in undefined
> behavior (like bluesmoke detecting PCC for example)

Yes. Mainly situations where you cannot even have sane error handling,
so you can't just do "warn and return without doing anything". It
might be some "I noticed that the CPU stack is corrupt, I can't even
return, I will just have to terminate".

> 2. Security hazards where we'd leak stuff if we don't shut down.

Honestly, I can't think of a situation where that has actually happened.

Now, sometimes BUG_ON() is less onerous than other time: if you know
that you are in a regular process context where you're not holding
core locks, BUG_ON() is actually fairly benign: it will print a big
scary message, and then it will kill the current process. It's not
going to kill the machine, unless the admin has explicitly asked for
"reboot if you have issues", which is mainly a situation for the
googles of the world - if you have millions of machines and you don't
actually *care*, then rebooting is fine.

So realistically, the main places you should use BUG_ON() variants is

(a) development code where it replaces error handling that you just
haven't written yet, and you haven't really thought through all the
possibilities, so you're saying "this can't happen, I'll fix it
later".

It sounds like Andrew thought that that is what VM_BUG_ON() is, and
that it wouldn't be enabled unless you're a developer. But no, this is
a "RFC patch" kind of situation.

This kind of BUG_ON() often ends up escaping into the wild, but it
should be after *huge* amounts of testing, and by definition it should
never have been accepted during anythign but the merge window. So in a
very real sense it's really my bad for not reacting to the BUG_ON()
being added during rc8.

(b) very core code that actually verifies some very core assumptions
that are *so* important that if they are broken the code is by
definition not really able to function.

This is the actual intended case. It's not a "let's check that
everybody did things right", it's a "this is a major design rule in
this core code".

The example in workingset_node_shadows_dec() _could_ actually have
been that kind of (b) situation, except for the timing and lack of
deep testing. But a reasonable example of (b) would be something like
the

BUG_ON(!PageLocked(page));

kind of code in fs/buffer.c - it's core infrastructure that has been
tested with core code, and the BUG_ON() is meant to catch bad _new_
users quickly. And it's *such* a core requirement that error handling
doesn't even make sense.

Again, workingset_node_shadows_dec() could have a BUG_ON() in theory.
But the BUG_ON() is _wrong_ when we had a situation of "oh, we just
recently noticed a bug in this area, so lets' just verify that it's
really gone".

Notice? Just the timing and intent can make the difference between
"good BUG_ON() in solid code that has been around forever" and "bad
BUG_ON() checking something that we know we might be getting wrong".

Linus

2016-10-05 01:21:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 4, 2016 at 2:32 AM, Johannes Weiner <[email protected]> wrote:
>
> In the workingset code, if we detect radix tree nodes in a state in
> which they shouldn't be on the shadow node LRU, we could simply warn,
> abort the reclaim process and leave them off the LRU. Something like
> the below patch.

I don't hate that patch, but I wonder why the counts get corrupted and
the workingset_node_shadows_dec() thing triggered in the first place.

So I do think that the BUG_ON()'s there in shadow_lru_isolate() should
be removed, but since they haven't triggered I worry more abut the one
that has.

I've tried to follow the counting, and I don't see any obvious bugs in
the counting per se. I went as far as look where we even initialize
node->count.

Btw, whoever wrote that code liked the whole SLAB desctructor model a
lot too much. Initializing the fields as you free something is rather
silly from a cache use standpoint. You're just touching cachelines
that are almost guaranteed to be wasted. Why isn't that init code just
done at allocation time instead of in that radix_tree_node_rcu_free()
destructor?

But I couldn't see anything actively *buggy*, even if I think the code
is oddly structured.

So to debug that, I'd actually like to see something that adds a few
more warnings to try to catch *where* the count goes bad

For example, is it actually valid to free a radix_tree_node that has a
non-zero count? Shouldn't all the shadow entries have been removed?
The problem with the BUG_ON() at workingset_node_shadows_dec() time
isn't just that it killed the machine, it also doesn't actually give
very much information. The count has presumably been mis-done long
before..

Linus

2016-10-05 02:44:18

by Paul Gortmaker

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 4, 2016 at 12:00 AM, Linus Torvalds
<[email protected]> wrote:

[...]

> I should have reacted to the damn added BUG_ON() lines. I suspect I
> will have to finally just remove the idiotic BUG_ON() concept once and
> for all, because there is NO F*CKING EXCUSE to knowingly kill the
> kernel.

A couple years ago Ingo had an idea to kill BUG_ON abuse that I made
a 1st pass at. Back then it seemed nobody cared. Maybe that has since
changed?

https://lkml.org/lkml/2014/4/30/359

P.
--

2016-10-05 03:29:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 4, 2016 at 7:43 PM, Paul Gortmaker
<[email protected]> wrote:
>
> A couple years ago Ingo had an idea to kill BUG_ON abuse that I made
> a 1st pass at. Back then it seemed nobody cared. Maybe that has since
> changed?
>
> https://lkml.org/lkml/2014/4/30/359

So we actually have the checkpatch warning already:

# avoid BUG() or BUG_ON()
if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
my $msg_type = \&WARN;
$msg_type = \&CHK if ($file);
&{$msg_type}("AVOID_BUG",
"Avoid crashing the kernel - try
using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" .
$herecurr);
}

but it doesn't trigger on VM_BUG_ON().

And I'm not convinced about replacing things with BUG_ON_AND_HALT(),
it simply doesn't fix the existing issue we have: people use BUG_ON(),
and worse, _when_ they use BUG_ON(), they use it instead of error
handling, so the code _around_ the BUG_ON() tends to then very much
depend on what the BUG_ON() checks.

This is actually one way that VM_BUG_ON() is better: it's very much by
design something that can be compiled away, so at least hopefully
nobody thinks of it as a security measure. So we could just say that
we will treat VM_BUG_ON() as a WARN_ON_ONCE(), and just not kill the
machine.

Because I could easily imagine that somebody *does* treat BUG_ON()
that way, thinking "well, if that BUG_ON() triggers at least it won't
then go off the rails later".

In fact, right now we mark BUG() in such a way that gcc can even take
advantage of the "crash the machine" semantics, because we explicitly
mark the BUG() with "unreachable()".

So what I think we should think about is:

- extending the checkpatch warning to VM_BUG_ON too, to discourage new users.

- look at making BUG_ON() simply be less lethal. Remove the
unrechable(), reorganize how the string is stored, and make it act
more like WARN_ON_ONCE() instead (it's the "rewind_stack_do_exit()"
that ends up causing us to try to kill things, we *could* just try to
stop doing that).

- Instead of adding a BUG_ON_AND_HALT(), we could perhaps add a new
FATAL_ERROR() thing that acts like the current BUG_ON, and *not* call
it something similar (we don't want people doing mindless
conversions!). And that's the one that would do the whole
rewind_stack_do_exit() to kill the process.

But apart from the checkpatch thing, it's actually a pretty big change.

Linus

2016-10-05 05:44:32

by Willy Tarreau

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 04, 2016 at 08:29:00PM -0700, Linus Torvalds wrote:
> So what I think we should think about is:
>
> - extending the checkpatch warning to VM_BUG_ON too, to discourage new users.
>
> - look at making BUG_ON() simply be less lethal. Remove the
> unrechable(), reorganize how the string is stored, and make it act
> more like WARN_ON_ONCE() instead (it's the "rewind_stack_do_exit()"
> that ends up causing us to try to kill things, we *could* just try to
> stop doing that).
>
> - Instead of adding a BUG_ON_AND_HALT(), we could perhaps add a new
> FATAL_ERROR() thing that acts like the current BUG_ON, and *not* call
> it something similar (we don't want people doing mindless
> conversions!). And that's the one that would do the whole
> rewind_stack_do_exit() to kill the process.

I think instead we should completely remove any simple way to halt the
system and document how to do it. I've already seen some userland code
stuffed with thousands of assert() everywhere and their developers are
proud of this because their code looks clean and they show that they
care for all errors. But the cost of their stupidity doesn't seem to
affect them. Maybe they'll start to think about it the day they're
brought into a self-driven car and will realize that it'd better recover
from a failing flasher and not just crash in the middle of the highway.

Thus since their motives are just to easily write nice-looking code, I'd
simply force them to explicitly write their condition and the associated
printk() and panic() calls. It will become much more of a hassle and will
make their code less elegant, they should be much less tempted.

So I think that we'd rather run a huge sed all over the code to replace
BUG/BUG_ON with their WARN/WARN_ON equivalent. We'll very likely notice
a lot of new gcc warnings from code that was supposed not to every be
reachable, which will tell us a lot about some limited error checking
in these respective code parts.

Willy

2016-10-05 09:25:59

by Johannes Weiner

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 04, 2016 at 06:21:37PM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 2:32 AM, Johannes Weiner <[email protected]> wrote:
> >
> > In the workingset code, if we detect radix tree nodes in a state in
> > which they shouldn't be on the shadow node LRU, we could simply warn,
> > abort the reclaim process and leave them off the LRU. Something like
> > the below patch.
>
> I don't hate that patch, but I wonder why the counts get corrupted and
> the workingset_node_shadows_dec() thing triggered in the first place.

Okay, I kept digging, and I think I found the culprit: yes, the actual
bug is ancient (3.15), but also requires a fairly specific combination
of access pattern and memory pressure to trigger, which is why testing
didn't reveal it right away. The symptom before the sanity check was
an occasionally leaked radix tree node, which is mostly harmless and
hard to notice [ really drives home the BUG_ON point, doesn't it... ]

The problem is in the way we account shadow entries inside the upper
bits of node->count behind the back of the radix tree code. We don't
use the insert and delete functions directly from the page cache and
then do our own node->count updates. And other than that the radix
tree code mostly accounts only its own childpointers in node->count
and doesn't make assumptions about how user entries are accounted.
The notable exception is during tree extension: when there is a single
entry at index 0, it's stored as a direct pointer from root->rnode. A
fault at a higher index will copy that direct entry to a proper node
and do node->count = 1. That's wrong if the direct entry is a shadow.

So the scenario required to trigger this issue is this: you need a
file with only the first page faulted in and then evicted before
touching another page in the file. When that first page faults back,
that's when the counter underflows.

Here is a reproducer that triggers the warning instantly for me:

---

/*
* When index 0 is present, then evicted, rnode is a direct pointer to
* a shadow entry. Extending the tree from there does the wrong thing
* by setting node->count = 1; that's the page counter bits, it should
* be setting the upper shadow bits. When index 0 faults back and we
* try to remove the shadow entry, the shadow counter is 0.
*/
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

/* Set to slightly bigger than memory */
#define PRESSURE_SIZE (12UL << 30)

int main(void)
{
unsigned long off;
char buf[4096];
int fd1, fd2;

fd1 = open("test-twopages", O_CREAT|O_RDWR);
unlink("test-twopages");
ftruncate(fd1, 4096);

/* fault page at index 0 */
pread(fd1, buf, 4096, 0);

/* create pressure to plant shadow at fd1's index 0 (knock on wood) */
fd2 = open("test-pressure", O_CREAT|O_RDWR);
unlink("test-pressure");
ftruncate(fd2, PRESSURE_SIZE);
for (off = 0; off < PRESSURE_SIZE; off += 4096)
read(fd2, buf, 4096);
close(fd2);

/* fault page at index 1, extending tree sets wrong node->count */
ftruncate(fd1, 8192);
pread(fd1, buf, 4096, 4096);

/* refault index 0, deletes shadow with no shadows in node->count */
pread(fd1, buf, 4096, 0);

close(fd1);
return 0;
}

---

That radix tree node management needs some cleaning up. It probably
makes sense to split node->count into actually separate members for
clarity, and then add a root tag to distinguish shadows from regular
entries in root->rnode. I have to think about this more, the current
situation is too fragile and ugly.

But in the meantime, there is an obvious fix: don't ever store shadow
entries in root->rnode, seeing as we need nodes for proper accounting.

It means we temporarily lose the ability to detect refaults from
single-page files, but it's probably better to keep the stable fix
small and restore that functionality in a new release.

Patch below. NOTE: I'm traveling without access to my test rig right
now and so I have only lightly tested this on my laptop. I'm also
jetlagged like crazy, so please triple check my thinking. The patch
does fix the reproducer case and has otherwise been stable here.

---

>From 5e948543a814183f04c30b072707300a94deb6c7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Tue, 4 Oct 2016 22:02:08 +0200
Subject: [PATCH] mm: filemap: don't plant shadow entries without radix
tree node

When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis
industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss
nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48
83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f>
0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/radix-tree.h | 6 +++---
lib/radix-tree.c | 14 +++-----------
mm/filemap.c | 46 ++++++++++++++++++++++++++++++----------------
3 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4c45105dece3..52b97db93830 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -280,9 +280,9 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
-struct radix_tree_node *radix_tree_replace_clear_tags(
- struct radix_tree_root *root,
- unsigned long index, void *entry);
+void radix_tree_clear_tags(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot);
unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
void **results, unsigned long first_index,
unsigned int max_items);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 91f0727e3cad..8e6d552c40dd 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1583,15 +1583,10 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
}
EXPORT_SYMBOL(radix_tree_delete);

-struct radix_tree_node *radix_tree_replace_clear_tags(
- struct radix_tree_root *root,
- unsigned long index, void *entry)
+void radix_tree_clear_tags(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot)
{
- struct radix_tree_node *node;
- void **slot;
-
- __radix_tree_lookup(root, index, &node, &slot);
-
if (node) {
unsigned int tag, offset = get_slot_offset(node, slot);
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
@@ -1600,9 +1595,6 @@ struct radix_tree_node *radix_tree_replace_clear_tags(
/* Clear root node tags */
root->gfp_mask &= __GFP_BITS_MASK;
}
-
- radix_tree_replace_slot(slot, entry);
- return node;
}

/**
diff --git a/mm/filemap.c b/mm/filemap.c
index c17395825650..5bcadb6471bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -169,33 +169,35 @@ static int page_cache_tree_insert(struct address_space *mapping,
static void page_cache_tree_delete(struct address_space *mapping,
struct page *page, void *shadow)
{
- struct radix_tree_node *node;
int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(nr != 1 && shadow, page);

- if (shadow) {
- mapping->nrexceptional += nr;
- /*
- * Make sure the nrexceptional update is committed before
- * the nrpages update so that final truncate racing
- * with reclaim does not see both counters 0 at the
- * same time and miss a shadow entry.
- */
- smp_wmb();
- }
- mapping->nrpages -= nr;
-
for (i = 0; i < nr; i++) {
- node = radix_tree_replace_clear_tags(&mapping->page_tree,
- page->index + i, shadow);
+ struct radix_tree_node *node;
+ void **slot;
+
+ __radix_tree_lookup(&mapping->page_tree, page->index + i,
+ &node, &slot);
+
+ radix_tree_clear_tags(&mapping->page_tree, node, slot);
+
if (!node) {
VM_BUG_ON_PAGE(nr != 1, page);
- return;
+ /*
+ * We need a node to properly account shadow
+ * entries. Don't plant any without. XXX
+ */
+ shadow = NULL;
}

+ radix_tree_replace_slot(slot, shadow);
+
+ if (!node)
+ break;
+
workingset_node_pages_dec(node);
if (shadow)
workingset_node_shadows_inc(node);
@@ -219,6 +221,18 @@ static void page_cache_tree_delete(struct address_space *mapping,
&node->private_list);
}
}
+
+ if (shadow) {
+ mapping->nrexceptional += nr;
+ /*
+ * Make sure the nrexceptional update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ }
+ mapping->nrpages -= nr;
}

/*
--
2.10.0

2016-10-05 09:31:54

by Johannes Weiner

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 05, 2016 at 11:25:34AM +0200, Johannes Weiner wrote:
> Patch below. NOTE: I'm traveling without access to my test rig right
> now and so I have only lightly tested this on my laptop. I'm also
> jetlagged like crazy, so please triple check my thinking. The patch
> does fix the reproducer case and has otherwise been stable here.

There is an issue I spotted in the more recent fuse fix. Same caveat
applies for now regarding testing and brain function, but it should be
a relatively obvious one.

---

>From eaf606185719acaf66412413b7dbac5c683efe11 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Tue, 4 Oct 2016 16:58:06 +0200
Subject: [PATCH] mm: filemap: fix mapping->nrpages double accounting in
fuse

22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache()
from raw radix tree operations to page_cache_tree_insert() but didn't
take into account that the latter function, unlike the raw radix tree
op, handles mapping->nrpages. As a result, that counter is bumped for
each page replacement rather than balanced out even.

The mapping->nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, syncing inodes without pages, as well
as statistics for userspace. Since the error is positive, we'll do
more page cache tree walks than necessary; we won't miss a necessary
one. And we'll report more buffer pages to userspace than there
are. The error is limited to fuse inodes.

Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2d0986a64f1f..c17395825650 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -619,7 +619,6 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
__delete_from_page_cache(old, NULL);
error = page_cache_tree_insert(mapping, new, NULL);
BUG_ON(error);
- mapping->nrpages++;

/*
* hugetlb pages do not participate in page cache accounting.
--
2.10.0

2016-10-05 10:40:34

by Jan Kara

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed 05-10-16 11:25:34, Johannes Weiner wrote:
> From 5e948543a814183f04c30b072707300a94deb6c7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Tue, 4 Oct 2016 22:02:08 +0200
> Subject: [PATCH] mm: filemap: don't plant shadow entries without radix
> tree node
>
> When the underflow checks were added to workingset_node_shadow_dec(),
> they triggered immediately:

FWIW the patch looks good to me. So you can add:

Reviewed-by: Jan Kara <[email protected]>

And I agree that the games workingset code plays with node->count are
really fragile so it would be better to change that.
Honza

>
> kernel BUG at ./include/linux/swap.h:276!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
> nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
> nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
> soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis
> industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc dm_crypt
> CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
> Hardware name: System manufacturer System Product Name/Z170-K, BIOS
> 1803 05/06/2016
> task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
> RIP: page_cache_tree_insert+0xf1/0x100
> RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
> RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
> RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
> R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
> R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
> FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> __add_to_page_cache_locked+0x12e/0x270
> add_to_page_cache_lru+0x4e/0xe0
> mpage_readpages+0x112/0x1d0
> blkdev_readpages+0x1d/0x20
> __do_page_cache_readahead+0x1ad/0x290
> force_page_cache_readahead+0xaa/0x100
> page_cache_sync_readahead+0x3f/0x50
> generic_file_read_iter+0x5af/0x740
> blkdev_read_iter+0x35/0x40
> __vfs_read+0xe1/0x130
> vfs_read+0x96/0x130
> SyS_read+0x55/0xc0
> entry_SYSCALL_64_fastpath+0x13/0x8f
> Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48
> 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f>
> 0b e8 88 68 ef ff 0f 1f 84 00
> RIP page_cache_tree_insert+0xf1/0x100
>
> This is a long-standing bug in the way shadow entries are accounted in
> the radix tree nodes. The shrinker needs to know when radix tree nodes
> contain only shadow entries, no pages, so node->count is split in half
> to count shadows in the upper bits and pages in the lower bits.
>
> Unfortunately, the radix tree implementation doesn't know of this and
> assumes all entries are in node->count. When there is a shadow entry
> directly in root->rnode and the tree is later extended, the radix tree
> implementation will copy that entry into the new node and and bump its
> node->count, i.e. increases the page count bits. Once the shadow gets
> removed and we subtract from the upper counter, node->count underflows
> and triggers the warning. Afterwards, without node->count reaching 0
> again, the radix tree node is leaked.
>
> Limit shadow entries to when we have actual radix tree nodes and can
> count them properly. That means we lose the ability to detect refaults
> from files that had only the first page faulted in at eviction time.
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/radix-tree.h | 6 +++---
> lib/radix-tree.c | 14 +++-----------
> mm/filemap.c | 46 ++++++++++++++++++++++++++++++----------------
> 3 files changed, 36 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4c45105dece3..52b97db93830 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -280,9 +280,9 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
> struct radix_tree_node *node);
> void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> void *radix_tree_delete(struct radix_tree_root *, unsigned long);
> -struct radix_tree_node *radix_tree_replace_clear_tags(
> - struct radix_tree_root *root,
> - unsigned long index, void *entry);
> +void radix_tree_clear_tags(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot);
> unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
> void **results, unsigned long first_index,
> unsigned int max_items);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 91f0727e3cad..8e6d552c40dd 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1583,15 +1583,10 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
> }
> EXPORT_SYMBOL(radix_tree_delete);
>
> -struct radix_tree_node *radix_tree_replace_clear_tags(
> - struct radix_tree_root *root,
> - unsigned long index, void *entry)
> +void radix_tree_clear_tags(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot)
> {
> - struct radix_tree_node *node;
> - void **slot;
> -
> - __radix_tree_lookup(root, index, &node, &slot);
> -
> if (node) {
> unsigned int tag, offset = get_slot_offset(node, slot);
> for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
> @@ -1600,9 +1595,6 @@ struct radix_tree_node *radix_tree_replace_clear_tags(
> /* Clear root node tags */
> root->gfp_mask &= __GFP_BITS_MASK;
> }
> -
> - radix_tree_replace_slot(slot, entry);
> - return node;
> }
>
> /**
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c17395825650..5bcadb6471bf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -169,33 +169,35 @@ static int page_cache_tree_insert(struct address_space *mapping,
> static void page_cache_tree_delete(struct address_space *mapping,
> struct page *page, void *shadow)
> {
> - struct radix_tree_node *node;
> int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
>
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(PageTail(page), page);
> VM_BUG_ON_PAGE(nr != 1 && shadow, page);
>
> - if (shadow) {
> - mapping->nrexceptional += nr;
> - /*
> - * Make sure the nrexceptional update is committed before
> - * the nrpages update so that final truncate racing
> - * with reclaim does not see both counters 0 at the
> - * same time and miss a shadow entry.
> - */
> - smp_wmb();
> - }
> - mapping->nrpages -= nr;
> -
> for (i = 0; i < nr; i++) {
> - node = radix_tree_replace_clear_tags(&mapping->page_tree,
> - page->index + i, shadow);
> + struct radix_tree_node *node;
> + void **slot;
> +
> + __radix_tree_lookup(&mapping->page_tree, page->index + i,
> + &node, &slot);
> +
> + radix_tree_clear_tags(&mapping->page_tree, node, slot);
> +
> if (!node) {
> VM_BUG_ON_PAGE(nr != 1, page);
> - return;
> + /*
> + * We need a node to properly account shadow
> + * entries. Don't plant any without. XXX
> + */
> + shadow = NULL;
> }
>
> + radix_tree_replace_slot(slot, shadow);
> +
> + if (!node)
> + break;
> +
> workingset_node_pages_dec(node);
> if (shadow)
> workingset_node_shadows_inc(node);
> @@ -219,6 +221,18 @@ static void page_cache_tree_delete(struct address_space *mapping,
> &node->private_list);
> }
> }
> +
> + if (shadow) {
> + mapping->nrexceptional += nr;
> + /*
> + * Make sure the nrexceptional update is committed before
> + * the nrpages update so that final truncate racing
> + * with reclaim does not see both counters 0 at the
> + * same time and miss a shadow entry.
> + */
> + smp_wmb();
> + }
> + mapping->nrpages -= nr;
> }
>
> /*
> --
> 2.10.0
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-05 15:53:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 4, 2016 at 10:44 PM, Willy Tarreau <[email protected]> wrote:
>
> I think instead we should completely remove any simple way to halt the
> system and document how to do it.

Having slept on it, I suspect you're right. I worry about some
BUG_ON() that really relies on the killing behavior, but if it takes a
"real" fault later, that is when it gets killed. And on the whole,
we've had lots of problems with the killing behavior over the years,
so we should just try switching BUG_ON() over to non-fatal. It's
unlikely to be worse than what we have now, as exemplified by this
event.

Linus

2016-10-05 16:11:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 2:25 AM, Johannes Weiner <[email protected]> wrote:
>
> Here is a reproducer that triggers the warning instantly for me:

Yup, confirmed.With the VM_WARN_ON_ONCE() it just gets a big nice
splat and the machine happily stays up.

> That radix tree node management needs some cleaning up. It probably
> makes sense to split node->count into actually separate members for
> clarity, and then add a root tag to distinguish shadows from regular
> entries in root->rnode. I have to think about this more, the current
> situation is too fragile and ugly.

Ugh. I even looked at the "node->count = 1" initialization in
radix_tree_extend(), and didn't react to it at all, it looked
obviously correct.

This code is too subtle.

> But in the meantime, there is an obvious fix: don't ever store shadow
> entries in root->rnode, seeing as we need nodes for proper accounting.
>
> It means we temporarily lose the ability to detect refaults from
> single-page files, but it's probably better to keep the stable fix
> small and restore that functionality in a new release.
>
> Patch below. NOTE: I'm traveling without access to my test rig right
> now and so I have only lightly tested this on my laptop. I'm also
> jetlagged like crazy, so please triple check my thinking. The patch
> does fix the reproducer case and has otherwise been stable here.

Hmm. I'm inclined to just apply it and mark it for stable, along with
your other patch. But yes, this needs more thinking about (and
obviously testing). The interactions with the radix tree are too
subtle.

Linus

2016-10-05 17:01:04

by Joe Perches

[permalink] [raw]
Subject: [PATCH] checkpatch: extend BUG warning

Include the prefixed and postfixed BUG/BUG_ON variants like VM_BUG_ON
that can kill the kernel. Exclude the BUILD and lower case variants.

$ grep -rP --include=*.[ch] -oh "\w+_BUG_ON\w*" * | \
sort | uniq -c | sort -rn
? 1838 BUILD_BUG_ON
? 632 snd_BUG_ON
? 263 VM_BUG_ON_PAGE
? 260 VM_BUG_ON
? 97 GEM_BUG_ON
? 57 SNIC_BUG_ON
? 49 EFX_BUG_ON_PARANOID
? 45 BUILD_BUG_ON_ZERO
? 45 BUILD_BUG_ON_MSG
? 30 B43legacy_BUG_ON
? 24 VM_BUG_ON_VMA
? 21 BUILD_BUG_ON_NOT_POWER_OF_2
? 16 PCPU_SETUP_BUG_ON
? 15 GLOCK_BUG_ON
? 12 LOGFS_BUG_ON
? 11 RWLOCK_BUG_ON
? 10 VIRTUAL_BUG_ON
? 9 SPIN_BUG_ON
? 9 BUILD_BUG_ON_INVALID
? 8 UNWINDER_BUG_ON
? 8 DEBUG_SPINLOCK_BUG_ON
? 7 VM_BUG_ON_MM
? 7 LIST_BL_BUG_ON
? 6 BUILD_BUG_ON_NULL
? 5 MAYBE_BUILD_BUG_ON
? 4 VM_BUG_ON_PGFLAGS
? 4 SSB_BUG_ON
? 4 HAVE_ARCH_BUG_ON
? 4 BIO_BUG_ON
? 3 DCCP_BUG_ON
? 3 __BUILD_BUG_ON_NOT_POWER_OF_2
? 2 __BUG_ON_cond
? 2 __BUG_ON
? 1 CONFIG_BUG_ON_DATA_CORRUPTION

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Joe Perches <[email protected]>
---
?scripts/checkpatch.pl | 5 +++--
?1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 3373c65fef1c..be796704e218 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -3725,8 +3725,9 @@ sub process {
}
}

-# avoid BUG() or BUG_ON()
- if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
+# avoid BUG() or BUG_ON() like mechanisms (exclude BUILD_BUG)
+ if ($line =~ /\b((?:[A-Z_]+_)?BUG(?:_ON(?:_[A-Z_]+)?))\s*\(/ &&
+ ????$1 !~ /BUILD/) {
my $msg_type = \&WARN;
$msg_type = \&CHK if ($file);
&{$msg_type}("AVOID_BUG",

2016-10-05 17:07:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] checkpatch: extend BUG warning

On Wed, Oct 5, 2016 at 10:00 AM, Joe Perches <[email protected]> wrote:
> 1838 BUILD_BUG_ON
> 632 snd_BUG_ON

At least snd_BUG_ON() is actually fine. The sound system does
something sane with their bugs: they just turn into WARN_ON()'s (and
then !CONFIG_SND_DEBUG will disable them entirely).

I haven't checked the other ones, but it was actually nice to see that
one: considering that it's the most widely used one (ignoring the
obviously great BUILD_BUG_ON), the fact that it just spews a warning
indicates that that model works fine.

Linus

2016-10-05 19:06:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 05, 2016 at 08:52:54AM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 10:44 PM, Willy Tarreau <[email protected]> wrote:
> >
> > I think instead we should completely remove any simple way to halt the
> > system and document how to do it.
>
> Having slept on it, I suspect you're right. I worry about some
> BUG_ON() that really relies on the killing behavior, but if it takes a
> "real" fault later, that is when it gets killed. And on the whole,
> we've had lots of problems with the killing behavior over the years,
> so we should just try switching BUG_ON() over to non-fatal. It's
> unlikely to be worse than what we have now, as exemplified by this
> event.

I have the same doubts, so at least I would not want to run the "sed"
immediately, at least to keep the initial intent. But I think everyone
is right in is own yard when he puts a BUG_ON() when he doesn't know
how to handle an unsafe situation, he's wrong from a global perspective.

For example, it could be seen as safe to crash the system in a filesystem
driver to protect against the risk of data corruption resulting from an
impossible condition, but when this happens due to a dirty FS on a USB
stick that a person inserts on the PC to save her work, actually the
BUG_ON() is the one responsible for the data loss. Even something as
painful as leaving a process in D state in this situation would have
been cleaner as it would let the admin reboot when he wants and not
have to experience it at the worst moment.

I've already met 100% reproducible panics that I never had the time to
inestigate (one involving running an mmap-based hex editor on /dev/mem,
and the other one doing stupid things with mount --move), and I'm sure
once I find the cause I'll see a BUG_ON() that should have been a warning.

I'm pretty sure there are historically valid BUG_ON() that are probably
not needed anymore just like I'm also convinced that some of them are
hard to get rid of. Maybe at least having the same as WARN_ON() but
prepending the dump with a message saying "you encountered a critical
bug which should have crashed the kernel, you must absolutely report it"
would help at the beginning.

Cheers,
Willy

2016-10-05 19:18:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 12:06 PM, Willy Tarreau <[email protected]> wrote:
>
> I have the same doubts, so at least I would not want to run the "sed"
> immediately, at least to keep the initial intent. But I think everyone
> is right in is own yard when he puts a BUG_ON() when he doesn't know
> how to handle an unsafe situation, he's wrong from a global perspective.

Yes. And as you say, even when the developer might be right in sone
situations, you'd easily still be wrong for the same code in some
other situation.

Quite frankly, I wouldn't do a sed-script pass to actually change
existing users. I'd just change how the BUG() implementation itself
works. Not make it a direct WARN_ON(), but perhaps something like

- use WARN_ON() with a global rate limiter (we do *not* want BUG
cascades, but re-enable the warning after a few minutes)

- have some kernel command line option for the server people to allow
them to just force a reboot for it

Hmm?

Anybody want to play with it?

Linus

2016-10-05 21:14:10

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 12:18 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Oct 5, 2016 at 12:06 PM, Willy Tarreau <[email protected]> wrote:
>>
>> I have the same doubts, so at least I would not want to run the "sed"
>> immediately, at least to keep the initial intent. But I think everyone
>> is right in is own yard when he puts a BUG_ON() when he doesn't know
>> how to handle an unsafe situation, he's wrong from a global perspective.
>
> Yes. And as you say, even when the developer might be right in sone
> situations, you'd easily still be wrong for the same code in some
> other situation.

I just want to chime and and confirm that we really don't want to just
wholesale replace BUG with WARN. Most situations using BUG (whether or
not they should be) are totally unprepared to continue execution.
Which means we'd just get some memory trap or bizarre crash after the
WARN instead of the "clean" BUG behavior.

> Quite frankly, I wouldn't do a sed-script pass to actually change
> existing users. I'd just change how the BUG() implementation itself
> works. Not make it a direct WARN_ON(), but perhaps something like
>
> - use WARN_ON() with a global rate limiter (we do *not* want BUG
> cascades, but re-enable the warning after a few minutes)
>
> - have some kernel command line option for the server people to allow
> them to just force a reboot for it
>
> Hmm?
>
> Anybody want to play with it?

We absolutely have a granularity problem, but we have to retain the
no-continued-execution nature of BUG() users. The problem with BUG()
is that it is so context-sensitive. In the case you hit, killing the
process and continuing life fundamentally failed and the entire system
fell over. That wasn't the intent, obviously, but that BUG() got
effectively "promoted" to panic().

The cases where I've used BUG() are entirely about doing two things:
reporting the current state of the CPU and call stack and to kill the
process. (And I'd like to add a third: passing a meaningful string,
which right now has to happen with a separate pr_*() call that appears
outside the "cut here" line that x86 produces on a BUG.)

Now, it can be argued that killing the process part should be
configurable and that the code should be written to handle a WARN and
clean up and error out nicely. But I still want to retain the "kill
the process immediately" behavior in some capacity.

The implementation of BUG is also arch-specific, which is frustrating
to make changes on.

So, maybe another question is "when does BUG kill the system and not
just the process?" And can we detect these like we already detect bad
locking, interrupt contexts, etc? (Is this question going to have an
arch-specific answer?)

-Kees

--
Kees Cook
Nexus Security

2016-10-05 21:13:52

by Willy Tarreau

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 05, 2016 at 12:18:51PM -0700, Linus Torvalds wrote:
> Quite frankly, I wouldn't do a sed-script pass to actually change
> existing users. I'd just change how the BUG() implementation itself
> works. Not make it a direct WARN_ON(), but perhaps something like
>
> - use WARN_ON() with a global rate limiter (we do *not* want BUG
> cascades, but re-enable the warning after a few minutes)

That's interesting, I had exactly this discussion at kernel recipes
last week with someone complaining that when warnings scroll, you
only see the last ones while only the first one is useful. I guess
in most situations we don't even need a rate limiter, just print a
single dump and wait 2 minutes or so for the person in front of the
screen to have the time to take a photo.

> - have some kernel command line option for the server people to allow
> them to just force a reboot for it

Good point for not doing the sed.

> Hmm?
>
> Anybody want to play with it?

Many people will be much more efficient than me at doing it and even
testing it so I won't volunteer here.

Willy

2016-10-05 21:46:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 2:14 PM, Kees Cook <[email protected]> wrote:
> Now, it can be argued that killing the process part should be
> configurable and that the code should be written to handle a WARN and
> clean up and error out nicely. But I still want to retain the "kill
> the process immediately" behavior in some capacity.

If "some capacity" is "can't do user space accesses", we could easily
force a SIGKILL of the current process. It won't die immediately in
the kernel, but it won't be returning to user space either.

The problem with the immediate kill is that it can be in interrupt
context, or just holding arbitrary locks. And it's hard to even tell
dynamically (sometimes you can see it: with preemption enabled you can
tell "am I in a non-preempt area", for example, but it ends up
depending on config options).

And *if* we make BUG() actually do something sane (non-trapping), we
can easily make it be generic, not arch-specific. In fact, I'd
implement it by just adding a "handle_bug()" in kernel/panic.c...

Linus

2016-10-05 22:17:10

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 2:46 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Oct 5, 2016 at 2:14 PM, Kees Cook <[email protected]> wrote:
>> Now, it can be argued that killing the process part should be
>> configurable and that the code should be written to handle a WARN and
>> clean up and error out nicely. But I still want to retain the "kill
>> the process immediately" behavior in some capacity.
>
> If "some capacity" is "can't do user space accesses", we could easily
> force a SIGKILL of the current process. It won't die immediately in
> the kernel, but it won't be returning to user space either.

With my more paranoid desires, I would prefer to keep "stop kernel
execution with the state set up by this process", not just "make the
process never return to user-space". I would need to meditate on
whether what I really want is just "panic on Oops" or not, though.
Right now, for example, I don't use panic-on-oops when running lkdtm
tests since each test gets (correctly) killed and the Oops can be
examined for the expected failure mode, all without bringing down the
entire system.

> The problem with the immediate kill is that it can be in interrupt
> context, or just holding arbitrary locks. And it's hard to even tell
> dynamically (sometimes you can see it: with preemption enabled you can
> tell "am I in a non-preempt area", for example, but it ends up
> depending on config options).

Yeah, I've seen some hilarious failure modes while building lkdtm
tests for various kernel self-protections.

> And *if* we make BUG() actually do something sane (non-trapping), we
> can easily make it be generic, not arch-specific. In fact, I'd
> implement it by just adding a "handle_bug()" in kernel/panic.c...

Yeah, I'm not sure what the right next step would be. Do we need a new
set of functions between WARN and BUG? Or maybe extract the
process-killing logic on a per-arch level and make it a specific API
so that it can be explicitly called as part of error-handling? Hmm

-Kees

--
Kees Cook
Nexus Security

2016-10-05 22:29:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 3:17 PM, Kees Cook <[email protected]> wrote:
>
> With my more paranoid desires, I would prefer to keep "stop kernel
> execution with the state set up by this process", not just "make the
> process never return to user-space".

Quite honestly, I think the answer to that is: "No. Not by default".
So with some kind of kernel command line option, yes, kind of like
"reboot_on_oops" (or whatever it is - I've never used it ;)

>> And *if* we make BUG() actually do something sane (non-trapping), we
>> can easily make it be generic, not arch-specific. In fact, I'd
>> implement it by just adding a "handle_bug()" in kernel/panic.c...
>
> Yeah, I'm not sure what the right next step would be. Do we need a new
> set of functions between WARN and BUG? Or maybe extract the
> process-killing logic on a per-arch level and make it a specific API
> so that it can be explicitly called as part of error-handling? Hmm

So the process-killing logic actually used to historically just be
"call do_exit()". In fact, that's what most architectures still do in
their error paths. And it's what a lot of people who just want to kill
the current code do.

So calling "do_exit()" is actually perfectly fine. It's just that
calling do_exit() from BUG_ON() is a major pain, because of the
asynchronous nature of BUG_ON(). But if you are in a regular system
call and don't hold any locks, do_exit() is still fine.

In fact, all that x86 really does differently from do_exit() in the
fault path is to reset the stack pointer first, so that you don't get
stack smashers when you have recursive faults (which used to be one
really nasty failure case, not just with BUG_ON(), but with any kernel
oops in general). So on x86, the crash code actually calls a function
called "rewind_stack_do_exit()" instead.

But the name gives it away: it's the exact same thing.

So you can actually do a generic BUG_ON() (even with the current
semantics) pretty much today by just having a config option that the
architecture can set to specify whether you should just call
"do_exit()" or "rewind_stack_do_exit()" to do that final killing
action.

There's a few other possible gotcha's (the code is hard to follow
because the normal implementation uses a trapping instruction and
hides the BUG() information in the text, so you get the whole fault
path), but on the whole I think it should be fairly straightforward do
just get rid of all the arch code, and replace it with a generic
function that can then decide internally whether it wants to just
warn, whether it wants to SIGKILL, or whether it wants to do the
traditional thing and just force do_exit(). Or do new things like
reboot or just halt.

But it really would be very nice to never have do_exit() have to worry
about odd callers. We've had a *lot* of trouble over the years with
deadlocks on critical locks in do_exit(), for example.

Linus

2016-10-06 02:00:03

by Dave Chinner

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Tue, Oct 04, 2016 at 08:29:00PM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 7:43 PM, Paul Gortmaker
> <[email protected]> wrote:
> >
> > A couple years ago Ingo had an idea to kill BUG_ON abuse that I made
> > a 1st pass at. Back then it seemed nobody cared. Maybe that has since
> > changed?
> >
> > https://lkml.org/lkml/2014/4/30/359
>
> So we actually have the checkpatch warning already:
>
> # avoid BUG() or BUG_ON()
> if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
> my $msg_type = \&WARN;
> $msg_type = \&CHK if ($file);
> &{$msg_type}("AVOID_BUG",
> "Avoid crashing the kernel - try
> using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" .
> $herecurr);
> }
>
> but it doesn't trigger on VM_BUG_ON().
>
> And I'm not convinced about replacing things with BUG_ON_AND_HALT(),
> it simply doesn't fix the existing issue we have: people use BUG_ON(),
> and worse, _when_ they use BUG_ON(), they use it instead of error
> handling, so the code _around_ the BUG_ON() tends to then very much
> depend on what the BUG_ON() checks.
>
> This is actually one way that VM_BUG_ON() is better: it's very much by
> design something that can be compiled away, so at least hopefully
> nobody thinks of it as a security measure. So we could just say that
> we will treat VM_BUG_ON() as a WARN_ON_ONCE(), and just not kill the
> machine.

In XFS, we use ASSERT() (could be XFS_BUG_ON() for all
that the name matters) but we only define that to BUG_ON if
CONFIG_XFS_DEBUG=y.

For "production debug" kernels we have CONFIG_XFS_WARN=y, which
turns ASSERT() into WARN_ON(). We get the warnings, but none of the
crashiness that are desirable in a development context. This is what
distro debug kernels should be using, as it also ensures
we don't build in the real debug code that does things that would
affect prodution systems adversely, like randomly take different
allocator paths to ensure we get code coverage of all the allocator
algorithms...

i.e. production kernels ship with neither set, the debug kernel
ships with CONFIG_XFS_WARN=y, and we do all our development with
CONFIG_XFS_DEBUG=y.

I think this case falls into the "production debug" classification;
we want a warning, but we don't want the system to be taken down....

> But apart from the checkpatch thing, it's actually a pretty big change.

Yeah, that's why we added CONFIG_XFS_WARN=y to do this - it was a 20
line change to add XFS_CONFIG_WARN instead of having to audit and
modify ~1800 call sites to do something differently. And because we
know that ASSERT() is not present in all kernels, it isn't ever used
as a replacement for error handling. Perhaps that's the simplest
solution here as well....

Just my 2c worth.

-Dave.
--
Dave Chinner
[email protected]

2016-10-06 02:12:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 6:59 PM, Dave Chinner <[email protected]> wrote:
>
> In XFS, we use ASSERT() (could be XFS_BUG_ON() for all
> that the name matters) but we only define that to BUG_ON if
> CONFIG_XFS_DEBUG=y.
>
> For "production debug" kernels we have CONFIG_XFS_WARN=y, which
> turns ASSERT() into WARN_ON(). We get the warnings, but none of the
> crashiness that are desirable in a development context.

Yes. that sounds very much like the right kind of decision.

Forcing crashes can be very useful for the actual developer that is
doing development on the code itself, kind of a "fail fast, fail
hard".

But users (or developers that are developing something _else_ than XFS
;) don't tend to like it.

Linus

2016-10-06 22:07:08

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Wed, Oct 5, 2016 at 3:29 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Oct 5, 2016 at 3:17 PM, Kees Cook <[email protected]> wrote:
>>
>> With my more paranoid desires, I would prefer to keep "stop kernel
>> execution with the state set up by this process", not just "make the
>> process never return to user-space".
>
> Quite honestly, I think the answer to that is: "No. Not by default".
> So with some kind of kernel command line option, yes, kind of like
> "reboot_on_oops" (or whatever it is - I've never used it ;)
>
>>> And *if* we make BUG() actually do something sane (non-trapping), we
>>> can easily make it be generic, not arch-specific. In fact, I'd
>>> implement it by just adding a "handle_bug()" in kernel/panic.c...
>>
>> Yeah, I'm not sure what the right next step would be. Do we need a new
>> set of functions between WARN and BUG? Or maybe extract the
>> process-killing logic on a per-arch level and make it a specific API
>> so that it can be explicitly called as part of error-handling? Hmm
>
> So the process-killing logic actually used to historically just be
> "call do_exit()". In fact, that's what most architectures still do in
> their error paths. And it's what a lot of people who just want to kill
> the current code do.
>
> So calling "do_exit()" is actually perfectly fine. It's just that
> calling do_exit() from BUG_ON() is a major pain, because of the
> asynchronous nature of BUG_ON(). But if you are in a regular system
> call and don't hold any locks, do_exit() is still fine.

Well, that's the problem I repeatedly ran into: locks. And actually,
more than that. Even in seccomp when trying to do a clean death of a
process (without the paranoid requirement that kernel execution for
that syscall stop), I've run into problems with do_exit() -- in that
case it was bad ptrace state assumptions during signal delivery. I
originally used do_exit with the hardened usercopy stuff and that
would fall all over itself under lkdtm testing, since locks were held.
The "cleanest" way to handle it seemed to be the lock-busting logic
already built into BUG, so I moved to that.

> In fact, all that x86 really does differently from do_exit() in the
> fault path is to reset the stack pointer first, so that you don't get
> stack smashers when you have recursive faults (which used to be one
> really nasty failure case, not just with BUG_ON(), but with any kernel
> oops in general). So on x86, the crash code actually calls a function
> called "rewind_stack_do_exit()" instead.
>
> But the name gives it away: it's the exact same thing.
>
> So you can actually do a generic BUG_ON() (even with the current
> semantics) pretty much today by just having a config option that the
> architecture can set to specify whether you should just call
> "do_exit()" or "rewind_stack_do_exit()" to do that final killing
> action.
>
> There's a few other possible gotcha's (the code is hard to follow
> because the normal implementation uses a trapping instruction and
> hides the BUG() information in the text, so you get the whole fault
> path), but on the whole I think it should be fairly straightforward do
> just get rid of all the arch code, and replace it with a generic
> function that can then decide internally whether it wants to just
> warn, whether it wants to SIGKILL, or whether it wants to do the
> traditional thing and just force do_exit(). Or do new things like
> reboot or just halt.
>
> But it really would be very nice to never have do_exit() have to worry
> about odd callers. We've had a *lot* of trouble over the years with
> deadlocks on critical locks in do_exit(), for example.

Right, so I see a number of "features" that are mixed together between
do_exit, WARN, BUG, and panic, and it seems code wants various
combinations of them:

- report details about an unexpected condition occurring (BUG,
actually, does NOT do this: it doesn't take any kind of format
argument, but arch-specific internal traps DO, before calling the rest
of the internals that BUG is wired up to.)
- dump CPU and stack state
- kill current
- kill current's entire thread group
- stop kernel execution from continuing
- halt the system
- reboot the system

By far the most problematic is "stop kernel execution from
continuing", but that's currently the behavior that BUG depends on, so
replacing BUG with anything needs to either fix the surrounding logic
to fail sanely or we have the keep the feature.

I remain convinced, though, that "stop this thread of kernel
execution" is not the same as "reboot the system".

-Kees

--
Kees Cook
Nexus Security

2016-10-06 22:29:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Thu, Oct 6, 2016 at 3:07 PM, Kees Cook <[email protected]> wrote:
> The "cleanest" way to handle it seemed to be the lock-busting logic
> already built into BUG, so I moved to that.

Heh. The lock-busting logic in BUG() has always been broken. It's been
random hacks. It doesn't actually work in any general case, it just
occasionally happens to get things right. Mostly it tries to handle
the console locking (the whole "oops_in_progress" magic) so that if
you have a BUG_ON() in bad areas, at least you still end up getting
output.

But no, it's not reliable in any way, shape or form. That's really why
you want to continue after a BUG().

> By far the most problematic is "stop kernel execution from
> continuing", but that's currently the behavior that BUG depends on, so
> replacing BUG with anything needs to either fix the surrounding logic
> to fail sanely or we have the keep the feature.

Well, I'm not sure how much we actually end up depending on it,
considering that we now have two examples of BUG() implementations
that actually do _not_ depend on stopping execution: both the sound
subsystem and the XFS version of BUG_ON() end up not actually doing
the BUG() thing.

Linus

2016-10-06 23:06:09

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Thu, Oct 6, 2016 at 3:29 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Oct 6, 2016 at 3:07 PM, Kees Cook <[email protected]> wrote:
>> The "cleanest" way to handle it seemed to be the lock-busting logic
>> already built into BUG, so I moved to that.
>
> Heh. The lock-busting logic in BUG() has always been broken. It's been
> random hacks. It doesn't actually work in any general case, it just
> occasionally happens to get things right. Mostly it tries to handle
> the console locking (the whole "oops_in_progress" magic) so that if
> you have a BUG_ON() in bad areas, at least you still end up getting
> output.

It seems to handle other things too, file descriptors, I think? Some
giant warning, I think about fds, went away when I switched from
do_exit() to BUG(). I'd have to go look more closely.

> But no, it's not reliable in any way, shape or form. That's really why
> you want to continue after a BUG().

Yeah, agreed about the unreliability. It's why I'm a fan of
panic_on_oops. :P (Except when doing lots of tests under lkdtm, then I
like having multiple Oopses without rebooting, but perhaps that is
literally the only use-case...)

>> By far the most problematic is "stop kernel execution from
>> continuing", but that's currently the behavior that BUG depends on, so
>> replacing BUG with anything needs to either fix the surrounding logic
>> to fail sanely or we have the keep the feature.
>
> Well, I'm not sure how much we actually end up depending on it,
> considering that we now have two examples of BUG() implementations
> that actually do _not_ depend on stopping execution: both the sound
> subsystem and the XFS version of BUG_ON() end up not actually doing
> the BUG() thing.

Yeah, for sure. I didn't mean to imply they all depended on it, just
that finding those that do will require manual inspection. We'll not
be able to do a flag-day on BUG until we fix everything.

-Kees

--
Kees Cook
Nexus Security

2016-10-06 23:59:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Thu, Oct 6, 2016 at 4:05 PM, Kees Cook <[email protected]> wrote:
>
> It seems to handle other things too, file descriptors, I think? Some
> giant warning, I think about fds, went away when I switched from
> do_exit() to BUG(). I'd have to go look more closely.

I think you must have changed something else too. I can't think of
what else there is than the crazy "oops_in+_progress" hacks. We used
to reset the preempt counter too, but with that being per-thread I
don't think that even matters.

So I think you may have some voodoo programming there.

An an oops (but not a do_exit()) will add a taint, and there's the
notifications that might do random things (mainly kgdb and tracing).
But that should be pretty much it.

> Yeah, for sure. I didn't mean to imply they all depended on it, just
> that finding those that do will require manual inspection. We'll not
> be able to do a flag-day on BUG until we fix everything.

Not true. That kind of thinking just says "we can never change BUG at all".

You'll never fix anything that way.

We should just switch BUG() over and be done with it. The whole point
it that since it should never trigger in the first place, the
semantics on BUG() should never matter.

And if you have some code that depends on the semantics of BUG(), that
code is buggy crap *by*definition*.

And there's no way we'll let that kind of shit determine kernel
development. That would be insane.

Linus

2016-10-07 05:59:16

by Willy Tarreau

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Thu, Oct 06, 2016 at 04:59:20PM -0700, Linus Torvalds wrote:
> We should just switch BUG() over and be done with it. The whole point
> it that since it should never trigger in the first place, the
> semantics on BUG() should never matter.
>
> And if you have some code that depends on the semantics of BUG(), that
> code is buggy crap *by*definition*.

I totally agree with this. If a developer writes BUG() somewhere, it
means he doesn't see how it is possible to end up in this situation.
Thus we cannot hope that the BUG() call is doing anything right to
fix what the code author didn't expect to happen. It just means
"try to limit the risks but I don't really know which ones".

Also we won't make things worse. Where people currently have an oops,
they'll get one or more warnings. The side effects (lockups, panic,
etc) will more or less be the same, but many of us already don't want
to continue after an oops and despite this our systems work fine, so
I don't see why anyone would suffer from such a change. However some
developers may get more details about issues than what they could get
in the past.

Willy

2016-10-07 17:16:58

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Thu, Oct 6, 2016 at 10:48 PM, Willy Tarreau <[email protected]> wrote:
> On Thu, Oct 06, 2016 at 04:59:20PM -0700, Linus Torvalds wrote:
>> We should just switch BUG() over and be done with it. The whole point
>> it that since it should never trigger in the first place, the
>> semantics on BUG() should never matter.
>>
>> And if you have some code that depends on the semantics of BUG(), that
>> code is buggy crap *by*definition*.
>
> I totally agree with this. If a developer writes BUG() somewhere, it
> means he doesn't see how it is possible to end up in this situation.
> Thus we cannot hope that the BUG() call is doing anything right to
> fix what the code author didn't expect to happen. It just means
> "try to limit the risks but I don't really know which ones".
>
> Also we won't make things worse. Where people currently have an oops,
> they'll get one or more warnings. The side effects (lockups, panic,
> etc) will more or less be the same, but many of us already don't want
> to continue after an oops and despite this our systems work fine, so
> I don't see why anyone would suffer from such a change. However some
> developers may get more details about issues than what they could get
> in the past.

Fair enough. I'll put something together for at least my use-cases and
see how ugly it gets in testing. :) I actually started on something
like this for the CONFIG_DEBUG_LIST, which had to deal with the logic
of "continue after WARN or abort after BUG" etc...

Regardless, I still think that we can't let BUG continue kernel
execution though, since it may lead to entirely unexpected behavior
(possibly security-sensitive) by still running. Upgrading BUG to
panic(), though, I'd be fine with, as a way to get people to convert
to WARN.

-Kees

--
Kees Cook
Nexus Security

2016-10-07 17:21:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Fri, Oct 7, 2016 at 10:16 AM, Kees Cook <[email protected]> wrote:
>
> Regardless, I still think that we can't let BUG continue kernel
> execution though, since it may lead to entirely unexpected behavior
> (possibly security-sensitive) by still running. Upgrading BUG to
> panic(), though, I'd be fine with, as a way to get people to convert
> to WARN.

No. Really. You can upgrade BUG() to "panic()" with a kernel command
line. But not by default.

I'm not going to take any patches that make BUG() even *worse*. That
would be insane. I'm not insane.

Linus

2016-10-07 17:33:47

by Kees Cook

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

On Fri, Oct 7, 2016 at 10:21 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Oct 7, 2016 at 10:16 AM, Kees Cook <[email protected]> wrote:
>>
>> Regardless, I still think that we can't let BUG continue kernel
>> execution though, since it may lead to entirely unexpected behavior
>> (possibly security-sensitive) by still running. Upgrading BUG to
>> panic(), though, I'd be fine with, as a way to get people to convert
>> to WARN.
>
> No. Really. You can upgrade BUG() to "panic()" with a kernel command
> line. But not by default.
>
> I'm not going to take any patches that make BUG() even *worse*. That
> would be insane. I'm not insane.

I'll quit debating how to change things, but I'll just try to point
out that the "stop execution" logic, currently, is not an accident.
Without CONFIG_BUG, BUG is defined as "do {} while (1)", and without
CONFIG_HAVE_ARCH_BUG, BUG is defined as "printk(...); panic(...);".

-Kees

--
Kees Cook
Nexus Security

2016-10-07 18:27:08

by Willy Tarreau

[permalink] [raw]
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers

Hi Kees,

On Fri, Oct 07, 2016 at 10:33:33AM -0700, Kees Cook wrote:
> I'll quit debating how to change things, but I'll just try to point
> out that the "stop execution" logic, currently, is not an accident.
> Without CONFIG_BUG, BUG is defined as "do {} while (1)", and without
> CONFIG_HAVE_ARCH_BUG, BUG is defined as "printk(...); panic(...);".

I think we're all convinced about this *initial* intent. However among
the 3197 BUG() and 9594 BUG_ON() that are present in v4.8, how many
should *really* be of them ? I'm seeing that during 4.8 development
cycle alone, we managed to add 81 BUG() and 55 BUG_ON(). I doubt we
found so many valid reasons to kill the system. 38 of them were added
to drivers/. The problem is that this "style" has accumulated over the
years. We only had 1739 BUG() and 1801 BUG_ON() in 2.6.12. So we
roughly multiplied that by 4 in 11 years.

The current trend seems to actually be to remove some of them, 3 were
removed from lib/, 4 from include/, 29 removed from fs/, one removed
from mm/ but two added to kernel/ and 3 other ones to net/.

Maybe changing only kernel/ and mm/'s BUG() occurrences to something
like "I_KNOW_I_WILL_BE_BLAMED_FOR_THIS_BUG()" and letting them kill
until they're properly audited, and leaving the other ones non-fatal
could be a reasonable tradeoff to start with ?

Willy