MIME-Version: 1.0
In-Reply-To: <1475564605.7361.11@smtp.gmail.com>
References: <CA+55aFwyNTLuZgOWMTRuabWobF27ygskuxvFd-P0n-3UNT=0Og@mail.gmail.com>
 <20161003210729.e4c39a86b8e8e4d103715c7d@linux-foundation.org>
 <CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com> <1475564605.7361.11@smtp.gmail.com>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 4 Oct 2016 09:03:41 -0700
Message-ID: <CA+55aFzH2yOrStK77cTLoOTqO9+AkOvP56d2MqG63soaNnVSRA@mail.gmail.com>
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers
To: Raymond Jennings <shentino@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Antonio SJ Musumeci <trapexit@spawn.link>,
        Miklos Szeredi <miklos@szeredi.hu>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        stable <stable@vger.kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3398
Lines: 79

On Tue, Oct 4, 2016 at 12:03 AM, Raymond Jennings <shentino@gmail.com> wrote:
> On Mon, Oct 3, 2016 at 9:12 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>
>> Killing the machine is ok if we have a situation where there literally
>> is no other choice.
>
> For the curious:
>
> This would include situations like
>
> 1.  The kernel is confused and further processing would result in undefined
> behavior (like bluesmoke detecting PCC for example)

Yes. Mainly situations where you cannot even have sane error handling,
so you can't just do "warn and return without doing anything". It
might be some "I noticed that the CPU stack is corrupt, I can't even
return, I will just have to terminate".

> 2.  Security hazards where we'd leak stuff if we don't shut down.

Honestly, I can't think of a situation where that has actually happened.

Now, sometimes BUG_ON() is less onerous than other time: if you know
that you are in a regular process context where you're not holding
core locks, BUG_ON() is actually fairly benign: it will print a big
scary message, and then it will kill the current process. It's not
going to kill the machine, unless the admin has explicitly asked for
"reboot if you have issues", which is mainly a situation for the
googles of the world - if you have millions of machines and you don't
actually *care*, then rebooting is fine.

So realistically, the main places you should use BUG_ON() variants is

 (a) development code where it replaces error handling that you just
haven't written yet, and you haven't really thought through all the
possibilities, so you're saying "this can't happen, I'll fix it
later".

  It sounds like Andrew thought that that is what VM_BUG_ON() is, and
that it wouldn't be enabled unless you're a developer. But no, this is
a "RFC patch" kind of situation.

  This kind of BUG_ON() often ends up escaping into the wild, but it
should be after *huge* amounts of testing, and by definition it should
never have been accepted during anythign but the merge window. So in a
very real sense it's really my bad for not reacting to the BUG_ON()
being added during rc8.

 (b) very core code that actually verifies some very core assumptions
that are *so* important that if they are broken the code is by
definition not really able to function.

  This is the actual intended case. It's not a "let's check that
everybody did things right", it's a "this is a major design rule in
this core code".

The example in workingset_node_shadows_dec() _could_ actually have
been that kind of (b) situation, except for the timing and lack of
deep testing. But a reasonable example of (b) would be something like
the

        BUG_ON(!PageLocked(page));

kind of code in fs/buffer.c - it's core infrastructure that has been
tested with core code, and the BUG_ON() is meant to catch bad _new_
users quickly. And it's *such* a core requirement that error handling
doesn't even make sense.

Again, workingset_node_shadows_dec() could have a BUG_ON() in theory.
But the BUG_ON() is _wrong_ when we had a situation of "oh, we just
recently noticed a bug in this area, so lets' just verify that it's
really gone".

Notice? Just the timing and intent can make the difference between
"good BUG_ON() in solid code that has been around forever" and "bad
BUG_ON() checking something that we know we might be getting wrong".

                                Linus