Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754148AbcJDQDo (ORCPT ); Tue, 4 Oct 2016 12:03:44 -0400 Received: from mail-oi0-f53.google.com ([209.85.218.53]:33931 "EHLO mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752699AbcJDQDm (ORCPT ); Tue, 4 Oct 2016 12:03:42 -0400 MIME-Version: 1.0 In-Reply-To: <1475564605.7361.11@smtp.gmail.com> References: <20161003210729.e4c39a86b8e8e4d103715c7d@linux-foundation.org> <1475564605.7361.11@smtp.gmail.com> From: Linus Torvalds Date: Tue, 4 Oct 2016 09:03:41 -0700 X-Google-Sender-Auth: DZPnCEsGq2niklNjvGMp-vRBAsc Message-ID: Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers To: Raymond Jennings Cc: Andrew Morton , Johannes Weiner , Antonio SJ Musumeci , Miklos Szeredi , Linux Kernel Mailing List , stable Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3398 Lines: 79 On Tue, Oct 4, 2016 at 12:03 AM, Raymond Jennings wrote: > On Mon, Oct 3, 2016 at 9:12 PM, Linus Torvalds wrote: >> >> >> Killing the machine is ok if we have a situation where there literally >> is no other choice. > > For the curious: > > This would include situations like > > 1. The kernel is confused and further processing would result in undefined > behavior (like bluesmoke detecting PCC for example) Yes. Mainly situations where you cannot even have sane error handling, so you can't just do "warn and return without doing anything". It might be some "I noticed that the CPU stack is corrupt, I can't even return, I will just have to terminate". > 2. Security hazards where we'd leak stuff if we don't shut down. Honestly, I can't think of a situation where that has actually happened. Now, sometimes BUG_ON() is less onerous than other time: if you know that you are in a regular process context where you're not holding core locks, BUG_ON() is actually fairly benign: it will print a big scary message, and then it will kill the current process. It's not going to kill the machine, unless the admin has explicitly asked for "reboot if you have issues", which is mainly a situation for the googles of the world - if you have millions of machines and you don't actually *care*, then rebooting is fine. So realistically, the main places you should use BUG_ON() variants is (a) development code where it replaces error handling that you just haven't written yet, and you haven't really thought through all the possibilities, so you're saying "this can't happen, I'll fix it later". It sounds like Andrew thought that that is what VM_BUG_ON() is, and that it wouldn't be enabled unless you're a developer. But no, this is a "RFC patch" kind of situation. This kind of BUG_ON() often ends up escaping into the wild, but it should be after *huge* amounts of testing, and by definition it should never have been accepted during anythign but the merge window. So in a very real sense it's really my bad for not reacting to the BUG_ON() being added during rc8. (b) very core code that actually verifies some very core assumptions that are *so* important that if they are broken the code is by definition not really able to function. This is the actual intended case. It's not a "let's check that everybody did things right", it's a "this is a major design rule in this core code". The example in workingset_node_shadows_dec() _could_ actually have been that kind of (b) situation, except for the timing and lack of deep testing. But a reasonable example of (b) would be something like the BUG_ON(!PageLocked(page)); kind of code in fs/buffer.c - it's core infrastructure that has been tested with core code, and the BUG_ON() is meant to catch bad _new_ users quickly. And it's *such* a core requirement that error handling doesn't even make sense. Again, workingset_node_shadows_dec() could have a BUG_ON() in theory. But the BUG_ON() is _wrong_ when we had a situation of "oh, we just recently noticed a bug in this area, so lets' just verify that it's really gone". Notice? Just the timing and intent can make the difference between "good BUG_ON() in solid code that has been around forever" and "bad BUG_ON() checking something that we know we might be getting wrong". Linus