Date: Wed, 5 Oct 2016 21:06:04 +0200
From: Willy Tarreau
To: Linus Torvalds
Cc: Paul Gortmaker, Johannes Weiner, Andrew Morton, Antonio SJ Musumeci,
    Miklos Szeredi, Linux Kernel Mailing List, stable
Subject: Re: BUG_ON() in workingset_node_shadows_dec() triggers
Message-ID: <20161005190604.GA8116@1wt.eu>
References: <20161005054407.GC7297@1wt.eu>

On Wed, Oct 05, 2016 at 08:52:54AM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 10:44 PM, Willy Tarreau wrote:
> >
> > I think instead we should completely remove any simple way to halt the
> > system and document how to do it.
>
> Having slept on it, I suspect you're right. I worry about some
> BUG_ON() that really relies on the killing behavior, but if it takes a
> "real" fault later, that is when it gets killed. And on the whole,
> we've had lots of problems with the killing behavior over the years,
> so we should just try switching BUG_ON() over to non-fatal. It's
> unlikely to be worse than what we have now, as exemplified by this
> event.

I have the same doubts, so I would not want to run the "sed"
immediately, at least to keep the initial intent. But I think that
while everyone is right in his own yard when he puts a BUG_ON() where
he doesn't know how to handle an unsafe situation, he's wrong from a
global perspective.

For example, it could be seen as safe to crash the system in a
filesystem driver to protect against the risk of data corruption
resulting from an impossible condition, but when this happens due to a
dirty FS on a USB stick that a person inserts into the PC to save her
work, the BUG_ON() is actually the one responsible for the data loss.
Even something as painful as leaving a process in D state in this
situation would have been cleaner, as it would let the admin reboot
when he wants and not have to experience it at the worst moment.

I've already met 100% reproducible panics that I never had the time to
investigate (one involving running an mmap-based hex editor on
/dev/mem, and the other one doing stupid things with mount --move),
and I'm sure that once I find the cause I'll see a BUG_ON() that should
have been a warning. I'm pretty sure there are historically valid
BUG_ON() that are probably not needed anymore, just like I'm also
convinced that some of them are hard to get rid of.

Maybe at least having the same as WARN_ON() but prepending the dump
with a message saying "you encountered a critical bug which should
have crashed the kernel, you must absolutely report it" would help at
the beginning.

Cheers,
Willy