MIME-Version: 1.0
In-Reply-To: <Pine.LNX.4.44L0.1612131034280.1533-100000@iolanthe.rowland.org>
References: <CACT4Y+biykX8Rf=PudYiezcxmo7u35xRTnFq+ip96kAG=KqVcg@mail.gmail.com>
 <Pine.LNX.4.44L0.1612131034280.1533-100000@iolanthe.rowland.org>
From: Dmitry Vyukov <dvyukov@google.com>
Date: Tue, 13 Dec 2016 17:23:14 +0100
Message-ID: <CACT4Y+ZTWzdefxQQy9hi1ZSQ=60TNi0+MF2D8VHC7FhWVd0oHg@mail.gmail.com>
Subject: Re: usb/core: warning in usb_create_ep_devs/sysfs_create_dir_ns
To: syzkaller <syzkaller@googlegroups.com>
Cc: Andrey Konovalov <andreyknvl@google.com>,
        Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        USB list <linux-usb@vger.kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Kostya Serebryany <kcc@google.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5734
Lines: 129

On Tue, Dec 13, 2016 at 4:52 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Tue, 13 Dec 2016, Dmitry Vyukov wrote:
>
>> >> >  If it is
>> >> > not a bug in kernel source code, then it must not produce a WARNING.
>> >
>> > What about a memory allocation failure?  The memory management part of
>> > the kernel produces a WARNING message if an allocation fails and the
>> > caller did not specify __GFP_NOWARN.
>> >
>> > There is no way for a driver to guarantee that a memory allocation
>> > request will succeed -- failure is always an option.  But obviously
>> > memory allocation failures are not bugs in the kernel.
>> >
>> > Are you saying that mm/page_alloc.c:warn_alloc() should produce
>> > something other than a WARNING?
>>
>>
>> The main thing I am saying is that we absolutely need a way for a
>> human or a computer program to be able to determine if there is
>> anything wrong with kernel or not.
>
> Okay, I agree with that.  I'm not at all sure that this decision should
> be based on whether a message is a WARNING.
>
> In my experience, there is no general consensus on what conditions
> should qualify as a WARNING.  There aren't any hard-and-fast rules, so
> people are forced to rely on their own judgment.  The general
> interpretation seems to be that WARNING is appropriate for notifying
> the user whenever something important has gone unexpectedly wrong.
> That covers a lot of ground, including things (like hardware failures)
> that are not kernel bugs.

see below

>> Page_alloc produces a WARNING iff you ask for an amount of memory that
>> can't possibly be allocated under any circumstances (order >=
>> MAX_ORDER).
>
> Doesn't it also produce a WARNING under other circumstances?

No.

OOM is not a WARNING and is easily distinguishable from BUG/WARNING.


>> That's not just an allocation failure. Kernel itself
>> should not generally ask for such large amounts of memory. If the
>> allocation size is dictated by user, then kernel code should either
>> use __GFP_NOWARN, or impose own stricter limit dictated by context
>> (e.g. if it's a text command of known format, then limit can be as
>> small as say 128 bytes).
>
> All right, yes, it makes sense to use __GFP_NOWARN when the allocation
> size is dictated by the user.
>
> But still, even when a stricter limit has been imposed, a memory
> allocation request may fail.  In fact, as far as I know, the kernel has
> _no_ guarantee at all that any memory allocation request will succeed!
> The best it offers is that sufficiently small requests may wait
> indefinitely for memory to become available, which isn't much better
> than failing.

Memory allocator does not print WARNINGs on allocation failures.


>> Re fake hardware. panic_on_warn will definitely cause problems. I
>> don't know if it used in any paranoid production systems or not,
>> though. But more generally, I don't see how it is different from
>> incorrect syscall arguments or nonsensical network packets received
>> from free internet. In ideal productions environments none of these
>> incorrect inputs to kernel should happen. I don't see any single
>> reason to not protect kernel from incorrect input in this case as
>> well, as we do in all other cases.
>
> I _do_ see a reason.  Real computers don't live in ideal production
> environments.  Why go to extra trouble to add protection for some
> particular incorrect input when the kernel is _already_ capable of
> handling this input in a valid manner (even though this may include
> logging a WARNING message)?

One reason is to be able to distinguish kernel bugs from incorrect
inputs to kernel.
also see below


>> In particular, it would resolve a
>> very real and practical issue for us -- fuzzer will not reboot machine
>> every minute, and we will not spend time looking at these WARNINGs,
>> and we will not spend your time by reporting these WARNINGs.
>
> Maybe you should decide that ERROR messages indicate a kernel bug,
> rather than WARNING messages.  Even that is questionable, but you will
> get far fewer false positives.
>
> Even better, you should publicize this decision (in the kernel
> documentation, on various mailing lists, on LWN, and so forth), and
> enforce it by reducing existing ERROR severity levels to WARNINGs in
> places where they do not indicate a kernel bug.

OK, I have some numbers on my hands.
To date we've run about 1e12 random programs covering 1000+ syscalls
and have found 300+ bugs. 50+ of these bugs are found on WARNINGs
(search by "warning" here
https://github.com/google/syzkaller/wiki/Found-Bugs). Some of the bugs
found on WARNINGs are as severe as VMM exploitations. With that
coverage the _only_ case I know of now where a WARNING does not mean
bug in kernel source code is basically the one we are arguing about
now. Provided that total number of WARN/WARN_ON in kernel code is
~10000. There is currently a very-very-very-very-very clear WARNING
usage practice that suggests that WARNING is-a bug in kernel code.
If we stop treating WARNINGs as bugs we will miss dozens of real bugs.
This just doesn't make sense provided that we are talking about
basically a single precedent.

We actually even treat INFO as bugs, because there are:
INFO: possible circular locking dependency detected
INFO: rcu_preempt detected stalls
INFO: rcu_sched detected stalls
INFO: rcu_preempt self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: suspicious RCU usage
INFO: task .* blocked for more than [0-9]+ seconds
with only a single exception of:
INFO: lockdep is turned off

For INFO the story is so clear. The name strongly suggests that it is
not a bug. So probably we should promote harmful INFO to WARNINGs. And
then stop treating INFO as bugs.