2022-08-24 16:35:15

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 0/2] coding-style.rst: document BUG() and WARN() rules

As it seems to be rather unclear if/when to use BUG(), BUG_ON(),
VM_BUG_ON(), WARN_ON_ONCE(), ... let's try to document the result of a
recent discussion.

Details can be found in patch #1.

--------------------------------------------------------------------------

Here is some braindump after thinking about BUG_ON(), WARN_ON(), ... and
how it interacts with kdump.

I was wondering what the expectation on a system with armed kdump are,
for example, after we removed most BUG_ON() instances and replaced them
by WARN_ON_ONCE(). I would assume that we actually want to panic in some
cases to capture a proper system dump instead of continuing and eventually
ending up with a completely broken system where it's hard to extract any
useful debug information. We'd have to enable panic_on_warn. But we'd only
want to do that in case kdump is actually armed after boot.

So one idea would be to have some kind of "panic_on_warn_with_kdump" mode.
But then, we'd actually crash+kdump even on the most harmless WARN_ON()
conditions, because they all look alike. To compensate, we would need
some kind of "severity" levels of a warning -- at least some kind of
"this is harmless and we can easily recover, but please tell the
developers" vs. "this is real bad and unexpected, capture a dump
immediately instead of trying to recover and eventually failing miserably".

But then, maybe we really want something like BUG_ON() -- let's call it
CBUG_ON() for simplicity -- but be able to make it be usable in
conditionals (to implement recovery code if easily possible) and make the
runtime behavior configurable.

if (CBUG_ON(whatever))
try_to_recover()

Whereby, for example, "panic_on_cbug" and "panic_on_cbug_with_kdump"
could control the runtime behavior.

But this is just a braindump and I assume people reading along have other,
better ideas. Especially, a better name for CBUG.


Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Laight <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Andy Whitcroft <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Dwaipayan Ray <[email protected]>
Cc: Lukas Bulwahn <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Dave Young <[email protected]>

David Hildenbrand (2):
coding-style.rst: document BUG() and WARN() rules ("do not crash the
kernel")
checkpatch: warn on usage of VM_BUG_ON() and friends

Documentation/process/coding-style.rst | 27 ++++++++++++++++++++++++++
scripts/checkpatch.pl | 6 +++---
2 files changed, 30 insertions(+), 3 deletions(-)

--
2.37.1


2022-08-24 16:54:23

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

Linus notes [1] that the introduction of new code that uses VM_BUG_ON()
is just as bad as BUG_ON(), because it will crash the kernel on
distributions that enable CONFIG_DEBUG_VM (like Fedora):

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
no different, the only difference is "we can make the code smaller
because these are less important". [2]

This resulted in a more generic discussion about usage of BUG() and
friends. While there might be corner cases that still deserve a BUG_ON(),
most BUG_ON() cases should simply use WARN_ON_ONCE() and implement a
recovery path if reasonable:

The only possible case where BUG_ON can validly be used is "I have
some fundamental data corruption and cannot possibly return an
error". [2]

As a very good approximation is the general rule:

"absolutely no new BUG_ON() calls _ever_" [2]

... not even if something really shouldn't ever happen and is merely for
documenting that an invariant always has to hold.

There is only one good BUG_ON():

Now, that said, there is one very valid sub-form of BUG_ON():
BUILD_BUG_ON() is absolutely 100% fine. [2]

While WARN will also crash the machine with panic_on_warn set, that's
exactly to be expected:

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And
the actual real hardware with real drivers, running real loads by
users. [3]

The basic idea is that warnings will similarly get reported by users
and be found during testing. However, in contrast to a BUG(), there is a
way to actually influence the expected behavior (e.g., panic_on_warn)
and to eventually keep the machine alive to extract some debug info.

Ingo notes that not all WARN_ON_ONCE cases need recovery. If we don't ever
expect this code to trigger in any case, recovery code is not really
helpful.

I'd prefer to keep all these warnings 'simple' - i.e. no attempted
recovery & control flow, unless we ever expect these to trigger.
[4]

There have been different rules floating around that were never properly
documented. Let's try to clarify.

[1] https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=NKPWHU8AdO3V56BXcCsU97oYJ1EA@mail.gmail.com
[2] https://lore.kernel.org/r/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@mail.gmail.com
[3] https://lore.kernel.org/r/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@mail.gmail.com

Signed-off-by: David Hildenbrand <[email protected]>
---
Documentation/process/coding-style.rst | 27 ++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
index 03eb53fd029a..a6d81ff578fe 100644
--- a/Documentation/process/coding-style.rst
+++ b/Documentation/process/coding-style.rst
@@ -1186,6 +1186,33 @@ expression used. For instance:
#endif /* CONFIG_SOMETHING */


+22) Do not crash the kernel
+---------------------------
+
+Do not add new code that uses BUG(), BUG_ON(), VM_BUG_ON(), ... to crash
+the kernel if certain conditions are not met. Instead, use WARN_ON_ONCE()
+with recovery code (if reasonable) instead. Unavoidable data corruption /
+security issues might be a very rare exception to this rule and need good
+justification.
+
+There is no need for WARN_ON_ONCE() recovery code if we never expect it to
+possibly trigger unless something goes seriously wrong or some other code
+is changed to break invariants. Note that VM_WARN_ON_ONCE() cannot be used
+in conditionals.
+
+Usage of WARN() and friends is fine for something that is not expected to
+be triggered easily. ``panic_on_warn`` users are not an excuse to not use
+WARN(): whoever enables ``panic_on_warn`` explicitly asked the kernel to
+crash in this case.
+
+However, WARN() and friends should not be used for something that is
+expected to trigger easily, for example, by user space. pr_warn_once()
+might be a reasonable replacement to notify the user.
+
+Note that BUILD_BUG_ON() is perfectly fine because it will make compilation
+fail instead of crashing the kernel.
+
+
Appendix I) References
----------------------

--
2.37.1

2022-08-24 17:09:02

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

checkpatch does not point out that VM_BUG_ON() and friends should be
avoided, however, Linus notes:

VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
no different, the only difference is "we can make the code smaller
because these are less important". [1]

So let's warn on VM_BUG_ON() and friends as well. While at it, make it
clearer that the kernel really shouldn't be crashed.

Note that there are some other *_BUG_ON flavors, but they are not all
bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
flags KVM as being buggy, so we'll not care about them for now here.

[1] https://lore.kernel.org/r/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@mail.gmail.com

Signed-off-by: David Hildenbrand <[email protected]>
---
scripts/checkpatch.pl | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 79e759aac543..4c18acf17032 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4695,12 +4695,12 @@ sub process {
}
}

-# avoid BUG() or BUG_ON()
- if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
+# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
+ if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
my $msg_level = \&WARN;
$msg_level = \&CHK if ($file);
&{$msg_level}("AVOID_BUG",
- "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
+ "Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG(), BUG_ON(), VM_BUG_ON(), ...\n" . $herecurr);
}

# avoid LINUX_VERSION_CODE
--
2.37.1

2022-08-24 17:26:52

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
> checkpatch does not point out that VM_BUG_ON() and friends should be
> avoided, however, Linus notes:
>
> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
> no different, the only difference is "we can make the code smaller
> because these are less important". [1]
>
> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
> clearer that the kernel really shouldn't be crashed.
>
> Note that there are some other *_BUG_ON flavors, but they are not all
> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
> flags KVM as being buggy, so we'll not care about them for now here.
[]
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
[]
> @@ -4695,12 +4695,12 @@ sub process {
> }
> }
>
> -# avoid BUG() or BUG_ON()
> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {

Perhaps better as something like the below to pick up more variants

if ($line =~ /\b(?!KVM_|BUILD_)(?:[A-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/

> my $msg_level = \&WARN;
> $msg_level = \&CHK if ($file);
> &{$msg_level}("AVOID_BUG",
> - "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);

and maybe:

"Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG() or variants\n" . $herecurr);

2022-08-24 19:31:11

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

On 24.08.22 18:52, Joe Perches wrote:
> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>> checkpatch does not point out that VM_BUG_ON() and friends should be
>> avoided, however, Linus notes:
>>
>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>> no different, the only difference is "we can make the code smaller
>> because these are less important". [1]
>>
>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>> clearer that the kernel really shouldn't be crashed.
>>
>> Note that there are some other *_BUG_ON flavors, but they are not all
>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>> flags KVM as being buggy, so we'll not care about them for now here.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -4695,12 +4695,12 @@ sub process {
>> }
>> }
>>
>> -# avoid BUG() or BUG_ON()
>> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
>
> Perhaps better as something like the below to pick up more variants
>
> if ($line =~ /\b(?!KVM_|BUILD_)(?:[A-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/

... then I'll have to scan the other cases if they do something similar
as KVM. ... well, okay, I'll do it. :)

>
>> my $msg_level = \&WARN;
>> $msg_level = \&CHK if ($file);
>> &{$msg_level}("AVOID_BUG",
>> - "Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
>
> and maybe:
>
> "Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG() or variants\n" . $herecurr);
>
>

Yes, thanks!

--
Thanks,

David / dhildenb

2022-08-24 22:56:47

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On 8/24/22 09:30, David Hildenbrand wrote:
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 03eb53fd029a..a6d81ff578fe 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1186,6 +1186,33 @@ expression used. For instance:
> #endif /* CONFIG_SOMETHING */
>

I like the idea of adding this documentation, and this is the right
place. Naturally, if one likes something, one must immediately change
it. :) Therefore, here is an alternative writeup that I think captures
what you and the email threads were saying.

How's this sound?

diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
index 03eb53fd029a..32df0d503388 100644
--- a/Documentation/process/coding-style.rst
+++ b/Documentation/process/coding-style.rst
@@ -1185,6 +1185,53 @@ expression used. For instance:
...
#endif /* CONFIG_SOMETHING */

+22) Do not crash the kernel
+---------------------------
+
+Use WARN() rather than BUG()
+****************************
+
+Do not add new code that uses any of the BUG() variants, such as BUG(),
+BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
+WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required
+if there is no reasonable way to at least partially recover.
+
+Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
+**************************************************
+
+WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is
+common for a given warning condition, if it occurs at all, to occur multiple
+times. (For example, once per file, or once per struct page.) This can fill up
+and wrap the kernel log, and can even slow the system enough that the excessive
+logging turns into its own, additional problem.
+
+Do not WARN lightly
+*******************
+
+WARN*() is intended for unexpected, this-should-never-happen situations. WARN*()
+macros are not to be used for anything that is expected to happen during normal
+operation. These are not pre- or post-condition asserts, for example. Again:
+WARN*() must not be used for a condition that is expected to trigger easily, for
+example, by user space actions. pr_warn_once() is a possible alternative, if you
+need to notify the user of a problem.
+
+Do not worry about panic_on_warn users
+**************************************
+
+A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
+available kernel option, and that many users set this option. This is why there
+is a "Do not WARN lightly" writeup, above. However, the existence of
+panic_on_warn users is not a valid reason to avoid the judicious use WARN*().
+That is because, whoever enables panic_on_warn has explicitly asked the kernel
+to crash if a WARN*() fires, and such users must be prepared to deal with the
+consequences of a system that is somewhat more likely to crash.
+
+Use BUILD_BUG_ON() for compile-time assertions
+**********************************************
+
+The use of BUILD_BUG_ON() is acceptable and encouraged, because it is a
+compile-time assertion that has no effect at runtime.


thanks,

--
John Hubbard
NVIDIA

2022-08-25 02:40:44

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH RFC 0/2] coding-style.rst: document BUG() and WARN() rules

On 8/24/22 09:30, David Hildenbrand wrote:
...
> So one idea would be to have some kind of "panic_on_warn_with_kdump" mode.
> But then, we'd actually crash+kdump even on the most harmless WARN_ON()
> conditions, because they all look alike. To compensate, we would need
> some kind of "severity" levels of a warning -- at least some kind of
> "this is harmless and we can easily recover, but please tell the
> developers" vs. "this is real bad and unexpected, capture a dump
> immediately instead of trying to recover and eventually failing miserably".
>
> But then, maybe we really want something like BUG_ON() -- let's call it
> CBUG_ON() for simplicity -- but be able to make it be usable in
> conditionals (to implement recovery code if easily possible) and make the
> runtime behavior configurable.
>
> if (CBUG_ON(whatever))
> try_to_recover()
>
> Whereby, for example, "panic_on_cbug" and "panic_on_cbug_with_kdump"
> could control the runtime behavior.
>
> But this is just a braindump and I assume people reading along have other,
> better ideas. Especially, a better name for CBUG.
>

If this direction is pursued (as opposed to just recommending the
panic_on_warn approach, which is probably viable as well, btw), then I'd
suggest this name:

PANIC_ON()

It's different than BUG_ON(), because it calls panic() instead of
immediately halting on a undefined instruction exception (yes, that's
x86-centric, I know). So at least in the better behaved cases, there is
a backtrace and a reboot, rather than a mysterious hard lockup.

As Mel points out [1], it's not always that much better. But in my
experience, this is usually a *lot* better.

It's only intended for a few very special cases. Not intended as any
sort of assert (which BUG sometimes was used for).

This forces a panic(), which is what David is looking for.

[1] https://lore.kernel.org/all/[email protected]/


thanks,

--
John Hubbard
NVIDIA

2022-08-25 10:26:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

On 24.08.22 18:52, Joe Perches wrote:
> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>> checkpatch does not point out that VM_BUG_ON() and friends should be
>> avoided, however, Linus notes:
>>
>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>> no different, the only difference is "we can make the code smaller
>> because these are less important". [1]
>>
>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>> clearer that the kernel really shouldn't be crashed.
>>
>> Note that there are some other *_BUG_ON flavors, but they are not all
>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>> flags KVM as being buggy, so we'll not care about them for now here.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -4695,12 +4695,12 @@ sub process {
>> }
>> }
>>
>> -# avoid BUG() or BUG_ON()
>> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
>
> Perhaps better as something like the below to pick up more variants
>

Trying to find more possible variants and exceptions

$ git grep -h -o -E "\b[a-zA-Z]+_BUG(_ON(_[a-zA-Z]+)*)?\(" | sort | uniq
AA_BUG(
-> Ok, no BUG()
ASM_BUG(
-> Bad
BUILD_BUG(
BUILD_BUG_ON(
BUILD_BUG_ON_INVALID(
BUILD_BUG_ON_MSG(
BUILD_BUG_ON_ZERO(
-> Ok
CI_BUG_ON(
-> Bad with CONFIG_DRM_I915_DEBUG
DCCP_BUG(
DCCP_BUG_ON(
-> Ok, no BUG()
do_BUG(
-> BUG implementation, ok.
GEM_BUG_ON(
-> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE
GLOCK_BUG_ON(
-> Bad
handle_BUG(
-> BUG implementation, ok.
IDA_BUG_ON(
KVM_BUG(
KVM_BUG_ON(
-> Ok, no BUG()
lkdtm_BUG(
paravirt_BUG(
-> bad
PROM_BUG(
-> unused, will remove
RWLOCK_BUG_ON(
-> Ok, no BUG()
snd_BUG(
snd_BUG_ON(
-> Ok, no BUG()
SNIC_BUG_ON(
-> Bad
SPIN_BUG_ON(
-> Ok, no BUG()
UNWINDER_BUG(
UNWINDER_BUG_ON(
VIRTUAL_BUG_ON(
VM_BUG_ON(
VM_BUG_ON_FOLIO(
VM_BUG_ON_MM(
VM_BUG_ON_PAGE(
VM_BUG_ON_PGFLAGS(
VM_BUG_ON_VMA(
XA_BUG_ON(
-> Bad

So an extended versions of your proposal like (ignoring do_BUG and handle_BUG, people are smart enough to figure that out)

if ($line =~ /\b(?!AA_|BUILD_|DCCP_|IDA_|KVM_|RWLOCK_|snd_|SPIN_)(?:[a-zA-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/

?

--
Thanks,

David / dhildenb

2022-08-25 12:00:03

by Jani Nikula

[permalink] [raw]
Subject: Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

On Thu, 25 Aug 2022, David Hildenbrand <[email protected]> wrote:
> On 24.08.22 18:52, Joe Perches wrote:
>> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>>> checkpatch does not point out that VM_BUG_ON() and friends should be
>>> avoided, however, Linus notes:
>>>
>>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>>> no different, the only difference is "we can make the code smaller
>>> because these are less important". [1]
>>>
>>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>>> clearer that the kernel really shouldn't be crashed.
>>>
>>> Note that there are some other *_BUG_ON flavors, but they are not all
>>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>>> flags KVM as being buggy, so we'll not care about them for now here.
>> []
>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> []
>>> @@ -4695,12 +4695,12 @@ sub process {
>>> }
>>> }
>>>
>>> -# avoid BUG() or BUG_ON()
>>> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>>> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
>>
>> Perhaps better as something like the below to pick up more variants
>>
>
> Trying to find more possible variants and exceptions

> CI_BUG_ON(
> -> Bad with CONFIG_DRM_I915_DEBUG
> GEM_BUG_ON(
> -> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE

These are hidden behind debug knobs that we use in our CI to
specifically catch "should not happen" cases fast and loud. Should not
be a problem for regular users.

BR,
Jani.


> So an extended versions of your proposal like (ignoring do_BUG and handle_BUG, people are smart enough to figure that out)
>
> if ($line =~ /\b(?!AA_|BUILD_|DCCP_|IDA_|KVM_|RWLOCK_|snd_|SPIN_)(?:[a-zA-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/
>
> ?

--
Jani Nikula, Intel Open Source Graphics Center

2022-08-25 12:02:54

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 2/2] checkpatch: warn on usage of VM_BUG_ON() and friends

On 25.08.22 13:43, Jani Nikula wrote:
> On Thu, 25 Aug 2022, David Hildenbrand <[email protected]> wrote:
>> On 24.08.22 18:52, Joe Perches wrote:
>>> On Wed, 2022-08-24 at 18:31 +0200, David Hildenbrand wrote:
>>>> checkpatch does not point out that VM_BUG_ON() and friends should be
>>>> avoided, however, Linus notes:
>>>>
>>>> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
>>>> no different, the only difference is "we can make the code smaller
>>>> because these are less important". [1]
>>>>
>>>> So let's warn on VM_BUG_ON() and friends as well. While at it, make it
>>>> clearer that the kernel really shouldn't be crashed.
>>>>
>>>> Note that there are some other *_BUG_ON flavors, but they are not all
>>>> bad: for example, KVM_BUG_ON() only triggers a WARN_ON_ONCE and then
>>>> flags KVM as being buggy, so we'll not care about them for now here.
>>> []
>>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>>> []
>>>> @@ -4695,12 +4695,12 @@ sub process {
>>>> }
>>>> }
>>>>
>>>> -# avoid BUG() or BUG_ON()
>>>> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
>>>> +# do not use BUG(), BUG_ON(), VM_BUG_ON() and friends.
>>>> + if ($line =~ /\b(?:BUG|BUG_ON|VM_BUG_ON|VM_BUG_ON_[A-Z]+)\b/) {
>>>
>>> Perhaps better as something like the below to pick up more variants
>>>
>>
>> Trying to find more possible variants and exceptions
>
>> CI_BUG_ON(
>> -> Bad with CONFIG_DRM_I915_DEBUG
>> GEM_BUG_ON(
>> -> Bad with CONFIG_DRM_I915_DEBUG_GEM_ONCE
>
> These are hidden behind debug knobs that we use in our CI to
> specifically catch "should not happen" cases fast and loud. Should not
> be a problem for regular users.
>

I tend to agree but I don't think this is worth an exception.
VM_BUG_ON also requires CONFIG_DEBUG_VM and absolutely shouldn't
be used as I learned.

Quoting Linus:

Really. BUG_ON() IS NOT FOR DEBUGGING. [1]

This kind of "I don't think this can happen" is _never_ an excuse for it. [2]


For CI work, it might be sufficient to use WARN_ON_ONCE() combined with panic_on_warn.

[1] https://lore.kernel.org/all/CAHk-=wiEAH+ojSpAgx_Ep=NKPWHU8AdO3V56BXcCsU97oYJ1EA@mail.gmail.com/
[2] https://lore.kernel.org/all/CAHk-=wg40EAZofO16Eviaj7mfqDhZ2gVEbvfsMf6gYzspRjYvw@mail.gmail.com/

--
Thanks,

David / dhildenb

2022-08-25 12:20:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On 24.08.22 23:59, John Hubbard wrote:
> On 8/24/22 09:30, David Hildenbrand wrote:
>> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
>> index 03eb53fd029a..a6d81ff578fe 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -1186,6 +1186,33 @@ expression used. For instance:
>> #endif /* CONFIG_SOMETHING */
>>
>
> I like the idea of adding this documentation, and this is the right
> place. Naturally, if one likes something, one must immediately change
> it. :) Therefore, here is an alternative writeup that I think captures
> what you and the email threads were saying.
>
> How's this sound?

Much better, thanks! :)

>
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 03eb53fd029a..32df0d503388 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1185,6 +1185,53 @@ expression used. For instance:
> ...
> #endif /* CONFIG_SOMETHING */
>
> +22) Do not crash the kernel
> +---------------------------
> +
> +Use WARN() rather than BUG()
> +****************************
> +
> +Do not add new code that uses any of the BUG() variants, such as BUG(),
> +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
> +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required
> +if there is no reasonable way to at least partially recover.

I'll tend to keep in this section:

"Unavoidable data corruption / security issues might be a very rare
exception to this rule and need good justification."

Because there are rare exceptions, and I'd much rather document the
clear exception to this rule.

> +
> +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
> +**************************************************
> +
> +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is
> +common for a given warning condition, if it occurs at all, to occur multiple
> +times. (For example, once per file, or once per struct page.) This can fill up

I'll drop the "For example" part. I feel like this doesn't really need
an example -- most probably we've all been there already when the kernel
log was flooded :)

> +and wrap the kernel log, and can even slow the system enough that the excessive
> +logging turns into its own, additional problem.
> +
> +Do not WARN lightly
> +*******************
> +
> +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*()
> +macros are not to be used for anything that is expected to happen during normal
> +operation. These are not pre- or post-condition asserts, for example. Again:
> +WARN*() must not be used for a condition that is expected to trigger easily, for
> +example, by user space actions. pr_warn_once() is a possible alternative, if you
> +need to notify the user of a problem.
> +
> +Do not worry about panic_on_warn users
> +**************************************
> +
> +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> +available kernel option, and that many users set this option. This is why there
> +is a "Do not WARN lightly" writeup, above. However, the existence of
> +panic_on_warn users is not a valid reason to avoid the judicious use WARN*().
> +That is because, whoever enables panic_on_warn has explicitly asked the kernel
> +to crash if a WARN*() fires, and such users must be prepared to deal with the
> +consequences of a system that is somewhat more likely to crash.

Side note: especially with kdump() I feel like we might see much more
widespread use of panic_on_warn to be able to actually extract debug
information in a controlled manner -- for example on enterprise distros.
... which would then make these systems more likely to crash, because
there is no way to distinguish a rather harmless warning from a severe
warning :/ . But let's see if some kdump() folks will share their
opinion as reply to the cover letter.

--
Thanks,

David / dhildenb

2022-08-26 02:19:05

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

Hi David,

[Added more people in cc]

On Thu, 25 Aug 2022 at 20:13, David Hildenbrand <[email protected]> wrote:
>
> On 24.08.22 23:59, John Hubbard wrote:
> > On 8/24/22 09:30, David Hildenbrand wrote:
> >> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> >> index 03eb53fd029a..a6d81ff578fe 100644
> >> --- a/Documentation/process/coding-style.rst
> >> +++ b/Documentation/process/coding-style.rst
> >> @@ -1186,6 +1186,33 @@ expression used. For instance:
> >> #endif /* CONFIG_SOMETHING */
> >>
> >
> > I like the idea of adding this documentation, and this is the right
> > place. Naturally, if one likes something, one must immediately change
> > it. :) Therefore, here is an alternative writeup that I think captures
> > what you and the email threads were saying.
> >
> > How's this sound?
>
> Much better, thanks! :)
>
> >
> > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > index 03eb53fd029a..32df0d503388 100644
> > --- a/Documentation/process/coding-style.rst
> > +++ b/Documentation/process/coding-style.rst
> > @@ -1185,6 +1185,53 @@ expression used. For instance:
> > ...
> > #endif /* CONFIG_SOMETHING */
> >
> > +22) Do not crash the kernel
> > +---------------------------
> > +
> > +Use WARN() rather than BUG()
> > +****************************
> > +
> > +Do not add new code that uses any of the BUG() variants, such as BUG(),
> > +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
> > +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not required
> > +if there is no reasonable way to at least partially recover.
>
> I'll tend to keep in this section:
>
> "Unavoidable data corruption / security issues might be a very rare
> exception to this rule and need good justification."
>
> Because there are rare exceptions, and I'd much rather document the
> clear exception to this rule.
>
> > +
> > +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
> > +**************************************************
> > +
> > +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it is
> > +common for a given warning condition, if it occurs at all, to occur multiple
> > +times. (For example, once per file, or once per struct page.) This can fill up
>
> I'll drop the "For example" part. I feel like this doesn't really need
> an example -- most probably we've all been there already when the kernel
> log was flooded :)
>
> > +and wrap the kernel log, and can even slow the system enough that the excessive
> > +logging turns into its own, additional problem.
> > +
> > +Do not WARN lightly
> > +*******************
> > +
> > +WARN*() is intended for unexpected, this-should-never-happen situations. WARN*()
> > +macros are not to be used for anything that is expected to happen during normal
> > +operation. These are not pre- or post-condition asserts, for example. Again:
> > +WARN*() must not be used for a condition that is expected to trigger easily, for
> > +example, by user space actions. pr_warn_once() is a possible alternative, if you
> > +need to notify the user of a problem.
> > +
> > +Do not worry about panic_on_warn users
> > +**************************************
> > +
> > +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > +available kernel option, and that many users set this option. This is why there
> > +is a "Do not WARN lightly" writeup, above. However, the existence of
> > +panic_on_warn users is not a valid reason to avoid the judicious use WARN*().
> > +That is because, whoever enables panic_on_warn has explicitly asked the kernel
> > +to crash if a WARN*() fires, and such users must be prepared to deal with the
> > +consequences of a system that is somewhat more likely to crash.
>
> Side note: especially with kdump() I feel like we might see much more
> widespread use of panic_on_warn to be able to actually extract debug
> information in a controlled manner -- for example on enterprise distros.
> ... which would then make these systems more likely to crash, because
> there is no way to distinguish a rather harmless warning from a severe
> warning :/ . But let's see if some kdump() folks will share their
> opinion as reply to the cover letter.

I can understand the intention of this patch, and I totally agree that
BUG() should be used carefully, this is a good proposal if we can
clearly define the standard about when to use BUG(). But I do have
some worries, I think this standard is different for different sub
components, it is not clear to me at least, so this may introduce an
unstable running kernel and cause troubles (eg. data corruption) with
a WARN instead of a BUG. Probably it would be better to say "Do not
WARN lightly, and do not hesitate to use BUG if it is really needed"?

About "patch_on_warn", it will depend on the admin/end user to set it,
it is not a good idea for distribution to set it. It seems we are
leaving it to end users to take the risk of a kernel panic even with
all kernel WARN even if it is sometimes not necessary.


>
> --
> Thanks,
>
> David / dhildenb
>

Thanks
Dave

2022-08-26 17:11:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On 26.08.22 03:43, Dave Young wrote:
> Hi David,
>
> [Added more people in cc]
>

Hi Dave,

thanks for your input!

[...]

>> Side note: especially with kdump() I feel like we might see much more
>> widespread use of panic_on_warn to be able to actually extract debug
>> information in a controlled manner -- for example on enterprise distros.
>> ... which would then make these systems more likely to crash, because
>> there is no way to distinguish a rather harmless warning from a severe
>> warning :/ . But let's see if some kdump() folks will share their
>> opinion as reply to the cover letter.
>
> I can understand the intention of this patch, and I totally agree that
> BUG() should be used carefully, this is a good proposal if we can
> clearly define the standard about when to use BUG(). But I do have

Essentially, the general rule from Linus is "absolutely no new BUG_ON()
calls ever" -- but I think the consensus in that thread was that there
are corner cases when it comes to unavoidable data corruption/security
issues. And these are rare cases, not the usual case where we'd have
used BUG_ON()/VM_BUG_ON().

> some worries, I think this standard is different for different sub
> components, it is not clear to me at least, so this may introduce an
> unstable running kernel and cause troubles (eg. data corruption) with
> a WARN instead of a BUG. Probably it would be better to say "Do not
> WARN lightly, and do not hesitate to use BUG if it is really needed"?


Well, I don't make the rules, I document them and share them for general
awareness/comments :) Documenting this is valuable, because there seem
to be quite some different opinions floating around in the community --
and I've been learning different rules from different people over the years.

>
> About "patch_on_warn", it will depend on the admin/end user to set it,
> it is not a good idea for distribution to set it. It seems we are
> leaving it to end users to take the risk of a kernel panic even with
> all kernel WARN even if it is sometimes not necessary.

My question would be what we could add/improve to keep systems with
kdump armed running as expected for end users, that is most probably:

1) don't crash on harmless WARN() that can just be reported and the
machine will continue running mostly fine without real issues.
2) crash on severe issues (previously BUG) such that we can properly
capture a system dump via kdump. The restart the machine.

Of course, once one would run into 2), one could try reproducing with
"panic_on_warn" to get a reasonable system dump. But I guess that's not
what enterprise customers expect.


One wild idea (in the cover letter) was to add something new that can be
configured by user space and that expresses that something is more
severe than just some warning that can be recovered easily. But it can
eventually be recovered to keep the system running to some degree. But
still, it's configurable if we want to trigger a panic or let the system
run.

John mentioned PANIC_ON().


What would be your expectation for kdump users under which conditions we
want to trigger kdump and when not?

Regarding panic_on_warn, how often do e.g., RHEL users observe warnings
that we're not able to catch during testing, such that "panic_on_warn"
would be a real no-go?

--
Thanks,

David / dhildenb

2022-08-29 02:12:22

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

Hi David,

On Sat, 27 Aug 2022 at 01:02, David Hildenbrand <[email protected]> wrote:
>
> On 26.08.22 03:43, Dave Young wrote:
> > Hi David,
> >
> > [Added more people in cc]
> >
>
> Hi Dave,
>
> thanks for your input!

You are welcome :)

>
> [...]
>
> >> Side note: especially with kdump() I feel like we might see much more
> >> widespread use of panic_on_warn to be able to actually extract debug
> >> information in a controlled manner -- for example on enterprise distros.
> >> ... which would then make these systems more likely to crash, because
> >> there is no way to distinguish a rather harmless warning from a severe
> >> warning :/ . But let's see if some kdump() folks will share their
> >> opinion as reply to the cover letter.
> >
> > I can understand the intention of this patch, and I totally agree that
> > BUG() should be used carefully, this is a good proposal if we can
> > clearly define the standard about when to use BUG(). But I do have
>
> Essentially, the general rule from Linus is "absolutely no new BUG_ON()
> calls ever" -- but I think the consensus in that thread was that there
> are corner cases when it comes to unavoidable data corruption/security
> issues. And these are rare cases, not the usual case where we'd have
> used BUG_ON()/VM_BUG_ON().

Yes, probably.. (say probably because those cases are hidden and not
clear sometimes)

>
> > some worries, I think this standard is different for different sub
> > components, it is not clear to me at least, so this may introduce an
> > unstable running kernel and cause troubles (eg. data corruption) with
> > a WARN instead of a BUG. Probably it would be better to say "Do not
> > WARN lightly, and do not hesitate to use BUG if it is really needed"?
>
>
> Well, I don't make the rules, I document them and share them for general
> awareness/comments :) Documenting this is valuable, because there seem
> to be quite some different opinions floating around in the community --
> and I've been learning different rules from different people over the years.

Understand.

>
> >
> > About "patch_on_warn", it will depend on the admin/end user to set it,
> > it is not a good idea for distribution to set it. It seems we are
> > leaving it to end users to take the risk of a kernel panic even with
> > all kernel WARN even if it is sometimes not necessary.
>
> My question would be what we could add/improve to keep systems with
> kdump armed running as expected for end users, that is most probably:
>
> 1) don't crash on harmless WARN() that can just be reported and the
> machine will continue running mostly fine without real issues.
> 2) crash on severe issues (previously BUG) such that we can properly
> capture a system dump via kdump. The restart the machine.
>
> Of course, once one would run into 2), one could try reproducing with
> "panic_on_warn" to get a reasonable system dump. But I guess that's not
> what enterprise customers expect.
>

Sometimes the bug can not be easily reproduced again. So there seems
no easy and good way to use..

>
> One wild idea (in the cover letter) was to add something new that can be
> configured by user space and that expresses that something is more
> severe than just some warning that can be recovered easily. But it can
> eventually be recovered to keep the system running to some degree. But
> still, it's configurable if we want to trigger a panic or let the system
> run.
>
> John mentioned PANIC_ON().
>

I would vote for PANIC_ON(), it sounds like a good idea, because
BUG_ON() is not obvious and, PANIC_ON() can alert the code author that
this will cause a kernel panic and one will be more careful before
using it.

>
> What would be your expectation for kdump users under which conditions we
> want to trigger kdump and when not?
>
> Regarding panic_on_warn, how often do e.g., RHEL users observe warnings
> that we're not able to catch during testing, such that "panic_on_warn"
> would be a real no-go?

Well, I'm not sure how to answer the questions, when to panic should
be decided by kernel developers instead of kdump users, but I think
the panic behaviour does impact the supporting team. I added Stephen
who is from the RH supporting team, maybe he can have some inputs.

BTW, I vaguely remember Prarit introduced the panic_on_warn, see if he
has any comments here.

Thanks
Dave



>
> --
> Thanks,
>
> David / dhildenb
>

2022-08-29 03:33:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On Sun, Aug 28, 2022 at 6:56 PM Dave Young <[email protected]> wrote:
>
> > John mentioned PANIC_ON().
>
> I would vote for PANIC_ON(), it sounds like a good idea, because
> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that
> this will cause a kernel panic and one will be more careful before
> using it.

People, NO.

We're trying to get rid of BUG_ON() because it kills the machine.

Not replace it with another bogus thing that kills a machine.

So no PANIC_ON(). We used to have "panic()" many many years ago, we
got rid of it. We're not re-introducing it.

People who want to panic on warnings can do so. WARN_ON() _becomes_
PANIC for those people. But those people are the "we have a million
machines, we want to just fail things on any sign of trouble, and we
have MIS people who can look at the logs".

And it's not like we need to get rid of _all_ BUG_ON() cases. If you
have a "this is major internal corruption, there's no way we can
continue", then BUG_ON() is appropriate. It will try to kill that
process and try to keep the machine running, and again, the kind of
people who don't care about one machine (because - again - they have
millions of them) can just turn that into a panic-and-reboot
situation.

But the kind of people for whom the machine they are on IS THEIR ONLY
MACHINE - whether it be a workstation, a laptop, or a cellphone -
there is absolutely zero situation where "let's just kill the machine"
is *EVER* approproate. Even a BUG_ON() will try to continue as well as
it can after killing the current thread, but it's going to be iffy,
because locking etc.

So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for
"oops, I really don't know what to do, and I physically *cannot*
continue" (and that is *not* "I'm too lazy to do error handling").

There is no room for PANIC. None. Ever.

The only thing there is are "I don't care about this machine because
I've got 999,999 other machines, so I'd rather take one machine
offline for analysis".

Understand? The "should I panic and reboot" is fundamentally not about
the code, and it's not a choice that the kernel code gets to make.
It's purely about the choice of the person maintaining the machine.

As a kernel developer, you do not EVER get to say "panic" or "kill the machine".

End of story.

Linus

2022-08-29 05:01:15

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On 8/28/22 20:07, Linus Torvalds wrote:
> On Sun, Aug 28, 2022 at 6:56 PM Dave Young <[email protected]> wrote:
>>
>>> John mentioned PANIC_ON().
>>
>> I would vote for PANIC_ON(), it sounds like a good idea, because
>> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that
>> this will cause a kernel panic and one will be more careful before
>> using it.
>
> People, NO.
>
> We're trying to get rid of BUG_ON() because it kills the machine.
>
> Not replace it with another bogus thing that kills a machine.

OK, this guidance, and the write-up that follows it, is all very clear,
except for one point...

>
> So no PANIC_ON(). We used to have "panic()" many many years ago, we
> got rid of it. We're not re-introducing it.

...here. I count ~1000 calls to panic() in today's kernel, to a
function in kernel/panic.c that shows no hint of being removed, nor
even deprecated.

$ git grep -nw panic\( | wc -l
1321

Could you please clarify that point? (I'm not trying to debate
policy, but rather, to figure out what we should write in the
documentation of the policy.)


thanks,

--
John Hubbard
NVIDIA

2022-08-29 09:15:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On 29.08.22 05:07, Linus Torvalds wrote:
> On Sun, Aug 28, 2022 at 6:56 PM Dave Young <[email protected]> wrote:
>>
>>> John mentioned PANIC_ON().
>>
>> I would vote for PANIC_ON(), it sounds like a good idea, because
>> BUG_ON() is not obvious and, PANIC_ON() can alert the code author that
>> this will cause a kernel panic and one will be more careful before
>> using it.
>
> People, NO.
>
> We're trying to get rid of BUG_ON() because it kills the machine.
>
> Not replace it with another bogus thing that kills a machine.
>
> So no PANIC_ON(). We used to have "panic()" many many years ago, we
> got rid of it. We're not re-introducing it.
>
> People who want to panic on warnings can do so. WARN_ON() _becomes_
> PANIC for those people. But those people are the "we have a million
> machines, we want to just fail things on any sign of trouble, and we
> have MIS people who can look at the logs".
>
> And it's not like we need to get rid of _all_ BUG_ON() cases. If you
> have a "this is major internal corruption, there's no way we can
> continue", then BUG_ON() is appropriate. It will try to kill that
> process and try to keep the machine running, and again, the kind of
> people who don't care about one machine (because - again - they have
> millions of them) can just turn that into a panic-and-reboot
> situation.
>
> But the kind of people for whom the machine they are on IS THEIR ONLY
> MACHINE - whether it be a workstation, a laptop, or a cellphone -
> there is absolutely zero situation where "let's just kill the machine"
> is *EVER* approproate. Even a BUG_ON() will try to continue as well as
> it can after killing the current thread, but it's going to be iffy,
> because locking etc.
>
> So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for
> "oops, I really don't know what to do, and I physically *cannot*
> continue" (and that is *not* "I'm too lazy to do error handling").
>
> There is no room for PANIC. None. Ever.

Let me clearer what I had in mind, avoiding the PANIC_ON terminology
John raised. I was wondering if it would make sense to

1) Be able to specify a severity for WARN (developer decision)

2) Be able to specify a severity for panic_on_warn (admin decision)

Distributions with kdump could keep a mode whereby severe warnings
(e.g., former BUG_ON) would properly kdump+reboot and harmless warnings
(e.g., clean recovery path) would WARN once + continue.

I agree that whether to panic should in most cases be a decision of the
admin, not the developer.


Now, that's a different discussion then the documentation update at
hand, and I primary wanted to raise awareness for the kdump people, and
ask them if a stronger move towards WARN_ON_ONCE will affect
them/customer expectations.

I'll work with John to document the current rules to reflect everything
you said here.

--
Thanks,

David / dhildenb

2022-08-29 09:28:37

by Jani Nikula

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On Sun, 28 Aug 2022, Linus Torvalds <[email protected]> wrote:
> So WARN_ON_ONCE() is the thing to aim for. BUG_ON() is the thing for
> "oops, I really don't know what to do, and I physically *cannot*
> continue" (and that is *not* "I'm too lazy to do error handling").

Any insight for the tradeoff between WARN_ON_ONCE() and WARN_ON(),
i.e. wasting the static once variable per use site vs. littering the
dmesg on every hit? I see there have been some improvements with the
__WARN_FLAGS() stuff, but is the data use really neglible?

BR,
Jani.


--
Jani Nikula, Intel Open Source Graphics Center

2022-08-29 17:48:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

On Sun, Aug 28, 2022 at 9:49 PM John Hubbard <[email protected]> wrote:
>
> ...here. I count ~1000 calls to panic() in today's kernel, to a
> function in kernel/panic.c that shows no hint of being removed, nor
> even deprecated.

Heh. I guess we never finished the panic() removal.

It's been decades, I suspect we ended up deciding that the bootup
failures might as well continue to panic.

Anyway, please don't use it. It's one of those things that should
never ever trigger, and mainly for something like "oops, I ran out of
memory during boot" etc.

Oh, I'm sure it's crept into other places too, but that doesn't make it ok.

Linus