LinuxLists.cc - [PATCH] Documentation: coding-style: don't encourage WARN*()

2024-04-14 17:09:04

Subject: [PATCH] Documentation: coding-style: don't encourage WARN*()

Several times recently Greg KH has admonished that variants of WARN()
should not be used, because when the panic_on_warn kernel option is set,
their use can lead to a panic. His reasoning was that the majority of
Linux instances (including Android and cloud systems) run with this option
enabled. And therefore a condition leading to a warning will frequently
cause an undesirable panic.

The "coding-style.rst" document says not to worry about this kernel
option. Update it to provide a more nuanced explanation.

Signed-off-by: Alex Elder <[email protected]>
---
Documentation/process/coding-style.rst | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
index 9c7cf73473943..bce43b01721cb 100644
--- a/Documentation/process/coding-style.rst
+++ b/Documentation/process/coding-style.rst
@@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
to trigger easily, for example, by user space actions. pr_warn_once() is a
possible alternative, if you need to notify the user of a problem.

-Do not worry about panic_on_warn users
-**************************************
+The panic_on_warn kernel option
+********************************

-A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
-available kernel option, and that many users set this option. This is why
-there is a "Do not WARN lightly" writeup, above. However, the existence of
-panic_on_warn users is not a valid reason to avoid the judicious use
-WARN*(). That is because, whoever enables panic_on_warn has explicitly
-asked the kernel to crash if a WARN*() fires, and such users must be
-prepared to deal with the consequences of a system that is somewhat more
-likely to crash.
+Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
+a WARN*() call whose condition holds leads to a kernel panic. Many users
+(including Android and many cloud providers) set this option, and this is
+why there is a "Do not WARN lightly" writeup, above.
+
+The existence of this option is not a valid reason to avoid the judicious
+use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
+issue warnings but do **not** cause the kernel to crash. Use these if you
+want to prevent such panics.

Use BUILD_BUG_ON() for compile-time assertions
**********************************************
--
2.40.1

2024-04-14 19:48:57

by Laurent Pinchart

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

Hi Alex,

Thank you for the patch.

On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> Several times recently Greg KH has admonished that variants of WARN()
> should not be used, because when the panic_on_warn kernel option is set,
> their use can lead to a panic. His reasoning was that the majority of
> Linux instances (including Android and cloud systems) run with this option
> enabled. And therefore a condition leading to a warning will frequently
> cause an undesirable panic.
>
> The "coding-style.rst" document says not to worry about this kernel
> option. Update it to provide a more nuanced explanation.
>
> Signed-off-by: Alex Elder <[email protected]>
> ---
> Documentation/process/coding-style.rst | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 9c7cf73473943..bce43b01721cb 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> to trigger easily, for example, by user space actions. pr_warn_once() is a
> possible alternative, if you need to notify the user of a problem.
>
> -Do not worry about panic_on_warn users
> -**************************************
> +The panic_on_warn kernel option
> +********************************
>
> -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> -available kernel option, and that many users set this option. This is why
> -there is a "Do not WARN lightly" writeup, above. However, the existence of
> -panic_on_warn users is not a valid reason to avoid the judicious use
> -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> -asked the kernel to crash if a WARN*() fires, and such users must be
> -prepared to deal with the consequences of a system that is somewhat more
> -likely to crash.
> +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> +a WARN*() call whose condition holds leads to a kernel panic. Many users
> +(including Android and many cloud providers) set this option, and this is
> +why there is a "Do not WARN lightly" writeup, above.
> +
> +The existence of this option is not a valid reason to avoid the judicious
> +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> +issue warnings but do **not** cause the kernel to crash. Use these if you
> +want to prevent such panics.

Those options are not equivalent, they print a single message, which is
much easier to ignore. WARN() is similar to -Werror in some sense, it
pushes vendors to fix the warnings. I have used WARN() in the past to
indicate usage of long-deprecated APIs that we were getting close to
removing for instance. dev_warn() wouldn't have had the same effect.

>
> Use BUILD_BUG_ON() for compile-time assertions
> **********************************************

--
Regards,

Laurent Pinchart

2024-04-14 20:06:48

by Alex Elder

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On 4/14/24 2:48 PM, Laurent Pinchart wrote:
> Hi Alex,
>
> Thank you for the patch.
>
> On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
>> Several times recently Greg KH has admonished that variants of WARN()
>> should not be used, because when the panic_on_warn kernel option is set,
>> their use can lead to a panic. His reasoning was that the majority of
>> Linux instances (including Android and cloud systems) run with this option
>> enabled. And therefore a condition leading to a warning will frequently
>> cause an undesirable panic.
>>
>> The "coding-style.rst" document says not to worry about this kernel
>> option. Update it to provide a more nuanced explanation.
>>
>> Signed-off-by: Alex Elder <[email protected]>
>> ---
>> Documentation/process/coding-style.rst | 21 +++++++++++----------
>> 1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
>> index 9c7cf73473943..bce43b01721cb 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
>> to trigger easily, for example, by user space actions. pr_warn_once() is a
>> possible alternative, if you need to notify the user of a problem.
>>
>> -Do not worry about panic_on_warn users
>> -**************************************
>> +The panic_on_warn kernel option
>> +********************************
>>
>> -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
>> -available kernel option, and that many users set this option. This is why
>> -there is a "Do not WARN lightly" writeup, above. However, the existence of
>> -panic_on_warn users is not a valid reason to avoid the judicious use
>> -WARN*(). That is because, whoever enables panic_on_warn has explicitly
>> -asked the kernel to crash if a WARN*() fires, and such users must be
>> -prepared to deal with the consequences of a system that is somewhat more
>> -likely to crash.
>> +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
>> +a WARN*() call whose condition holds leads to a kernel panic. Many users
>> +(including Android and many cloud providers) set this option, and this is
>> +why there is a "Do not WARN lightly" writeup, above.
>> +
>> +The existence of this option is not a valid reason to avoid the judicious
>> +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
>> +issue warnings but do **not** cause the kernel to crash. Use these if you
>> +want to prevent such panics.
>
> Those options are not equivalent, they print a single message, which is
> much easier to ignore. WARN() is similar to -Werror in some sense, it
> pushes vendors to fix the warnings. I have used WARN() in the past to
> indicate usage of long-deprecated APIs that we were getting close to
> removing for instance. dev_warn() wouldn't have had the same effect.

Honestly, I feel somewhat the same way--that WARN() has a use
that differs from dev_warn(). E.g., in places where something
"won't happen" (but conceivably could if someone was developing
a future change and violated an assumption).

Nevertheless, if panic_on_warn is used in Android and cloud
scenarios as Greg says, he's right that it affects many, many
systems. Perhaps it's better to more strongly discourage the
use of that option?

I saw this "don't worry about it" message and felt it at least
ought to be toned down. The broader question of whether WARN()
should generally not be used might need some more discussion.

-Alex

>> Use BUILD_BUG_ON() for compile-time assertions
>> **********************************************
>

2024-04-15 05:21:52

by Greg KH

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Sun, Apr 14, 2024 at 10:48:35PM +0300, Laurent Pinchart wrote:
> Hi Alex,
>
> Thank you for the patch.
>
> On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> > Several times recently Greg KH has admonished that variants of WARN()
> > should not be used, because when the panic_on_warn kernel option is set,
> > their use can lead to a panic. His reasoning was that the majority of
> > Linux instances (including Android and cloud systems) run with this option
> > enabled. And therefore a condition leading to a warning will frequently
> > cause an undesirable panic.
> >
> > The "coding-style.rst" document says not to worry about this kernel
> > option. Update it to provide a more nuanced explanation.
> >
> > Signed-off-by: Alex Elder <[email protected]>
> > ---
> > Documentation/process/coding-style.rst | 21 +++++++++++----------
> > 1 file changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > index 9c7cf73473943..bce43b01721cb 100644
> > --- a/Documentation/process/coding-style.rst
> > +++ b/Documentation/process/coding-style.rst
> > @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> > to trigger easily, for example, by user space actions. pr_warn_once() is a
> > possible alternative, if you need to notify the user of a problem.
> >
> > -Do not worry about panic_on_warn users
> > -**************************************
> > +The panic_on_warn kernel option
> > +********************************
> >
> > -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > -available kernel option, and that many users set this option. This is why
> > -there is a "Do not WARN lightly" writeup, above. However, the existence of
> > -panic_on_warn users is not a valid reason to avoid the judicious use
> > -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> > -asked the kernel to crash if a WARN*() fires, and such users must be
> > -prepared to deal with the consequences of a system that is somewhat more
> > -likely to crash.
> > +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> > +a WARN*() call whose condition holds leads to a kernel panic. Many users
> > +(including Android and many cloud providers) set this option, and this is
> > +why there is a "Do not WARN lightly" writeup, above.
> > +
> > +The existence of this option is not a valid reason to avoid the judicious
> > +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> > +issue warnings but do **not** cause the kernel to crash. Use these if you
> > +want to prevent such panics.
>
> Those options are not equivalent, they print a single message, which is
> much easier to ignore. WARN() is similar to -Werror in some sense, it
> pushes vendors to fix the warnings. I have used WARN() in the past to
> indicate usage of long-deprecated APIs that we were getting close to
> removing for instance. dev_warn() wouldn't have had the same effect.

If you want to reboot a box because someone called an "improper" api,
then sure, use WARN(), but that feels like a really bad idea. Just
remove the api and fix up all in-kernel users instead. Why wait?

If you want to show a traceback, then just print that out, but I've seen
that totally ignored as well, removing the api is usually the only way
to get people to actually notice, as then their builds break.

thanks,

greg k-h

2024-04-15 05:22:47

by Greg KH

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> Several times recently Greg KH has admonished that variants of WARN()
> should not be used, because when the panic_on_warn kernel option is set,
> their use can lead to a panic. His reasoning was that the majority of
> Linux instances (including Android and cloud systems) run with this option
> enabled. And therefore a condition leading to a warning will frequently
> cause an undesirable panic.
>
> The "coding-style.rst" document says not to worry about this kernel
> option. Update it to provide a more nuanced explanation.
>
> Signed-off-by: Alex Elder <[email protected]>

Thanks for writing this up:

Reviewed-by: Greg Kroah-Hartman <[email protected]>

2024-04-15 08:25:54

by Laurent Pinchart

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

Hi Greg,

On Mon, Apr 15, 2024 at 07:21:37AM +0200, Greg KH wrote:
> On Sun, Apr 14, 2024 at 10:48:35PM +0300, Laurent Pinchart wrote:
> > On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> > > Several times recently Greg KH has admonished that variants of WARN()
> > > should not be used, because when the panic_on_warn kernel option is set,
> > > their use can lead to a panic. His reasoning was that the majority of
> > > Linux instances (including Android and cloud systems) run with this option
> > > enabled. And therefore a condition leading to a warning will frequently
> > > cause an undesirable panic.
> > >
> > > The "coding-style.rst" document says not to worry about this kernel
> > > option. Update it to provide a more nuanced explanation.
> > >
> > > Signed-off-by: Alex Elder <[email protected]>
> > > ---
> > > Documentation/process/coding-style.rst | 21 +++++++++++----------
> > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > > index 9c7cf73473943..bce43b01721cb 100644
> > > --- a/Documentation/process/coding-style.rst
> > > +++ b/Documentation/process/coding-style.rst
> > > @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> > > to trigger easily, for example, by user space actions. pr_warn_once() is a
> > > possible alternative, if you need to notify the user of a problem.
> > >
> > > -Do not worry about panic_on_warn users
> > > -**************************************
> > > +The panic_on_warn kernel option
> > > +********************************
> > >
> > > -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > > -available kernel option, and that many users set this option. This is why
> > > -there is a "Do not WARN lightly" writeup, above. However, the existence of
> > > -panic_on_warn users is not a valid reason to avoid the judicious use
> > > -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> > > -asked the kernel to crash if a WARN*() fires, and such users must be
> > > -prepared to deal with the consequences of a system that is somewhat more
> > > -likely to crash.
> > > +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> > > +a WARN*() call whose condition holds leads to a kernel panic. Many users
> > > +(including Android and many cloud providers) set this option, and this is
> > > +why there is a "Do not WARN lightly" writeup, above.
> > > +
> > > +The existence of this option is not a valid reason to avoid the judicious
> > > +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> > > +issue warnings but do **not** cause the kernel to crash. Use these if you
> > > +want to prevent such panics.
> >
> > Those options are not equivalent, they print a single message, which is
> > much easier to ignore. WARN() is similar to -Werror in some sense, it
> > pushes vendors to fix the warnings. I have used WARN() in the past to
> > indicate usage of long-deprecated APIs that we were getting close to
> > removing for instance. dev_warn() wouldn't have had the same effect.
>
> If you want to reboot a box because someone called an "improper" api,

I don't "want" to reboot. It came as a side effect when panic_on_warn
was added, and worsened when its adoption increased. I won't argued for
or against panic_on_warn, but WARN() serves some use cases today that I
consider valid. If we want to discourage its usage, we need another API
to cover those use cases.

> then sure, use WARN(), but that feels like a really bad idea. Just
> remove the api and fix up all in-kernel users instead. Why wait?

There are multiple use cases. One of them is to make sure no new user of
the old, deprecated behaviour is introduced. This is especially
important when driver development spans multiple kernel releases, the
development can start before the API behaviour changes, with the driver
merged after the API change. This is something we've done multiple times
in V4L2.

> If you want to show a traceback, then just print that out, but I've seen
> that totally ignored as well, removing the api is usually the only way
> to get people to actually notice, as then their builds break.

Does your experience tell that tracebacks are routinely ignored during
development too, not just in production ?

--
Regards,

Laurent Pinchart

2024-04-15 08:33:52

by Greg KH

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 11:25:29AM +0300, Laurent Pinchart wrote:
> Hi Greg,
>
> On Mon, Apr 15, 2024 at 07:21:37AM +0200, Greg KH wrote:
> > On Sun, Apr 14, 2024 at 10:48:35PM +0300, Laurent Pinchart wrote:
> > > On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> > > > Several times recently Greg KH has admonished that variants of WARN()
> > > > should not be used, because when the panic_on_warn kernel option is set,
> > > > their use can lead to a panic. His reasoning was that the majority of
> > > > Linux instances (including Android and cloud systems) run with this option
> > > > enabled. And therefore a condition leading to a warning will frequently
> > > > cause an undesirable panic.
> > > >
> > > > The "coding-style.rst" document says not to worry about this kernel
> > > > option. Update it to provide a more nuanced explanation.
> > > >
> > > > Signed-off-by: Alex Elder <[email protected]>
> > > > ---
> > > > Documentation/process/coding-style.rst | 21 +++++++++++----------
> > > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > > > index 9c7cf73473943..bce43b01721cb 100644
> > > > --- a/Documentation/process/coding-style.rst
> > > > +++ b/Documentation/process/coding-style.rst
> > > > @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> > > > to trigger easily, for example, by user space actions. pr_warn_once() is a
> > > > possible alternative, if you need to notify the user of a problem.
> > > >
> > > > -Do not worry about panic_on_warn users
> > > > -**************************************
> > > > +The panic_on_warn kernel option
> > > > +********************************
> > > >
> > > > -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > > > -available kernel option, and that many users set this option. This is why
> > > > -there is a "Do not WARN lightly" writeup, above. However, the existence of
> > > > -panic_on_warn users is not a valid reason to avoid the judicious use
> > > > -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> > > > -asked the kernel to crash if a WARN*() fires, and such users must be
> > > > -prepared to deal with the consequences of a system that is somewhat more
> > > > -likely to crash.
> > > > +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> > > > +a WARN*() call whose condition holds leads to a kernel panic. Many users
> > > > +(including Android and many cloud providers) set this option, and this is
> > > > +why there is a "Do not WARN lightly" writeup, above.
> > > > +
> > > > +The existence of this option is not a valid reason to avoid the judicious
> > > > +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> > > > +issue warnings but do **not** cause the kernel to crash. Use these if you
> > > > +want to prevent such panics.
> > >
> > > Those options are not equivalent, they print a single message, which is
> > > much easier to ignore. WARN() is similar to -Werror in some sense, it
> > > pushes vendors to fix the warnings. I have used WARN() in the past to
> > > indicate usage of long-deprecated APIs that we were getting close to
> > > removing for instance. dev_warn() wouldn't have had the same effect.
> >
> > If you want to reboot a box because someone called an "improper" api,
>
> I don't "want" to reboot. It came as a side effect when panic_on_warn
> was added, and worsened when its adoption increased. I won't argued for
> or against panic_on_warn, but WARN() serves some use cases today that I
> consider valid. If we want to discourage its usage, we need another API
> to cover those use cases.
>
> > then sure, use WARN(), but that feels like a really bad idea. Just
> > remove the api and fix up all in-kernel users instead. Why wait?
>
> There are multiple use cases. One of them is to make sure no new user of
> the old, deprecated behaviour is introduced. This is especially
> important when driver development spans multiple kernel releases, the
> development can start before the API behaviour changes, with the driver
> merged after the API change. This is something we've done multiple times
> in V4L2.
>
> > If you want to show a traceback, then just print that out, but I've seen
> > that totally ignored as well, removing the api is usually the only way
> > to get people to actually notice, as then their builds break.
>
> Does your experience tell that tracebacks are routinely ignored during
> development too, not just in production ?

Yes, we have done this in the past in some driver core apis and nothing
ever changed until we actually deleted the apis.

thanks,

greg k-h

2024-04-15 08:35:32

by Greg KH

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 01:07:41AM -0700, Christoph Hellwig wrote:
> No, this advice is wronger than wrong. If you set panic_on_warn you
> get to keep the pieces.
>

But don't add new WARN() calls please, just properly clean up and handle
the error. And any WARN() that userspace can trigger ends up triggering
syzbot reports which also is a major pain, even if you don't have
panic_on_warn enabled.

And I think the "do not use panic_on_warn" recommendation has been
ignored, given the huge use of it by vendors who have enabled it (i.e.
all Samsung phones and cloud servers).

thanks,

greg k-h

2024-04-15 08:46:30

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 10:35:21AM +0200, Greg KH wrote:
> On Mon, Apr 15, 2024 at 01:07:41AM -0700, Christoph Hellwig wrote:
> > No, this advice is wronger than wrong. If you set panic_on_warn you
> > get to keep the pieces.
> >
>
> But don't add new WARN() calls please, just properly clean up and handle
> the error. And any WARN() that userspace can trigger ends up triggering
> syzbot reports which also is a major pain, even if you don't have
> panic_on_warn enabled.

Important distinction here: WARN_ON_ONCE is for internal error
checking and absolutely intentional, and does not replace error
handling, that's why it passes the error value through. OF course
it should not be trigger by user action.

> And I think the "do not use panic_on_warn" recommendation has been
> ignored, given the huge use of it by vendors who have enabled it (i.e.
> all Samsung phones and cloud servers).

Sucks for them.

2024-04-15 08:55:04

by Laurent Pinchart

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 10:33:42AM +0200, Greg KH wrote:
> On Mon, Apr 15, 2024 at 11:25:29AM +0300, Laurent Pinchart wrote:
> > On Mon, Apr 15, 2024 at 07:21:37AM +0200, Greg KH wrote:
> > > On Sun, Apr 14, 2024 at 10:48:35PM +0300, Laurent Pinchart wrote:
> > > > On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> > > > > Several times recently Greg KH has admonished that variants of WARN()
> > > > > should not be used, because when the panic_on_warn kernel option is set,
> > > > > their use can lead to a panic. His reasoning was that the majority of
> > > > > Linux instances (including Android and cloud systems) run with this option
> > > > > enabled. And therefore a condition leading to a warning will frequently
> > > > > cause an undesirable panic.
> > > > >
> > > > > The "coding-style.rst" document says not to worry about this kernel
> > > > > option. Update it to provide a more nuanced explanation.
> > > > >
> > > > > Signed-off-by: Alex Elder <[email protected]>
> > > > > ---
> > > > > Documentation/process/coding-style.rst | 21 +++++++++++----------
> > > > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > > > > index 9c7cf73473943..bce43b01721cb 100644
> > > > > --- a/Documentation/process/coding-style.rst
> > > > > +++ b/Documentation/process/coding-style.rst
> > > > > @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> > > > > to trigger easily, for example, by user space actions. pr_warn_once() is a
> > > > > possible alternative, if you need to notify the user of a problem.
> > > > >
> > > > > -Do not worry about panic_on_warn users
> > > > > -**************************************
> > > > > +The panic_on_warn kernel option
> > > > > +********************************
> > > > >
> > > > > -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > > > > -available kernel option, and that many users set this option. This is why
> > > > > -there is a "Do not WARN lightly" writeup, above. However, the existence of
> > > > > -panic_on_warn users is not a valid reason to avoid the judicious use
> > > > > -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> > > > > -asked the kernel to crash if a WARN*() fires, and such users must be
> > > > > -prepared to deal with the consequences of a system that is somewhat more
> > > > > -likely to crash.
> > > > > +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> > > > > +a WARN*() call whose condition holds leads to a kernel panic. Many users
> > > > > +(including Android and many cloud providers) set this option, and this is
> > > > > +why there is a "Do not WARN lightly" writeup, above.
> > > > > +
> > > > > +The existence of this option is not a valid reason to avoid the judicious
> > > > > +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> > > > > +issue warnings but do **not** cause the kernel to crash. Use these if you
> > > > > +want to prevent such panics.
> > > >
> > > > Those options are not equivalent, they print a single message, which is
> > > > much easier to ignore. WARN() is similar to -Werror in some sense, it
> > > > pushes vendors to fix the warnings. I have used WARN() in the past to
> > > > indicate usage of long-deprecated APIs that we were getting close to
> > > > removing for instance. dev_warn() wouldn't have had the same effect.
> > >
> > > If you want to reboot a box because someone called an "improper" api,
> >
> > I don't "want" to reboot. It came as a side effect when panic_on_warn
> > was added, and worsened when its adoption increased. I won't argued for
> > or against panic_on_warn, but WARN() serves some use cases today that I
> > consider valid. If we want to discourage its usage, we need another API
> > to cover those use cases.
> >
> > > then sure, use WARN(), but that feels like a really bad idea. Just
> > > remove the api and fix up all in-kernel users instead. Why wait?
> >
> > There are multiple use cases. One of them is to make sure no new user of
> > the old, deprecated behaviour is introduced. This is especially
> > important when driver development spans multiple kernel releases, the
> > development can start before the API behaviour changes, with the driver
> > merged after the API change. This is something we've done multiple times
> > in V4L2.
> >
> > > If you want to show a traceback, then just print that out, but I've seen
> > > that totally ignored as well, removing the api is usually the only way
> > > to get people to actually notice, as then their builds break.
> >
> > Does your experience tell that tracebacks are routinely ignored during
> > development too, not just in production ?
>
> Yes, we have done this in the past in some driver core apis and nothing
> ever changed until we actually deleted the apis.

Let's keep WARN() + panic_on_warn then, it should help making people
notice :-)

Jokes aside, if we want to discourage new users of WARN() because of
panic_on_warn, I'd like a WARN_NO_PANIC().

--
Regards,

Laurent Pinchart

2024-04-15 12:06:39

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

No, this advice is wronger than wrong. If you set panic_on_warn you
get to keep the pieces.

2024-04-15 16:26:51

by Kees Cook

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 10:35:21AM +0200, Greg KH wrote:
> On Mon, Apr 15, 2024 at 01:07:41AM -0700, Christoph Hellwig wrote:
> > No, this advice is wronger than wrong. If you set panic_on_warn you
> > get to keep the pieces.
> >
>
> But don't add new WARN() calls please, just properly clean up and handle
> the error. And any WARN() that userspace can trigger ends up triggering
> syzbot reports which also is a major pain, even if you don't have
> panic_on_warn enabled.

Here's what was more recently written on WARN:

https://docs.kernel.org/process/deprecated.html#bug-and-bug-on

Specifically:

- never use BUG*()
- WARN*() should only be used for "expected to be unreachable" situations

This, then, maps correctly to panic_on_warn: System owners may have set
the panic_on_warn sysctl, to make sure their systems do not continue
running in the face of "unreachable" conditions.

As in, userspace should _never_ be able to reach a WARN(). If it can,
either the logic leading to it needs to be fixed, or the WARN() needs to
be changed to a pr_warn().

--
Kees Cook

2024-04-18 16:12:12

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Mon, Apr 15, 2024 at 09:26:40AM -0700, Kees Cook wrote:
> On Mon, Apr 15, 2024 at 10:35:21AM +0200, Greg KH wrote:
> > On Mon, Apr 15, 2024 at 01:07:41AM -0700, Christoph Hellwig wrote:
> > > No, this advice is wronger than wrong. If you set panic_on_warn you
> > > get to keep the pieces.
> > >
> >
> > But don't add new WARN() calls please, just properly clean up and handle
> > the error. And any WARN() that userspace can trigger ends up triggering
> > syzbot reports which also is a major pain, even if you don't have
> > panic_on_warn enabled.
>
> Here's what was more recently written on WARN:
>
> https://docs.kernel.org/process/deprecated.html#bug-and-bug-on
>
> Specifically:
>
> - never use BUG*()
> - WARN*() should only be used for "expected to be unreachable" situations
>
> This, then, maps correctly to panic_on_warn: System owners may have set
> the panic_on_warn sysctl, to make sure their systems do not continue
> running in the face of "unreachable" conditions.
>
> As in, userspace should _never_ be able to reach a WARN(). If it can,
> either the logic leading to it needs to be fixed, or the WARN() needs to
> be changed to a pr_warn().

Exactly! No doubt there are mistakes, but we already document this too
a few lines above where this is touching:

Do not WARN lightly
*******************

WARN*() is intended for unexpected, this-should-never-happen situations.
WARN*() macros are not to be used for anything that is expected to happen
during normal operation. These are not pre- or post-condition asserts, for
example. Again: WARN*() must not be used for a condition that is expected
to trigger easily, for example, by user space actions. pr_warn_once() is a
possible alternative, if you need to notify the user of a problem.

Usages following that advice should be left alone and more should be
added. Invariant checks that indicate the kernel is malfunctioning are
desirable things to have!

Yes, by all means tell people to follow the above rules! But that
isn't a ban on WARN and shouldn't be communicated as "don't add new
WARN() calls please".

Let's all keep in mind that fuzzing reports are incredibly valuable to
make the kernel more secure and robust. We actually want *more*
invariants that indicate bugs for the fuzzers to trip up on!

As above, a correctly used WARN, should indicate a certain bug if it
triggers.

I'd guess about 30-40% of the syzkaller found bugs I've delt with are
from a correct use of WARN_ON not oops/kasn/etc. I wonder what a
datamine on the whole syzkaller database would indicate.

pr_warn/etc don't trigger fuzzer faults, and don't give a debugging
backtrace.

I also find it strange to want panic_on_warn to exist, and people want
to use it, while also saying that the WARN() calls that actually make
it do something and be valuable are forbidden :(

Jason

2024-04-18 16:14:50

by Eric Biggers

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 9c7cf73473943..bce43b01721cb 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> to trigger easily, for example, by user space actions. pr_warn_once() is a
> possible alternative, if you need to notify the user of a problem.
>
> -Do not worry about panic_on_warn users
> -**************************************
> +The panic_on_warn kernel option
> +********************************
>
> -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> -available kernel option, and that many users set this option. This is why
> -there is a "Do not WARN lightly" writeup, above. However, the existence of
> -panic_on_warn users is not a valid reason to avoid the judicious use
> -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> -asked the kernel to crash if a WARN*() fires, and such users must be
> -prepared to deal with the consequences of a system that is somewhat more
> -likely to crash.
> +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> +a WARN*() call whose condition holds leads to a kernel panic. Many users
> +(including Android and many cloud providers) set this option, and this is
> +why there is a "Do not WARN lightly" writeup, above.
> +
> +The existence of this option is not a valid reason to avoid the judicious
> +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> +issue warnings but do **not** cause the kernel to crash. Use these if you
> +want to prevent such panics.
>

Nacked-by: Eric Biggers <[email protected]>

WARN*() are for recoverable assertions, i.e. situations where the condition
being true can only happen due to a kernel bug but where they can be recovered
from (unlike BUG*() which are for unrecoverable situations). The people who use
panic_on_warn *want* the kernel to crash when such a situation happens so that
the underlying issue can be discovered and fixed. That's the whole point.

Also, it's not true that "Android" sets this option. It might be the case that
some individual Android OEMs have decided to use it for some reason (they do
have the ability to customize their kernel command line, after all). It's
certainly not used by default or even recommended.

- Eric

2024-04-18 17:12:55

by Kees Cook

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On Thu, Apr 18, 2024 at 09:14:30AM -0700, Eric Biggers wrote:
> Also, it's not true that "Android" sets this option. It might be the case that
> some individual Android OEMs have decided to use it for some reason (they do
> have the ability to customize their kernel command line, after all). It's
> certainly not used by default or even recommended.

Ah yes, you are right. I had misremembered; last I looked Android uses
of CONFIG_UBSAN_TRAP=y, which has a similar affect (effectively promoting
warning into bug) for UBSAN checks.

--
Kees Cook

2024-04-18 22:34:01

by John Hubbard

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On 4/18/24 9:14 AM, Eric Biggers wrote:
> On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
>> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
>> index 9c7cf73473943..bce43b01721cb 100644
>> --- a/Documentation/process/coding-style.rst
>> +++ b/Documentation/process/coding-style.rst
>> @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
>> to trigger easily, for example, by user space actions. pr_warn_once() is a
>> possible alternative, if you need to notify the user of a problem.
>>
>> -Do not worry about panic_on_warn users

This is still[1] good advice. panic_on_warn users have made a
trade-off that works for them, but that should not mean that
all of the valid cases for WARN*() suddenly disappear. In fact,
without the WARN*() calls, panic_on_warn is a no-op, as someone
else has already pointed out.

>> -**************************************
>> +The panic_on_warn kernel option
>> +********************************
>>
>> -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
>> -available kernel option, and that many users set this option. This is why
>> -there is a "Do not WARN lightly" writeup, above. However, the existence of
>> -panic_on_warn users is not a valid reason to avoid the judicious use
>> -WARN*(). That is because, whoever enables panic_on_warn has explicitly
>> -asked the kernel to crash if a WARN*() fires, and such users must be
>> -prepared to deal with the consequences of a system that is somewhat more
>> -likely to crash.
>> +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
>> +a WARN*() call whose condition holds leads to a kernel panic. Many users
>> +(including Android and many cloud providers) set this option, and this is
>> +why there is a "Do not WARN lightly" writeup, above.
>> +
>> +The existence of this option is not a valid reason to avoid the judicious
>> +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
>> +issue warnings but do **not** cause the kernel to crash. Use these if you
>> +want to prevent such panics.
>>
>
> Nacked-by: Eric Biggers <[email protected]>

Yes. I agree with this NAK.

>
> WARN*() are for recoverable assertions, i.e. situations where the condition
> being true can only happen due to a kernel bug but where they can be recovered
> from (unlike BUG*() which are for unrecoverable situations). The people who use
> panic_on_warn *want* the kernel to crash when such a situation happens so that
> the underlying issue can be discovered and fixed. That's the whole point.
>
> Also, it's not true that "Android" sets this option. It might be the case that
> some individual Android OEMs have decided to use it for some reason (they do
> have the ability to customize their kernel command line, after all). It's
> certainly not used by default or even recommended.
>
> - Eric
>

[1] https://lore.kernel.org/lkml/[email protected]/T/#md6836102150ac1e6265569aad55a692e3af75f34

thanks,
--
John Hubbard
NVIDIA

2024-04-19 07:16:55

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

On 14.04.24 19:08, Alex Elder wrote:
> Several times recently Greg KH has admonished that variants of WARN()
> should not be used, because when the panic_on_warn kernel option is set,
> their use can lead to a panic. His reasoning was that the majority of
> Linux instances (including Android and cloud systems) run with this option
> enabled. And therefore a condition leading to a warning will frequently
> cause an undesirable panic.
>
> The "coding-style.rst" document says not to worry about this kernel
> option. Update it to provide a more nuanced explanation.
>
> Signed-off-by: Alex Elder <[email protected]>
> ---
> Documentation/process/coding-style.rst | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> index 9c7cf73473943..bce43b01721cb 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> to trigger easily, for example, by user space actions. pr_warn_once() is a
> possible alternative, if you need to notify the user of a problem.
>
> -Do not worry about panic_on_warn users
> -**************************************
> +The panic_on_warn kernel option
> +********************************
>
> -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> -available kernel option, and that many users set this option. This is why
> -there is a "Do not WARN lightly" writeup, above. However, the existence of
> -panic_on_warn users is not a valid reason to avoid the judicious use
> -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> -asked the kernel to crash if a WARN*() fires, and such users must be
> -prepared to deal with the consequences of a system that is somewhat more
> -likely to crash.
> +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> +a WARN*() call whose condition holds leads to a kernel panic. Many users
> +(including Android and many cloud providers) set this option, and this is
> +why there is a "Do not WARN lightly" writeup, above.
> +
> +The existence of this option is not a valid reason to avoid the judicious
> +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> +issue warnings but do **not** cause the kernel to crash. Use these if you
> +want to prevent such panics.
>
> Use BUILD_BUG_ON() for compile-time assertions
> **********************************************

Did you even read the history about that? Likely not, otherwise I wouldn't
have to learn about this patch on lwn.net.

I suggest reading:

commit 1cfd9d7e43d5a1cf739d1420b10b1e65feb02f88
Author: David Hildenbrand <[email protected]>
Date: Fri Sep 23 13:34:24 2022 +0200

coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

which includes links to relevant discussions between me and Linus. Most
relevant to the discussion is [1].

All that's written in the document right now (use WARN_ON_ONCE() *lightly*) is precisely
what I still think we should do. That's the case *1.5 years* after I documented that.

Clear NACK from my side: "If you set 'panic_on_warn' you get to keep both
pieces when something breaks." [1]

[1] https://lore.kernel.org/all/CAHk-=wgF7K2gSSpy=m_=K3Nov4zaceUX9puQf1TjkTJLA2XC_g@mail.gmail.com/

--
Cheers,

David / dhildenb