2014-10-20 12:00:46

by Prarit Bhargava

[permalink] [raw]
Subject: [PATCH] kernel, add bug_on_warn

There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to the
user.

A much easier method would be a switch to change the WARN() over to a
BUG(). This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
warn_slowpath_common() path. The function will still print out the
location of the warning.

Successfully tested by me.

Cc: Jonathan Corbet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Fabian Frederick <[email protected]>
Cc: [email protected]
Signed-off-by: Prarit Bhargava <[email protected]>
---
Documentation/kernel-parameters.txt | 2 ++
kernel/panic.c | 16 +++++++++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7dbe5ec..2967542 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -553,6 +553,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
bttv.pll= See Documentation/video4linux/bttv/Insmod-options
bttv.tuner=

+ bug_on_warn BUG() instead of WARN()
+
bulk_remove=off [PPC] This parameter disables the use of the pSeries
firmware feature for flushing multiple hpte entries
at a time.
diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..258a7be 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -33,6 +33,7 @@ static int pause_on_oops;
static int pause_on_oops_flag;
static DEFINE_SPINLOCK(pause_on_oops_lock);
static bool crash_kexec_post_notifiers;
+static bool bug_on_warn;

int panic_timeout = CONFIG_PANIC_TIMEOUT;
EXPORT_SYMBOL_GPL(panic_timeout);
@@ -420,13 +421,19 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
{
disable_trace_on_warning();

- pr_warn("------------[ cut here ]------------\n");
+ if (!bug_on_warn)
+ pr_warn("------------[ cut here ]------------\n");
pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
raw_smp_processor_id(), current->pid, file, line, caller);

if (args)
vprintk(args->fmt, args->args);

+ if (bug_on_warn) {
+ pr_warn("bug_on_warn set, calling BUG()...\n");
+ BUG();
+ }
+
print_modules();
dump_stack();
print_oops_end_marker();
@@ -501,3 +508,10 @@ static int __init oops_setup(char *s)
return 0;
}
early_param("oops", oops_setup);
+
+static int __init bug_on_warn_setup(char *s)
+{
+ bug_on_warn = true;
+ return 0;
+}
+early_param("bug_on_warn", bug_on_warn_setup);
--
1.7.9.3


2014-10-20 22:24:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kernel, add bug_on_warn

On Mon, 20 Oct 2014 08:00:20 -0400 Prarit Bhargava <[email protected]> wrote:

> There have been several times where I have had to rebuild a kernel to
> cause a panic when hitting a WARN() in the code in order to get a crash
> dump from a system. Sometimes this is easy to do, other times (such as
> in the case of a remote admin) it is not trivial to send new images to the
> user.
>
> A much easier method would be a switch to change the WARN() over to a
> BUG(). This makes debugging easier in that I can now test the actual
> image the WARN() was seen on and I do not have to engage in remote
> debugging.
>
> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
> warn_slowpath_common() path. The function will still print out the
> location of the warning.
>
> Successfully tested by me.

Looks nice and simple and useful. However I suspect you're exclusively
focussed on "I want a crash dump" and things haven't been fully thought
through.

- Do you have any example WARN->BUG console output at hand? I'd like
to check for missing or duplicated info.

- Did you consider permitting this to be tweaked at runtime via
/proc? Sometimes we get pesky WARNs at boot time and having runtime
alteration would permit the user to prevent those from tripping a
BUG.

- Also, perhaps bug_on_warn should be single-shot: clear itself after
it has triggered one BUG. Because once the kernel has gone
WARN->BUG, it's probably messed up and is likely to trigger more
WARNs. Also, the kernel might generate many WARNs for the same
issue.

> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -553,6 +553,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> bttv.pll= See Documentation/video4linux/bttv/Insmod-options
> bttv.tuner=
>
> + bug_on_warn BUG() instead of WARN()

There's no mention here that this feature is mainly aimed at generating
a crash dump. How do we tell the people who aren't reading this email
thread (ie: all of humanity except you and me ;)) that this feature
even exists? Is there crash dump documentation that we can update?

2014-10-21 00:54:16

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] kernel, add bug_on_warn



On 10/20/2014 06:24 PM, Andrew Morton wrote:
> On Mon, 20 Oct 2014 08:00:20 -0400 Prarit Bhargava <[email protected]> wrote:
>
>> There have been several times where I have had to rebuild a kernel to
>> cause a panic when hitting a WARN() in the code in order to get a crash
>> dump from a system. Sometimes this is easy to do, other times (such as
>> in the case of a remote admin) it is not trivial to send new images to the
>> user.
>>
>> A much easier method would be a switch to change the WARN() over to a
>> BUG(). This makes debugging easier in that I can now test the actual
>> image the WARN() was seen on and I do not have to engage in remote
>> debugging.
>>
>> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
>> warn_slowpath_common() path. The function will still print out the
>> location of the warning.
>>
>> Successfully tested by me.
>
> Looks nice and simple and useful. However I suspect you're exclusively
> focussed on "I want a crash dump" and things haven't been fully thought
> through.
>
> - Do you have any example WARN->BUG console output at hand? I'd like
> to check for missing or duplicated info.

Yep, here you go, with some additional annotation notes from me. The first
line below is from the WARN_ON() to output the WARN_ON()'s location. After
that, we hit the new BUG() call.

WARNING: CPU: 27 PID: 3204 at
/home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
[dummy_module]()
bug_on_warn set, calling BUG()...
------------[ cut here ]------------
kernel BUG at kernel/panic.c:434!
invalid opcode: 0000 [#1] SMP
Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
dm_region_hash dm_log dm_mod
CPU: 27 PID: 3204 Comm: insmod Tainted: G OE 3.17.0+ #19
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
RMLSDP.86I.00.29.D696.1311111329 11/11/2013
task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
RIP: 0010:[<ffffffff81076b81>] [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
RSP: 0018:ffff8807fc5afc68 EFLAGS: 00010246
RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
FS: 00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
Stack:
ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
Call Trace:
[<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
[<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
[<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
[<ffffffff81002144>] do_one_initcall+0xd4/0x210
[<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
[<ffffffff810f8889>] load_module+0x16a9/0x1b30
[<ffffffff810f3d30>] ? store_uevent+0x70/0x70
[<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
[<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
[<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
RIP [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
RSP <ffff8807fc5afc68>
---[ end trace 428218934a12088b ]---
>
> - Did you consider permitting this to be tweaked at runtime via
> /proc? Sometimes we get pesky WARNs at boot time and having runtime
> alteration would permit the user to prevent those from tripping a
> BUG.
>

I did actually, but I was wondering how people liked the idea before I looked
at the /proc implementation. It's pretty much the same as panic_on_oops, so
it's not difficult to do.

> - Also, perhaps bug_on_warn should be single-shot: clear itself after
> it has triggered one BUG. Because once the kernel has gone
> WARN->BUG, it's probably messed up and is likely to trigger more
> WARNs. Also, the kernel might generate many WARNs for the same
> issue.

Okay, I'll add that.

>
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -553,6 +553,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>> bttv.pll= See Documentation/video4linux/bttv/Insmod-options
>> bttv.tuner=
>>
>> + bug_on_warn BUG() instead of WARN()
>
> There's no mention here that this feature is mainly aimed at generating
> a crash dump. How do we tell the people who aren't reading this email
> thread (ie: all of humanity except you and me ;)) that this feature
> even exists? Is there crash dump documentation that we can update?
>

I'll look into this too.

P.
>

2014-10-21 04:58:54

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH] kernel, add bug_on_warn

Hi Prarit,

(2014/10/21 9:54), Prarit Bhargava wrote:
>
>
> On 10/20/2014 06:24 PM, Andrew Morton wrote:
>> On Mon, 20 Oct 2014 08:00:20 -0400 Prarit Bhargava <[email protected]> wrote:
>>
>>> There have been several times where I have had to rebuild a kernel to
>>> cause a panic when hitting a WARN() in the code in order to get a crash
>>> dump from a system. Sometimes this is easy to do, other times (such as
>>> in the case of a remote admin) it is not trivial to send new images to the
>>> user.
>>>
>>> A much easier method would be a switch to change the WARN() over to a
>>> BUG(). This makes debugging easier in that I can now test the actual
>>> image the WARN() was seen on and I do not have to engage in remote
>>> debugging.
>>>
>>> This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
>>> warn_slowpath_common() path. The function will still print out the
>>> location of the warning.
>>>
>>> Successfully tested by me.
>>
>> Looks nice and simple and useful. However I suspect you're exclusively
>> focussed on "I want a crash dump" and things haven't been fully thought
>> through.
>>
>> - Do you have any example WARN->BUG console output at hand? I'd like
>> to check for missing or duplicated info.
>
> Yep, here you go, with some additional annotation notes from me. The first
> line below is from the WARN_ON() to output the WARN_ON()'s location. After
> that, we hit the new BUG() call.
>
> WARNING: CPU: 27 PID: 3204 at
> /home/rhel7/redhat/debug/dummy-module/dummy-module.c:25 init_dummy+0x28/0x30
> [dummy_module]()
> bug_on_warn set, calling BUG()...
> ------------[ cut here ]------------
> kernel BUG at kernel/panic.c:434!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: dummy_module(OE+) sg nfsv3 rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache cfg80211 rfkill x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel igb iTCO_wdt aesni_intel iTCO_vendor_support lrw gf128mul
> sb_edac ptp edac_core glue_helper lpc_ich ioatdma pcspkr ablk_helper pps_core
> i2c_i801 mfd_core cryptd dca shpchp ipmi_si wmi ipmi_msghandler acpi_cpufreq
> nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sr_mod cdrom sd_mod
> mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper isci ttm
> drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror
> dm_region_hash dm_log dm_mod
> CPU: 27 PID: 3204 Comm: insmod Tainted: G OE 3.17.0+ #19
> Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
> RMLSDP.86I.00.29.D696.1311111329 11/11/2013
> task: ffff880034e75160 ti: ffff8807fc5ac000 task.ti: ffff8807fc5ac000
> RIP: 0010:[<ffffffff81076b81>] [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
> RSP: 0018:ffff8807fc5afc68 EFLAGS: 00010246
> RAX: 0000000000000021 RBX: ffff8807fc5afcb0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffff88081efee5f8 RDI: ffff88081efee5f8
> RBP: ffff8807fc5afc98 R08: 0000000000000096 R09: 0000000000000000
> R10: 0000000000000711 R11: ffff8807fc5af93e R12: ffffffffa0424070
> R13: 0000000000000019 R14: ffffffffa0423068 R15: 0000000000000009
> FS: 00007f2d4b034740(0000) GS:ffff88081efe0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2d4a99f3c0 CR3: 00000007fd88b000 CR4: 00000000001407e0
> Stack:
> ffff8807fc5afcb8 ffffffff8199f020 ffff88080e396160 0000000000000000
> ffffffffa0423040 ffffffffa0425000 ffff8807fc5afd08 ffffffff81076be5
> 0000000000000008 ffffffffa0424053 ffff880700000018 ffff8807fc5afd18
> Call Trace:
> [<ffffffffa0423040>] ? dummy_greetings+0x40/0x40 [dummy_module]
> [<ffffffff81076be5>] warn_slowpath_fmt+0x55/0x70
> [<ffffffffa0423068>] init_dummy+0x28/0x30 [dummy_module]
> [<ffffffff81002144>] do_one_initcall+0xd4/0x210
> [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
> [<ffffffff810f8889>] load_module+0x16a9/0x1b30
> [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
> [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
> [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
> [<ffffffff8166ce29>] system_call_fastpath+0x12/0x17
> Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 20 42 8a 81 31 c0 e8 fc
> 80 5e 00 eb 80 48 c7 c7 78 42 8a 81 31 c0 e8 ec 80 5e 00 <0f> 0b 66 66 66 66 2e
> 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
> RIP [<ffffffff81076b81>] warn_slowpath_common+0xc1/0xd0
> RSP <ffff8807fc5afc68>
> ---[ end trace 428218934a12088b ]---
>>
>> - Did you consider permitting this to be tweaked at runtime via
>> /proc? Sometimes we get pesky WARNs at boot time and having runtime
>> alteration would permit the user to prevent those from tripping a
>> BUG.
>>
>
> I did actually, but I was wondering how people liked the idea before I looked
> at the /proc implementation. It's pretty much the same as panic_on_oops, so
> it's not difficult to do.
>
>> - Also, perhaps bug_on_warn should be single-shot: clear itself after
>> it has triggered one BUG. Because once the kernel has gone
>> WARN->BUG, it's probably messed up and is likely to trigger more
>> WARNs. Also, the kernel might generate many WARNs for the same
>> issue.
>
> Okay, I'll add that.

When you update it, please CC me.
Your patch works well as follows:


WARNING: CPU: 3 PID: 468 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
bug_on_warn set, calling BUG()...
------------[ cut here ]------------
kernel BUG at kernel/panic.c:434!
invalid opcode: 0000 [#1] SMP
<...>
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
task: ffff880866d8c3d0 ti: ffff88086227c000 task.ti: ffff88086227c000
RIP: 0010:[<ffffffff81074001>] [<ffffffff81074001>] warn_slowpath_common+0xc1/0xd0
RSP: 0018:ffff88086227fa68 EFLAGS: 00010246
RAX: 0000000000000021 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88087fa6e5f8 RDI: ffff88087fa6e5f8
RBP: ffff88086227fa98 R08: 0000000000000096 R09: 0000000000000000
R10: 0000000000000b37 R11: ffff88086227f73e R12: ffffffff818a9195
R13: 0000000000001368 R14: ffffffff81651b17 R15: 0000000000000009
FS: 0000000000000000(0000) GS:ffff88087fa60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8301965000 CR3: 0000000001984000 CR4: 00000000001407e0
Stack:
ffff88086f00e5e8 0000000000000002 ffff8a07fffb4000 0000000040000000
0000000000000002 0000000000000002 ffff88086227faa8 ffffffff8107410a
ffff88086227fb38 ffffffff81651b17 0000000000000296 ffff88087f402120
Call Trace:
[<ffffffff8107410a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81651b17>] free_area_init_node+0x3fe/0x426
[<ffffffff810b9d22>] ? up+0x32/0x50
[<ffffffff8164ddd0>] hotadd_new_pgdat+0x90/0x110
[<ffffffff8164df24>] add_memory+0xd4/0x200
[<ffffffff813a8329>] acpi_memory_device_add+0x1aa/0x289
[<ffffffff8137b486>] acpi_bus_attach+0xfd/0x204
[<ffffffff8140fd1e>] ? device_register+0x1e/0x30
[<ffffffff8137b501>] acpi_bus_attach+0x178/0x204
[<ffffffff8137b5f7>] acpi_bus_scan+0x6a/0x90
[<ffffffff813786d2>] ? acpi_bus_get_status+0x2d/0x5f
[<ffffffff8137b7a7>] acpi_device_hotplug+0xe8/0x418
[<ffffffff813750d9>] acpi_hotplug_work_fn+0x1f/0x2b
[<ffffffff8108cc5e>] process_one_work+0x14e/0x3f0
[<ffffffff8108d36b>] worker_thread+0x11b/0x510
[<ffffffff8108d250>] ? rescuer_thread+0x350/0x350
[<ffffffff81092af1>] kthread+0xe1/0x100
[<ffffffff81092a10>] ? kthread_create_on_node+0x1b0/0x1b0
[<ffffffff81663efc>] ret_from_fork+0x7c/0xb0
[<ffffffff81092a10>] ? kthread_create_on_node+0x1b0/0x1b0
Code: c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 60 20 89 81 31 c0 e8 5c 26 5e 00 eb 80 48 c7 c7 b8 20 89 81 31 c0 e8 4c 26 5e 00 <0f> 0b 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
RIP [<ffffffff81074001>] warn_slowpath_common+0xc1/0xd0
RSP <ffff88086227fa68>

Thanks,
Yasuaki Ishimatsu

>
>>
>>> --- a/Documentation/kernel-parameters.txt
>>> +++ b/Documentation/kernel-parameters.txt
>>> @@ -553,6 +553,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>> bttv.pll= See Documentation/video4linux/bttv/Insmod-options
>>> bttv.tuner=
>>>
>>> + bug_on_warn BUG() instead of WARN()
>>
>> There's no mention here that this feature is mainly aimed at generating
>> a crash dump. How do we tell the people who aren't reading this email
>> thread (ie: all of humanity except you and me ;)) that this feature
>> even exists? Is there crash dump documentation that we can update?
>>
>
> I'll look into this too.
>
> P.
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>