2022-10-10 15:09:50

by Mel Gorman

Subject: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

Hi Rafael,

I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer dereference
early in boot. It fails to boot in 5 out of 10 attempts and I've
only observed it on one machine so far. Either a revert or the patch below
fixes it, but the patch is unlikely to be the correct fix.

--- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
+++ drivers/rtc/rtc-cmos.c 2022-10-10 15:11:53.211756691 +0200
@@ -1209,7 +1209,7 @@
 	 * Or else, ACPI SCI is enabled during suspend/resume only,
 	 * update rtc irq in that case.
 	 */
-	if (cmos_use_acpi_alarm())
+	if (cmos_use_acpi_alarm() && cmos)
 		cmos_interrupt(0, (void *)cmos->rtc);
 	else {
 		/* Fix me: can we use cmos_interrupt() here as well? */

The boot failure looks like the below. It's not a vanilla kernel, but the
applied patch is not relevant and the failure is known to occur on a
vanilla kernel too. The machine has an E5-2698 v4 CPU plugged into an SGI
C2112-4GP3 platform with an X10DRT-P-Series motherboard.

[ 10.924167][ C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 10.928016][ C1] #PF: supervisor read access in kernel mode
[ 10.928016][ C1] #PF: error_code(0x0000) - not-present page
[ 10.928016][ C1] PGD 0 P4D 0
[ 10.928016][ C1] Oops: 0000 [#1] PREEMPT SMP PTI
[ 10.928016][ C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
[ 10.928016][ C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
[ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 10.928016][ C1] Call Trace:
[ 10.928016][ C1] <IRQ>
[ 10.928016][ C1] acpi_ev_fixed_event_detect+0x14a/0x18c
[ 10.928016][ C1] acpi_ev_sci_xrupt_handler+0x2c/0x6e
[ 10.928016][ C1] acpi_irq+0x18/0x40
[ 10.928016][ C1] __handle_irq_event_percpu+0x3e/0x2d0
[ 10.928016][ C1] handle_irq_event_percpu+0xf/0x40
[ 10.928016][ C1] handle_irq_event+0x34/0x60
[ 10.928016][ C1] handle_fasteoi_irq+0x7b/0x140
[ 10.928016][ C1] __common_interrupt+0x4b/0x100
[ 10.928016][ C1] common_interrupt+0x58/0xa0
[ 10.928016][ C1] </IRQ>
[ 10.928016][ C1] <TASK>
[ 10.928016][ C1] asm_common_interrupt+0x22/0x40
[ 10.928016][ C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
[ 10.928016][ C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
[ 10.928016][ C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
[ 10.928016][ C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
[ 10.928016][ C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
[ 10.928016][ C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
[ 10.928016][ C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
[ 10.928016][ C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
[ 10.928016][ C1] ? rdinit_setup+0x2f/0x2f
[ 10.928016][ C1] ? cmos_do_probe+0x570/0x570
[ 10.928016][ C1] ? cmos_wake_setup.part.9+0x2a/0x120
[ 10.928016][ C1] cmos_pnp_probe+0x6c/0xa0
[ 10.928016][ C1] pnp_device_probe+0x5b/0xb0
[ 10.928016][ C1] ? driver_sysfs_add+0x75/0xe0
[ 10.928016][ C1] really_probe+0x109/0x3e0
[ 10.928016][ C1] ? pm_runtime_barrier+0x4f/0xa0
[ 10.928016][ C1] __driver_probe_device+0x79/0x170
[ 10.928016][ C1] driver_probe_device+0x1f/0xa0
[ 10.928016][ C1] __driver_attach+0x11e/0x180
[ 10.928016][ C1] ? __device_attach_driver+0x110/0x110
[ 10.928016][ C1] bus_for_each_dev+0x79/0xc0
[ 10.928016][ C1] bus_add_driver+0x1ba/0x250
[ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
[ 10.928016][ C1] driver_register+0x5f/0x100
[ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
[ 10.928016][ C1] cmos_init+0x12/0x70
[ 10.928016][ C1] do_one_initcall+0x5b/0x310
[ 10.928016][ C1] ? rcu_read_lock_held_common+0xe/0x50
[ 10.928016][ C1] ? rcu_read_lock_sched_held+0x23/0x80
[ 10.928016][ C1] kernel_init_freeable+0x2b7/0x319
[ 10.928016][ C1] ? rest_init+0x1b0/0x1b0
[ 10.928016][ C1] kernel_init+0x16/0x140
[ 10.928016][ C1] ret_from_fork+0x22/0x30
[ 10.928016][ C1] </TASK>
[ 10.928016][ C1] Modules linked in:
[ 10.928016][ C1] CR2: 0000000000000000
[ 10.928016][ C1] ---[ end trace 0000000000000000 ]---
[ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
[ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 10.928016][ C1] Kernel panic - not syncing: Fatal exception in interrupt
[ 10.928016][ C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 10.928016][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

--
Mel Gorman
SUSE Labs
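
For readers following the trace above: the faulting instruction is marked in the "Code:" line as <48> 8b 75 00. Read as x86-64 machine code, that is a REX.W prefix (48), the opcode for a 64-bit register load from memory (8b), a ModRM byte (75), and an 8-bit displacement of 0. The ModRM byte selects RBP as the base register and RSI as the destination, and the register dump shows RBP as 0, which is consistent with reading cmos->rtc through a NULL cmos pointer. The sketch below only splits the ModRM byte into its fields; decode_modrm() and reg64_name() are hypothetical helper names chosen for this illustration, and the register table assumes no REX.R or REX.B extension, which holds for a prefix of 48.

```c
#include <stdint.h>

/* Fields of an x86-64 ModRM byte. */
struct modrm_fields {
	unsigned mod; /* addressing mode, bits 7..6 */
	unsigned reg; /* register operand, bits 5..3 */
	unsigned rm;  /* base register or r/m operand, bits 2..0 */
};

/* Split a ModRM byte into its three bit fields (hypothetical helper). */
struct modrm_fields decode_modrm(uint8_t byte)
{
	struct modrm_fields f;

	f.mod = (byte >> 6) & 0x3;
	f.reg = (byte >> 3) & 0x7;
	f.rm = byte & 0x7;
	return f;
}

/*
 * 64-bit register names indexed by their 3-bit encoding.
 * Assumes REX.R and REX.B are clear, as they are for prefix 0x48.
 */
const char *reg64_name(unsigned index)
{
	static const char *const names[8] = {
		"rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
	};

	return names[index & 0x7];
}
```

Calling decode_modrm(0x75) gives mod 1 (base register plus 8-bit displacement), reg 6 and rm 5, which reg64_name() maps to rsi and rbp respectively.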


2022-10-10 15:23:08

by Rafael J. Wysocki

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

Hi Mel,

Thanks for the report!

On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <[email protected]> wrote:
>
> Hi Rafael,
>
> I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer dereference
> early in boot. It fails to boot in 5 out of 10 attempts and I've
> only observed it on one machine so far. Either a revert or the patch below
> fixes it, but the patch is unlikely to be the correct fix.
>
> --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> +++ drivers/rtc/rtc-cmos.c 2022-10-10 15:11:53.211756691 +0200
> @@ -1209,7 +1209,7 @@
> * Or else, ACPI SCI is enabled during suspend/resume only,
> * update rtc irq in that case.
> */
> - if (cmos_use_acpi_alarm())
> + if (cmos_use_acpi_alarm() && cmos)
> cmos_interrupt(0, (void *)cmos->rtc);
> else {
> /* Fix me: can we use cmos_interrupt() here as well? */

It looks like I've exposed a race condition there.

Generally speaking, it is misguided to install an event handler that
is not ready to handle the event at that time before making sure that
the event is disabled.

Does the attached patch help?

>
> The boot failure looks like the below. It's not a vanilla kernel, but the
> applied patch is not relevant and the failure is known to occur on a
> vanilla kernel too. The machine has an E5-2698 v4 CPU plugged into an SGI
> C2112-4GP3 platform with an X10DRT-P-Series motherboard.
>
> [ 10.924167][ C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 10.928016][ C1] #PF: supervisor read access in kernel mode
> [ 10.928016][ C1] #PF: error_code(0x0000) - not-present page
> [ 10.928016][ C1] PGD 0 P4D 0
> [ 10.928016][ C1] Oops: 0000 [#1] PREEMPT SMP PTI
> [ 10.928016][ C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
> [ 10.928016][ C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
> [ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
> [ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 10.928016][ C1] Call Trace:
> [ 10.928016][ C1] <IRQ>
> [ 10.928016][ C1] acpi_ev_fixed_event_detect+0x14a/0x18c
> [ 10.928016][ C1] acpi_ev_sci_xrupt_handler+0x2c/0x6e
> [ 10.928016][ C1] acpi_irq+0x18/0x40
> [ 10.928016][ C1] __handle_irq_event_percpu+0x3e/0x2d0
> [ 10.928016][ C1] handle_irq_event_percpu+0xf/0x40
> [ 10.928016][ C1] handle_irq_event+0x34/0x60
> [ 10.928016][ C1] handle_fasteoi_irq+0x7b/0x140
> [ 10.928016][ C1] __common_interrupt+0x4b/0x100
> [ 10.928016][ C1] common_interrupt+0x58/0xa0
> [ 10.928016][ C1] </IRQ>
> [ 10.928016][ C1] <TASK>
> [ 10.928016][ C1] asm_common_interrupt+0x22/0x40
> [ 10.928016][ C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
> [ 10.928016][ C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
> [ 10.928016][ C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
> [ 10.928016][ C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
> [ 10.928016][ C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
> [ 10.928016][ C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
> [ 10.928016][ C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
> [ 10.928016][ C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
> [ 10.928016][ C1] ? rdinit_setup+0x2f/0x2f
> [ 10.928016][ C1] ? cmos_do_probe+0x570/0x570
> [ 10.928016][ C1] ? cmos_wake_setup.part.9+0x2a/0x120
> [ 10.928016][ C1] cmos_pnp_probe+0x6c/0xa0
> [ 10.928016][ C1] pnp_device_probe+0x5b/0xb0
> [ 10.928016][ C1] ? driver_sysfs_add+0x75/0xe0
> [ 10.928016][ C1] really_probe+0x109/0x3e0
> [ 10.928016][ C1] ? pm_runtime_barrier+0x4f/0xa0
> [ 10.928016][ C1] __driver_probe_device+0x79/0x170
> [ 10.928016][ C1] driver_probe_device+0x1f/0xa0
> [ 10.928016][ C1] __driver_attach+0x11e/0x180
> [ 10.928016][ C1] ? __device_attach_driver+0x110/0x110
> [ 10.928016][ C1] bus_for_each_dev+0x79/0xc0
> [ 10.928016][ C1] bus_add_driver+0x1ba/0x250
> [ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
> [ 10.928016][ C1] driver_register+0x5f/0x100
> [ 10.928016][ C1] ? rtc_dev_init+0x34/0x34
> [ 10.928016][ C1] cmos_init+0x12/0x70
> [ 10.928016][ C1] do_one_initcall+0x5b/0x310
> [ 10.928016][ C1] ? rcu_read_lock_held_common+0xe/0x50
> [ 10.928016][ C1] ? rcu_read_lock_sched_held+0x23/0x80
> [ 10.928016][ C1] kernel_init_freeable+0x2b7/0x319
> [ 10.928016][ C1] ? rest_init+0x1b0/0x1b0
> [ 10.928016][ C1] kernel_init+0x16/0x140
> [ 10.928016][ C1] ret_from_fork+0x22/0x30
> [ 10.928016][ C1] </TASK>
> [ 10.928016][ C1] Modules linked in:
> [ 10.928016][ C1] CR2: 0000000000000000
> [ 10.928016][ C1] ---[ end trace 0000000000000000 ]---
> [ 10.928016][ C1] RIP: 0010:rtc_handler+0x73/0xd0
> [ 10.928016][ C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
> [ 10.928016][ C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
> [ 10.928016][ C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
> [ 10.928016][ C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
> [ 10.928016][ C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
> [ 10.928016][ C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
> [ 10.928016][ C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
> [ 10.928016][ C1] FS: 0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
> [ 10.928016][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 10.928016][ C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
> [ 10.928016][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 10.928016][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 10.928016][ C1] Kernel panic - not syncing: Fatal exception in interrupt
> [ 10.928016][ C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 10.928016][ C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
>
> --
> Mel Gorman
> SUSE Labs


Attachments:
rtc-handler-wake-setup-debug.patch (938.00 B)

2022-10-10 18:03:13

by Mel Gorman

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

On Mon, Oct 10, 2022 at 04:47:50PM +0200, Rafael J. Wysocki wrote:
> Hi Mel,
>
> Thanks for the report!
>
> On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <[email protected]> wrote:
> >
> > Hi Rafael,
> >
> > I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> > Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer dereference
> > early in boot. It fails to boot in 5 out of 10 attempts and I've
> > only observed it on one machine so far. Either a revert or the patch below
> > fixes it, but the patch is unlikely to be the correct fix.
> >
> > --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> > +++ drivers/rtc/rtc-cmos.c 2022-10-10 15:11:53.211756691 +0200
> > @@ -1209,7 +1209,7 @@
> > * Or else, ACPI SCI is enabled during suspend/resume only,
> > * update rtc irq in that case.
> > */
> > - if (cmos_use_acpi_alarm())
> > + if (cmos_use_acpi_alarm() && cmos)
> > cmos_interrupt(0, (void *)cmos->rtc);
> > else {
> > /* Fix me: can we use cmos_interrupt() here as well? */
>
> It looks like I've exposed a race condition there.
>
> Generally speaking, it is misguided to install an event handler that
> is not ready to handle the event at that time before making sure that
> the event is disabled.
>
> Does the attached patch help?
>

It failed 3/10 times. That's less than the previous 5/10 failures but I
cannot be certain it helped without running a lot more boot tests. The
failure happens in the same function as before.

--
Mel Gorman
SUSE Labs

2022-10-10 18:49:14

by Rafael J. Wysocki

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

On Mon, Oct 10, 2022 at 7:50 PM Mel Gorman <[email protected]> wrote:
>
> On Mon, Oct 10, 2022 at 04:47:50PM +0200, Rafael J. Wysocki wrote:
> > Hi Mel,
> >
> > Thanks for the report!
> >
> > On Mon, Oct 10, 2022 at 4:25 PM Mel Gorman <[email protected]> wrote:
> > >
> > > Hi Rafael,
> > >
> > > I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
> > > Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer dereference
> > > early in boot. It fails to boot in 5 out of 10 attempts and I've
> > > only observed it on one machine so far. Either a revert or the patch below
> > > fixes it, but the patch is unlikely to be the correct fix.
> > >
> > > --- drivers/rtc/rtc-cmos.c.orig 2022-10-10 15:11:50.335756567 +0200
> > > +++ drivers/rtc/rtc-cmos.c 2022-10-10 15:11:53.211756691 +0200
> > > @@ -1209,7 +1209,7 @@
> > > * Or else, ACPI SCI is enabled during suspend/resume only,
> > > * update rtc irq in that case.
> > > */
> > > - if (cmos_use_acpi_alarm())
> > > + if (cmos_use_acpi_alarm() && cmos)
> > > cmos_interrupt(0, (void *)cmos->rtc);
> > > else {
> > > /* Fix me: can we use cmos_interrupt() here as well? */
> >
> > It looks like I've exposed a race condition there.
> >
> > Generally speaking, it is misguided to install an event handler that
> > is not ready to handle the event at that time before making sure that
> > the event is disabled.
> >
> > Does the attached patch help?
> >
>
> It failed 3/10 times.

This is still not acceptable.

> That's less than the previous 5/10 failures but I
> cannot be certain it helped without running a lot more boot tests. The
> failure happens in the same function as before.

I've overlooked the fact that acpi_install_fixed_event_handler()
enables the event on success, so it is a bug to call it when the
handler is not ready.

It should help to only enable the event after running cmos_do_probe()
where the driver data pointer is set, so please try the attached
patch.


Attachments:
rtc-handler-wake-setup-debug.patch (1.72 kB)
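
The attached patch is not reproduced in this thread. The ordering problem described in the message above can still be illustrated with a minimal userspace sketch. It is a model under stated assumptions, not the driver code: every name in it is hypothetical, the "install" step enables the event as the message says acpi_install_fixed_event_handler() does on success, and an event is assumed to be already pending when the handler is installed.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only. Every name below is hypothetical and merely
 * mirrors the roles discussed in the thread; none is a kernel API.
 */

enum handler_result { NOTHING_RAN = 0, HANDLED = 1, WOULD_OOPS = -1 };

struct model_rtc {
	int irq_count;
};

struct model_dev {
	struct model_rtc *drvdata; /* NULL until the probe step has run */
};

typedef enum handler_result (*event_handler_t)(void *context);

static event_handler_t installed_handler;
static void *installed_context;
static bool event_enabled;
static bool event_pending;

/* Start from a clean state with one event already pending (assumption). */
static void model_reset(void)
{
	installed_handler = NULL;
	installed_context = NULL;
	event_enabled = false;
	event_pending = true;
}

/* As described in the thread: installing the handler also enables the event. */
static void model_install_handler(event_handler_t fn, void *context)
{
	installed_handler = fn;
	installed_context = context;
	event_enabled = true;
}

/* Deliver the pending event if it is enabled and a handler is installed. */
static enum handler_result model_deliver_event(void)
{
	if (!event_enabled || !event_pending || installed_handler == NULL)
		return NOTHING_RAN;
	event_pending = false;
	return installed_handler(installed_context);
}

/* Stand-in for the handler: it needs the driver data set by the probe step. */
static enum handler_result model_rtc_handler(void *context)
{
	struct model_dev *dev = context;

	if (dev->drvdata == NULL)
		return WOULD_OOPS; /* the real handler would dereference NULL here */
	dev->drvdata->irq_count++;
	return HANDLED;
}

/* Stand-in for the probe step that publishes the driver data pointer. */
static void model_probe(struct model_dev *dev, struct model_rtc *rtc)
{
	dev->drvdata = rtc;
}

/* Racy order: the event is live before the driver data exists. */
enum handler_result run_install_then_probe(void)
{
	struct model_rtc rtc = { 0 };
	struct model_dev dev = { NULL };
	enum handler_result result;

	model_reset();
	model_install_handler(model_rtc_handler, &dev);
	result = model_deliver_event(); /* interrupt arrives inside the window */
	model_probe(&dev, &rtc);
	return result;
}

/* Suggested order: publish the driver data first, then enable the event. */
enum handler_result run_probe_then_install(void)
{
	struct model_rtc rtc = { 0 };
	struct model_dev dev = { NULL };

	model_reset();
	model_probe(&dev, &rtc);
	model_install_handler(model_rtc_handler, &dev);
	return model_deliver_event();
}
```

In this model run_install_then_probe() returns WOULD_OOPS and run_probe_then_install() returns HANDLED. In the model the event always arrives inside the window. On real hardware, whether an interrupt lands between the two steps depends on timing, which would fit the intermittent failures reported earlier in the thread.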

2022-10-11 09:39:36

by Mel Gorman

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > It failed 3/10 times.
>
> This is still not acceptable.
>

Agreed.

> > That's less than the previous 5/10 failures but I
> > cannot be certain it helped without running a lot more boot tests. The
> > failure happens in the same function as before.
>
> I've overlooked the fact that acpi_install_fixed_event_handler()
> enables the event on success, so it is a bug to call it when the
> handler is not ready.
>
> It should help to only enable the event after running cmos_do_probe()
> where the driver data pointer is set, so please try the attached
> patch.

Looks good and it booted 10 times successfully.

--
Mel Gorman
SUSE Labs

2022-12-12 18:48:01

by Rafael J. Wysocki

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

On Mon, Dec 12, 2022 at 7:25 PM Mathieu Chouquet-Stringer
<[email protected]> wrote:
>
> Hello Rafael,
>
> On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> > On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > > That's less than the previous 5/10 failures but I
> > > > cannot be certain it helped without running a lot more boot tests. The
> > > > failure happens in the same function as before.
> > >
> > > I've overlooked the fact that acpi_install_fixed_event_handler()
> > > enables the event on success, so it is a bug to call it when the
> > > handler is not ready.
> > >
> > > It should help to only enable the event after running cmos_do_probe()
> > > where the driver data pointer is set, so please try the attached
> > > patch.
>
> I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
> looking at the stable tree I see this hasn't been merged... I just got
> bitten by this on 6.0.12.
>
> Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
> 0782b66ed2fb to the 6.0.y tree.

This is fine with me, please send an inclusion request to Greg and the
"stable" list.

by Mathieu Chouquet-Stringer

Subject: Re: Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

Hello Rafael,

On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > That's less than the previous 5/10 failures but I
> > > cannot be certain it helped without running a lot more boot tests. The
> > > failure happens in the same function as before.
> >
> > I've overlooked the fact that acpi_install_fixed_event_handler()
> > enables the event on success, so it is a bug to call it when the
> > handler is not ready.
> >
> > It should help to only enable the event after running cmos_do_probe()
> > where the driver data pointer is set, so please try the attached
> > patch.

I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
looking at the stable tree I see this hasn't been merged... I just got
bitten by this on 6.0.12.

Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
0782b66ed2fb to the 6.0.y tree.

Thank you in advance.

Cheers,

--
Mathieu Chouquet-Stringer [email protected]
The sun itself sees not till heaven clears.
-- William Shakespeare --