2009-01-07 00:12:39

by Justin P. Mattock

Subject: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

After pulling from git today, I'm unable to shut the machine down completely
(the system just sits there with this message on the screen):

* will now halt
[ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
[ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
[ 286.550598] Oops: 0002 [#1] SMP
[ 286.552206] last sysfs file: /sys/block/sda/removable
[ 286.553844] Modules linked in: hidp radeon drm agpgart btusb rfcomm
bnep sco l2cap bluetooth fan battery container ipt_LOG xt_limit
xt_tcpudp xt_state ipt_addrtype nf_nat_irc nf_conntrack_irc nf_nat_ftp
nf_nat nf_conntrack_ftp ipmi_watchdog ipmi_msghandler uvcvideo
isight_firmware uinput arpt_mangle arptable_filter arp_tables
nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_mangle
iptable_filter ip_tables x_tables coretemp eeprom acpi_cpufreq
cpufreq_powersave cpufreq_performance cpufreq_ondemand
cpufreq_conservative appletouch snd_hda_codec_idt ohci1394 ehci_hcd
snd_hda_intel snd_hda_codec thermal ath9k uhci_hcd ieee1394 joydev
pata_acpi snd_hwdep snd_pcm snd_page_alloc video ac button processor
applesmc evdev
[ 286.560580]
[ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
#1) MacBookPro2,2
[ 286.560580] EIP: 0060:[<c0150ca4>] EFLAGS: 00010293 CPU: 0
[ 286.560580] EIP: is at __stop_machine+0x88/0xe3
[ 286.560580] EAX: 6b6b6b6b EBX: 00000000 ECX: 6b6b6b6b EDX: 00000000
[ 286.560580] ESI: c054abe0 EDI: c03d03a4 EBP: f1a29e54 ESP: f1a29e44
[ 286.560580] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
task.ti=f1a28000)
[ 286.560580] Stack:
[ 286.560580] f1a29e60 c054abe0 00000001 00000010 f1a29e7c c03d04e4
ffffffea 00000010
[ 286.560580] 00000001 00000003 00000022 00000001 4321fedc c054abe0
f1a29e94 c012a57e
[ 286.560580] 00000000 ffffffff 4321fedc 28121969 f1a29e9c c01360c0
f1a29fb0 c0136301
[ 286.560580] Call Trace:
[ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
[ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
[ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
[ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
[ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
[ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
[ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
[ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
[ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
[ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
[ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
[ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
[ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
[ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
[ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
[ 286.560580] Code: c7 05 10 06 62 c0 00 00 00 00 a3 f4 05 62 c0 c7 05
ec 05 62 c0 01 00 00 00 83 cb ff eb 2d a1 1c 06 62 c0 f7 d0 8b 0c 98 8d
41 04 <c7> 01 00 00 00 00 89 41 04 89 41 08 c7 41 0c ff 0c 15 c0 89 d8
[ 286.560580] EIP: [<c0150ca4>] __stop_machine+0x88/0xe3 SS:ESP
0068:f1a29e44
[ 286.639215] ---[ end trace 5b080c1ab14203ae ] ---
Segmentation fault

After this message appears, if I hold down the power button,
the system shuts off after a few seconds.
(BTW, hopefully the numbers are correct;
writing this down manually is a bit of a pain.)
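
For anyone reading the trace: the "Oops: 0002" value is the x86 page-fault
error code. Below is a minimal sketch of decoding its low three bits; the
struct and function names are made up for illustration and are not kernel
interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the low three bits of an x86 page-fault error code. */
struct pf_info {
    bool protection_fault; /* bit 0: set = protection violation, clear = page not present */
    bool write;            /* bit 1: set = write access, clear = read access */
    bool user_mode;        /* bit 2: set = fault raised in user mode */
};

/* Hypothetical helper: split the error code into its flag bits. */
struct pf_info decode_pf_error(uint32_t code)
{
    struct pf_info info = {
        .protection_fault = (code & 0x1) != 0,
        .write            = (code & 0x2) != 0,
        .user_mode        = (code & 0x4) != 0,
    };
    return info;
}
```

With code 0x0002 only the write bit is set, i.e. a kernel-mode write to a
page that is not present, which fits a store through the bogus address
6b6b6b6b.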

regards;

Justin P. Mattock


2009-01-07 06:48:35

by Pekka Enberg

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

Hi Justin,

On Wed, Jan 7, 2009 at 2:12 AM, Justin P. Mattock
<[email protected]> wrote:
> With pulling git today I'm unable to shut the machine down completely.
> (the system just sits there with the message on the screen);
>
> * will now halt
> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

That looks like a use-after-free in __stop_machine(), so let's cc Rusty.
If you want, you can convert the oops location into human-readable
form. Just search for "GDB" in Documentation/BUG-HUNTING for
instructions on how to do that. And don't forget to send your .config.
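
The hint here is the address itself: 0x6b is the byte the slab debugging
code writes over freed objects (POISON_FREE in include/linux/poison.h), so
a pointer that reads 6b6b6b6b was most likely loaded from freed, poisoned
memory. A small sketch of that check follows; the helper name is
hypothetical, and it only says something when slab poisoning is enabled in
the kernel config.

```c
#include <stdbool.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* same value as in include/linux/poison.h */

/* Hypothetical helper: true if all four bytes equal the free-poison byte. */
bool is_free_poison32(uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        if (((value >> (8 * i)) & 0xff) != POISON_FREE)
            return false;
    }
    return true;
}
```

Applied to the register dump, EAX and ECX (both 6b6b6b6b) match the
pattern, while ordinary kernel addresses such as ESI (c054abe0) do not.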

> [...]

2009-01-07 08:13:41

by Justin P. Mattock

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

Pekka Enberg wrote:
> Hi Justin,
>
> On Wed, Jan 7, 2009 at 2:12 AM, Justin P. Mattock
> <[email protected]> wrote:
>
>> With pulling git today I'm unable to shut the machine down completely.
>> (the system just sits there with the message on the screen);
>>
>> * will now halt
>> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
>>
>
> That looks like use-after-free in __stop_machine() so lets cc Rusty.
> If you want, you can convert the oops location into human-readable
> form. Just search for "GDB" in Documentation/BUG-HUNTING for
> instructions how to do that. And don't forget to send your .config.
>
>> [...]
That's nice, thanks for the info.
I like the idea of using gdb to disassemble
this text.
Since I don't know what I'm doing yet,
I'll have to do my homework on this
(really curious to see what this does).

regards;

Justin P. Mattock


2009-01-07 08:31:16

by Pekka Enberg

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, Jan 7, 2009 at 8:48 AM, Pekka Enberg <[email protected]> wrote:
> On Wed, Jan 7, 2009 at 2:12 AM, Justin P. Mattock
> <[email protected]> wrote:
>> With pulling git today I'm unable to shut the machine down completely.
>> (the system just sits there with the message on the screen);
>>
>> * will now halt
>> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
>
> That looks like use-after-free in __stop_machine() so lets cc Rusty.
> If you want, you can convert the oops location into human-readable
> form. Just search for "GDB" in Documentation/BUG-HUNTING for
> instructions how to do that. And don't forget to send your .config.
>
>> [...]

scripts/decodecode gives us:

   0:   c7 05 10 06 62 c0 00    movl   $0x0,0xc0620610
   7:   00 00 00
   a:   a3 f4 05 62 c0          mov    %eax,0xc06205f4
   f:   c7 05 ec 05 62 c0 01    movl   $0x1,0xc06205ec
  16:   00 00 00
  19:   83 cb ff                or     $0xffffffff,%ebx
  1c:   eb 2d                   jmp    0x4b
  1e:   a1 1c 06 62 c0          mov    0xc062061c,%eax
  23:   f7 d0                   not    %eax
  25:   8b 0c 98                mov    (%eax,%ebx,4),%ecx
  28:   8d 41 04                lea    0x4(%ecx),%eax
  2b:   c7 01 00 00 00 00       movl   $0x0,(%ecx)
  31:   89 41 04                mov    %eax,0x4(%ecx)
  34:   89 41 08                mov    %eax,0x8(%ecx)
  37:   c7 41 0c ff 0c 15 c0    movl   $0xc0150cff,0xc(%ecx)
  3e:   89 d8                   mov    %ebx,%eax

   0:   c7 01 00 00 00 00       movl   $0x0,(%ecx)      <-- oops
   6:   89 41 04                mov    %eax,0x4(%ecx)
   9:   89 41 08                mov    %eax,0x8(%ecx)
   c:   c7 41 0c ff 0c 15 c0    movl   $0xc0150cff,0xc(%ecx)
  13:   89 d8                   mov    %ebx,%eax
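
The faulting instruction is the byte that the oops wraps in <..> on its
"Code:" line. As a rough sketch of how that position can be found by hand
(this is not how scripts/decodecode is implemented, and the function name
is made up), one can count the hex bytes that precede the marker:

```c
#include <ctype.h>

/*
 * Scan the hex dump from an oops "Code:" line and return the number of
 * bytes that precede the entry wrapped in <..>, or -1 if there is no marker.
 */
long faulting_offset(const char *code)
{
    long offset = 0;
    const char *p = code;

    while (*p != '\0') {
        if (*p == '<')
            return offset;
        if (isxdigit((unsigned char)p[0]) && isxdigit((unsigned char)p[1])) {
            offset++;   /* consumed one two-digit hex byte */
            p += 2;
        } else {
            p++;        /* skip whitespace and anything else */
        }
    }
    return -1;
}
```

For the dump in this report, 43 bytes come before <c7>, which is 0x2b, the
offset of the movl $0x0,(%ecx) line in the listing above.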

objdump -S -d kernel/stop_machine.o looks like this on my machine:

        /* Schedule the stop_cpu work on all cpus: hold this CPU so one
         * doesn't hit this CPU until we're ready. */
        get_cpu();
        for_each_online_cpu(i) {
                sm_work = percpu_ptr(stop_machine_work, i);
                INIT_WORK(sm_work, stop_cpu);
  8b:   c7 01 00 00 00 00       movl   $0x0,(%ecx)
  91:   c7 41 0c e0 00 00 00    movl   $0xe0,0xc(%ecx)

where

#define INIT_WORK(_work, _func)                                         \
        do {                                                            \
                (_work)->data = (atomic_long_t) WORK_DATA_INIT();       \

and the offset of ->data is zero and WORK_DATA_INIT() expands to
ATOMIC_LONG_INIT(0), so it looks to me like 'sm_work' is used after it has
been freed. Perhaps stop_machine_destroy() was called before or in
parallel to __stop_machine()? So let's cc Heiko as well.
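
To make the offset argument concrete, here is a mock of the layout using
fixed 32-bit fields. It is a stand-in for illustration only, not the
kernel's struct work_struct, and it assumes a 32-bit build without extra
debug fields such as lockdep maps.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of a list_head with 32-bit pointers; NOT the kernel definition. */
struct mock_list_head {
    uint32_t next;
    uint32_t prev;
};

/* Mock of the work item layout on i386; NOT the kernel definition. */
struct mock_work_struct {
    uint32_t data;               /* stands in for atomic_long_t */
    struct mock_list_head entry; /* list linkage */
    uint32_t func;               /* stands in for the work function pointer */
};

size_t mock_data_offset(void)  { return offsetof(struct mock_work_struct, data); }
size_t mock_entry_offset(void) { return offsetof(struct mock_work_struct, entry); }
size_t mock_func_offset(void)  { return offsetof(struct mock_work_struct, func); }
```

The offsets come out as 0, 4 and 12, which lines up with the stores to
(%ecx), 0x4(%ecx)/0x8(%ecx) and 0xc(%ecx) in the decoded oops: the very
first store of the initialization is the one that faults.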

Pekka

2009-01-07 09:15:56

by Heiko Carstens

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, Jan 07, 2009 at 10:30:56AM +0200, Pekka Enberg wrote:
> On Wed, Jan 7, 2009 at 8:48 AM, Pekka Enberg <[email protected]> wrote:
> > On Wed, Jan 7, 2009 at 2:12 AM, Justin P. Mattock
> > <[email protected]> wrote:
> >> With pulling git today I'm unable to shut the machine down completely.
> >> (the system just sits there with the message on the screen);
[...]
> >> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
> >
> > That looks like use-after-free in __stop_machine() so lets cc Rusty.
> > If you want, you can convert the oops location into human-readable
> > form. Just search for "GDB" in Documentation/BUG-HUNTING for
> > instructions how to do that. And don't forget to send your .config.
[...]
> >> [ 286.560580] EIP: is at __stop_machine+0x88/0xe3
> >> [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
> >> [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
> >> [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
> >> [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
> >> [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
> >> [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
> >> [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
> >> [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
> >> [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
> >> [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
> >> [ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
> >> [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
> >> [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
> >> [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
> >> [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
[...]
> and offset of ->data is zero and WORK_DATA_INIT() expands to
> ATOMIC_LONG_INIT(0) so looks to me like 'sm_work' is used after it has
> been free'd. Perhaps stop_machine_destroy() was called before or in
> parallel to __stop_machine()? So lets cc Heiko as well.

I forgot to convert disable_nonboot_cpus to stop_machine_create/destroy.
So it's a use-before-even-allocated bug.

The patch below should hopefully fix it:

Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus

From: Heiko Carstens <[email protected]>

disable_nonboot_cpus calls _cpu_down directly. _cpu_down, however, relies on
stop_machine kernel threads that were created in advance, which should be done
by the caller (like cpu_down does).

So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
as well.

Fixes this bug:

[ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
[ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
[ 286.550598] Oops: 0002 [#1] SMP
[ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
[ 286.560580] EIP: is at __stop_machine+0x88/0xe3
[ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
[ 286.560580] Call Trace:
[ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
[ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
[ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
[ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
[ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
[ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
[ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
[ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
[ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
[ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
[ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
[ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
[ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
[ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
[ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34

Reported-by: "Justin P. Mattock" <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
---
kernel/cpu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -379,8 +379,11 @@ static cpumask_var_t frozen_cpus;
 
 int disable_nonboot_cpus(void)
 {
-	int cpu, first_cpu, error = 0;
+	int cpu, first_cpu, error;
 
+	error = stop_machine_create();
+	if (error)
+		return error;
 	cpu_maps_update_begin();
 	first_cpu = cpumask_first(cpu_online_mask);
 	/* We take down all of the non-boot CPUs in one shot to avoid races
@@ -409,6 +412,7 @@ int disable_nonboot_cpus(void)
 		printk(KERN_ERR "Non-boot CPUs are not disabled\n");
 	}
 	cpu_maps_update_done();
+	stop_machine_destroy();
 	return error;
 }
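
The pairing that the patch restores can be pictured with a userspace
analogy. This is only a sketch: every demo_* name is hypothetical, a plain
int array stands in for the per-CPU work items, and the locking that real
code would need is left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the shared work items and their user count. */
static int *demo_work;    /* NULL until demo_create() has succeeded */
static int demo_refcount;

/* Allocate the shared resource on first use; later callers only take a reference. */
int demo_create(void)
{
    if (demo_refcount == 0) {
        demo_work = calloc(4, sizeof(*demo_work));
        if (demo_work == NULL)
            return -ENOMEM;
    }
    demo_refcount++;
    return 0;
}

/* Drop one reference; free the resource when the last user is gone. */
void demo_destroy(void)
{
    if (demo_refcount > 0 && --demo_refcount == 0) {
        free(demo_work);
        demo_work = NULL;
    }
}

/* Refuses to touch the resource unless some caller holds a reference. */
int demo_run(void)
{
    if (demo_work == NULL)
        return -EINVAL;
    demo_work[0] = 0;     /* analogous to the INIT_WORK() store */
    return 0;
}
```

Here demo_run() returns -EINVAL when nobody called demo_create() first. The
kernel path in this report has no such check, so the missing create call
shows up as a store through a stale pointer instead of an error code.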

2009-01-07 09:19:25

by Pekka Enberg

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, 2009-01-07 at 10:15 +0100, Heiko Carstens wrote:
> I missed to convert disable_nonboot_cpus to
> stop_machine_create/destroy.
> So it's a use before-even-allocated bug.
>
> The patch below should hopefully fix it:
>
> Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus
>
> From: Heiko Carstens <[email protected]>
>
> disable_nonboot_cpus calls directly _cpu_down. _cpu_down however relies on
> the in advanced created stop_machine kernel threads which should be created
> by the caller (like cpu_down does).
>
> So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
> as well.
>
> Fixes this bug:
>
> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
> [ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
> [ 286.550598] Oops: 0002 [#1] SMP
> [ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
> [ 286.560580] EIP: is at __stop_machine+0x88/0xe3
> [ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
> [ 286.560580] Call Trace:
> [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
> [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
> [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
> [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
> [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
> [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
> [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
> [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
> [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
> [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
> [ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
> [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
> [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
> [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
> [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
>
> Reported-by: "Justin P. Mattock" <[email protected]>
> Signed-off-by: Heiko Carstens <[email protected]>

Looks good to me!

Reviewed-by: Pekka Enberg <[email protected]>

> [...]

2009-01-07 11:37:13

by Jeff Chua

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, Jan 7, 2009 at 5:19 PM, Pekka Enberg <[email protected]> wrote:
> On Wed, 2009-01-07 at 10:15 +0100, Heiko Carstens wrote:
>> I missed to convert disable_nonboot_cpus to
>> stop_machine_create/destroy.
>> So it's a use before-even-allocated bug.
>> The patch below should hopefully fix it:
>> Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus
> Looks good to me!

This also fixes the suspend-to-ram/disk problem. Without it, the
system will just hang.

Thanks for the fix.

Thanks,
Jeff.

2009-01-07 12:27:40

by Heiko Carstens

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, Jan 07, 2009 at 07:36:57PM +0800, Jeff Chua wrote:
> On Wed, Jan 7, 2009 at 5:19 PM, Pekka Enberg <[email protected]> wrote:
> > On Wed, 2009-01-07 at 10:15 +0100, Heiko Carstens wrote:
> >> I missed to convert disable_nonboot_cpus to
> >> stop_machine_create/destroy.
> >> So it's a use before-even-allocated bug.
> >> The patch below should hopefully fix it:
> >> Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus
> > Looks good to me!
>
> This also fixes the suspend-to-ram/disk problem. Without it, the
> system will just hang.
>
> Thanks for the fix.

Did you also see the reboot problem and does the patch fix it for you?

2009-01-07 13:51:23

by Jeff Chua

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

On Wed, Jan 7, 2009 at 8:27 PM, Heiko Carstens
<[email protected]> wrote:
> On Wed, Jan 07, 2009 at 07:36:57PM +0800, Jeff Chua wrote:
>> On Wed, Jan 7, 2009 at 5:19 PM, Pekka Enberg <[email protected]> wrote:
>> > On Wed, 2009-01-07 at 10:15 +0100, Heiko Carstens wrote:
>> >> I missed to convert disable_nonboot_cpus to
>> >> stop_machine_create/destroy.
>> >> So it's a use before-even-allocated bug.
>> >> The patch below should hopefully fix it:
>> >> Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus
>> > Looks good to me!
>>
>> This also fixes the suspend-to-ram/disk problem. Without it, the
>> system will just hang.

> Did you also see the reboot problem and does the patch fix it for you?

I never had a problem with rebooting, just with suspend hanging, which is
really annoying: I walk away, come back hours later, and realize that
suspend has hung and I have to do a hard reboot.

Thanks,
Jeff.

2009-01-07 15:19:59

by Heiko Carstens

Subject: [PATCH] stop_machine/cpu hotplug: fix disable_nonboot_cpus

From: Heiko Carstens <[email protected]>

disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the
caller already created the stop_machine workqueue (like cpu_down does).
Otherwise a call to stop_machine will lead to accesses to random memory
regions.

When introducing this new interface (9ea09af3bd3090e8349ca2899ca2011bd94cda85
"stop_machine: introduce stop_machine_create/destroy") I missed the second
call site of _cpu_down.
So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
as well.

Fixes suspend-to-ram/disk and also this bug:

[ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
[ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
[ 286.550598] Oops: 0002 [#1] SMP
[ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
[ 286.560580] EIP: is at __stop_machine+0x88/0xe3
[ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
[ 286.560580] Call Trace:
[ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
[ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
[ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
[ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
[ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
[ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
[ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
[ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
[ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
[ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
[ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
[ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
[ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
[ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
[ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34

Reported-by: "Justin P. Mattock" <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
---
kernel/cpu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -379,8 +379,11 @@ static cpumask_var_t frozen_cpus;
 
 int disable_nonboot_cpus(void)
 {
-	int cpu, first_cpu, error = 0;
+	int cpu, first_cpu, error;
 
+	error = stop_machine_create();
+	if (error)
+		return error;
 	cpu_maps_update_begin();
 	first_cpu = cpumask_first(cpu_online_mask);
 	/* We take down all of the non-boot CPUs in one shot to avoid races
@@ -409,6 +412,7 @@ int disable_nonboot_cpus(void)
 		printk(KERN_ERR "Non-boot CPUs are not disabled\n");
 	}
 	cpu_maps_update_done();
+	stop_machine_destroy();
 	return error;
 }

2009-01-07 15:24:14

by Ingo Molnar

Subject: Re: [PATCH] stop_machine/cpu hotplug: fix disable_nonboot_cpus


* Heiko Carstens <[email protected]> wrote:

> From: Heiko Carstens <[email protected]>
>
> disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the
> caller already created the stop_machine workqueue (like cpu_down does).
> Otherwise a call to stop_machine will lead to accesses to random memory
> regions.

btw., i got this crash earlier today:

CPU0 attaching sched-domain:
domain 0: span 0-1 level CPU
groups: 0 1
CPU1 attaching sched-domain:
domain 0: span 0-1 level CPU
groups: 1 0
eth0: no IPv6 routers present
BUG: Bad page state in process cc1 pfn:00879
page:c101b894 flags:00000400 count:0 mapcount:0 mapping:(null) index:0
Pid: 3060, comm: cc1 Not tainted 2.6.28-tip-07641-gb97d41d-dirty #14985
Call Trace:
[<c016ce8b>] bad_page+0xcf/0xe5
[<c016d3b4>] free_pages_check+0xa7/0xc5
[<c016d400>] free_hot_cold_page+0x2e/0x138
[<c014751c>] ? __lock_acquire+0x127/0x29d
[<c016d558>] free_hot_page+0xf/0x11
[<c0170963>] put_page+0x76/0x7c
[<c0185071>] ? constant_test_bit+0x9/0x20
[<c0187149>] kfree+0x30/0xe5
[<c0164993>] ? trace_hardirqs_on+0x8/0x1c
[<c01547dd>] free_user_ns+0x1d/0x20
[<c01547c0>] ? free_user_ns+0x0/0x20
[<c02c7a41>] kref_put+0x18/0x22
[<c0132d4c>] put_user_ns+0x16/0x18
[<c0132f52>] free_uid+0x59/0xc8
[<c0136239>] ? groups_free+0x36/0x3a
[<c0140406>] put_cred_rcu+0x5f/0x70
[<c01598fb>] __rcu_process_callbacks+0x168/0x1f8
[<c03031be>] ? isicom_tx+0x0/0x31f
[<c01599b1>] rcu_process_callbacks+0x26/0x46
[<c012f11d>] __do_softirq+0x9d/0x139
[<c012f080>] ? __do_softirq+0x0/0x139
<IRQ> [<c012efe2>] ? irq_exit+0x4c/0x83
[<c05cc586>] ? __irqentry_text_start+0x6e/0x7c
[<c0103f61>] ? apic_timer_interrupt+0x2d/0x34

and i applied your patch (from the other thread) and never saw this bug
again.

So if it's the same bug (it appears to be) then you have my:

Tested-by: Ingo Molnar <[email protected]>


Ingo

2009-01-07 15:29:17

by Justin P. Mattock

Subject: Re: [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b

Jeff Chua wrote:
> On Wed, Jan 7, 2009 at 8:27 PM, Heiko Carstens
> <[email protected]> wrote:
>
>> On Wed, Jan 07, 2009 at 07:36:57PM +0800, Jeff Chua wrote:
>>
>>> On Wed, Jan 7, 2009 at 5:19 PM, Pekka Enberg <[email protected]> wrote:
>>>
>>>> On Wed, 2009-01-07 at 10:15 +0100, Heiko Carstens wrote:
>>>>
>>>>> I missed to convert disable_nonboot_cpus to
>>>>> stop_machine_create/destroy.
>>>>> So it's a use before-even-allocated bug.
>>>>> The patch below should hopefully fix it:
>>>>> Subject: [PATCH] cpu hotplug: add stop_machine_create/destroy to disable_nonboot_cpus
>>>>>
>>>> Looks good to me!
>>>>
>>> This also fixes the suspend-to-ram/disk problem. Without it, the
>>> system will just hang.
>>>
>
>
>> Did you also see the reboot problem and does the patch fix it for you?
>>
>
> I never had problem with rebooting. Just suspend hanging which is
> really annoying ... walking away and come back hours later and
> realized that the suspend is hanging and having to do a hard boot.
>
> Thanks,
> Jeff.
>
>
Man!! I missed this whole conversation
(passed out, too tired).
I'll go ahead and apply the patch and let you
know if I still get the freeze at shutdown.
I'm also still interested in learning
how to take a debug message and dissect
it to find the exact location of the problem
(but that can wait until later).

regards;

Justin P. Mattock

2009-01-07 15:31:14

by Frederic Weisbecker

Subject: Re: [PATCH] stop_machine/cpu hotplug: fix disable_nonboot_cpus

2009/1/7 Heiko Carstens <[email protected]>:
> From: Heiko Carstens <[email protected]>
>
> disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the
> caller already created the stop_machine workqueue (like cpu_down does).
> Otherwise a call to stop_machine will lead to accesses to random memory
> regions.
>
> When introducing this new interface (9ea09af3bd3090e8349ca2899ca2011bd94cda85
> "stop_machine: introduce stop_machine_create/destroy") I missed the second
> call site of _cpu_down.
> So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
> as well.
>
> Fixes suspend-to-ram/disk and also this bug:
>
> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
> [ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
> [ 286.550598] Oops: 0002 [#1] SMP
> [ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
> [ 286.560580] EIP: is at __stop_machine+0x88/0xe3
> [ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
> [ 286.560580] Call Trace:
> [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
> [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
> [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
> [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
> [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
> [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
> [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
> [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
> [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
> [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
> [ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
> [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
> [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
> [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
> [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
>
> Reported-by: "Justin P. Mattock" <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> Signed-off-by: Heiko Carstens <[email protected]>
> ---
> kernel/cpu.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/cpu.c
> ===================================================================
> --- linux-2.6.orig/kernel/cpu.c
> +++ linux-2.6/kernel/cpu.c
> @@ -379,8 +379,11 @@ static cpumask_var_t frozen_cpus;
>
> int disable_nonboot_cpus(void)
> {
> - int cpu, first_cpu, error = 0;
> + int cpu, first_cpu, error;
>
> + error = stop_machine_create();
> + if (error)
> + return error;
> cpu_maps_update_begin();
> first_cpu = cpumask_first(cpu_online_mask);
> /* We take down all of the non-boot CPUs in one shot to avoid races
> @@ -409,6 +412,7 @@ int disable_nonboot_cpus(void)
> printk(KERN_ERR "Non-boot CPUs are not disabled\n");
> }
> cpu_maps_update_done();
> + stop_machine_destroy();
> return error;
> }
>


That should explain why suspend to disk failed on my box yesterday on
the processors stage...
Thanks!

2009-01-07 15:52:42

by Justin P. Mattock

Subject: Re: [PATCH] stop_machine/cpu hotplug: fix disable_nonboot_cpus

Frédéric Weisbecker wrote:
> 2009/1/7 Heiko Carstens <[email protected]>:
>
>> From: Heiko Carstens <[email protected]>
>>
>> disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the
>> caller already created the stop_machine workqueue (like cpu_down does).
>> Otherwise a call to stop_machine will lead to accesses to random memory
>> regions.
>>
>> When introducing this new interface (9ea09af3bd3090e8349ca2899ca2011bd94cda85
>> "stop_machine: introduce stop_machine_create/destroy") I missed the second
>> call site of _cpu_down.
>> So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
>> as well.
>>
>> Fixes suspend-to-ram/disk and also this bug:
>>
>> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
>> [ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
>> [ 286.550598] Oops: 0002 [#1] SMP
>> [ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
>> [ 286.560580] EIP: is at __stop_machine+0x88/0xe3
>> [ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
>> [ 286.560580] Call Trace:
>> [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
>> [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
>> [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
>> [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
>> [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
>> [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
>> [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
>> [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
>> [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
>> [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
>> [ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
>> [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
>> [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
>> [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
>> [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
>>
>> Reported-by: "Justin P. Mattock" <[email protected]>
>> Reviewed-by: Pekka Enberg <[email protected]>
>> Signed-off-by: Heiko Carstens <[email protected]>
>> ---
>> kernel/cpu.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/kernel/cpu.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/cpu.c
>> +++ linux-2.6/kernel/cpu.c
>> @@ -379,8 +379,11 @@ static cpumask_var_t frozen_cpus;
>>
>> int disable_nonboot_cpus(void)
>> {
>> - int cpu, first_cpu, error = 0;
>> + int cpu, first_cpu, error;
>>
>> + error = stop_machine_create();
>> + if (error)
>> + return error;
>> cpu_maps_update_begin();
>> first_cpu = cpumask_first(cpu_online_mask);
>> /* We take down all of the non-boot CPUs in one shot to avoid races
>> @@ -409,6 +412,7 @@ int disable_nonboot_cpus(void)
>> printk(KERN_ERR "Non-boot CPUs are not disabled\n");
>> }
>> cpu_maps_update_done();
>> + stop_machine_destroy();
>> return error;
>> }
>>
>>
>
>
> That should explain why suspend to disk failed on my box yesterday on
> the processors stage...
> Thanks!
>
>
O.K., I applied the patch
and shut down the machine
a few times; no freeze, no bug message.
Sweet!!
Now I'm going to try to dismantle
a bug message for educational purposes.
Thanks for the assistance.

regards;

Justin P. Mattock

2009-01-08 05:13:43

by Justin P. Mattock

Subject: Re: [PATCH] stop_machine/cpu hotplug: fix disable_nonboot_cpus

Frédéric Weisbecker wrote:
> 2009/1/7 Heiko Carstens <[email protected]>:
>
>> From: Heiko Carstens <[email protected]>
>>
>> disable_nonboot_cpus calls _cpu_down. But _cpu_down requires that the
>> caller already created the stop_machine workqueue (like cpu_down does).
>> Otherwise a call to stop_machine will lead to accesses to random memory
>> regions.
>>
>> When introducing this new interface (9ea09af3bd3090e8349ca2899ca2011bd94cda85
>> "stop_machine: introduce stop_machine_create/destroy") I missed the second
>> call site of _cpu_down.
>> So add the missing stop_machine_create/destroy calls to disable_nonboot_cpus
>> as well.
>>
>> Fixes suspend-to-ram/disk and also this bug:
>>
>> [ 286.547348] BUG: unable to handle kernel paging request at 6b6b6b6b
>> [ 286.548940] IP: [<c0150ca4>] __stop_machine+0x88/0xe3
>> [ 286.550598] Oops: 0002 [#1] SMP
>> [ 286.560580] Pid: 3273, comm: halt Not tainted (2.6.28-06127-g238c6d5
>> [ 286.560580] EIP: is at __stop_machine+0x88/0xe3
>> [ 286.560580] Process halt (pid: 3273, ti=f1a28000 task=f4530f30
>> [ 286.560580] Call Trace:
>> [ 286.560580] [<c03d04e4>] ? _cpu_down+0x10f/0x234
>> [ 286.560580] [<c012a57e>] ? disable_nonboot_cpus+0x58/0xdc
>> [ 286.560580] [<c01360c0>] ? kernel_poweroff+0x22/0x39
>> [ 286.560580] [<c0136301>] ? sys_reboot+0xde/0x14c
>> [ 286.560580] [<c01331b2>] ? complete_signal+0x179/0x191
>> [ 286.560580] [<c0133396>] ? send_signal+0x1cc/0x1e1
>> [ 286.560580] [<c03de418>] ? _spin_unlock_irqrestore+0x2d/0x3c
>> [ 286.560580] [<c0133b65>] ? group_send_signal_info+0x58/0x61
>> [ 286.560580] [<c0133b9e>] ? kill_pid_info+0x30/0x3a
>> [ 286.560580] [<c0133d49>] ? sys_kill+0x75/0x13a
>> [ 286.560580] [<c01a06cb>] ? mntput_no_expire+0x1f/0x101
>> [ 286.560580] [<c019b3b3>] ? dput+0x1e/0x105
>> [ 286.560580] [<c018ef87>] ? __fput+0x150/0x158
>> [ 286.560580] [<c0157abf>] ? audit_syscall_entry+0x137/0x159
>> [ 286.560580] [<c010329f>] ? sysenter_do_call+0x12/0x34
>>
>> Reported-by: "Justin P. Mattock" <[email protected]>
>> Reviewed-by: Pekka Enberg <[email protected]>
>> Signed-off-by: Heiko Carstens <[email protected]>
>> ---
>> kernel/cpu.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6/kernel/cpu.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/cpu.c
>> +++ linux-2.6/kernel/cpu.c
>> @@ -379,8 +379,11 @@ static cpumask_var_t frozen_cpus;
>>
>> int disable_nonboot_cpus(void)
>> {
>> - int cpu, first_cpu, error = 0;
>> + int cpu, first_cpu, error;
>>
>> + error = stop_machine_create();
>> + if (error)
>> + return error;
>> cpu_maps_update_begin();
>> first_cpu = cpumask_first(cpu_online_mask);
>> /* We take down all of the non-boot CPUs in one shot to avoid races
>> @@ -409,6 +412,7 @@ int disable_nonboot_cpus(void)
>> printk(KERN_ERR "Non-boot CPUs are not disabled\n");
>> }
>> cpu_maps_update_done();
>> + stop_machine_destroy();
>> return error;
>> }
>>
>>
>
>
> That should explain why suspend to disk failed on my box yesterday on
> the processors stage...
> Thanks!
>
>
I hate to ask this, but I'm going to
anyway:
when running
gdb /usr/src/linux/vmlinux
(hoping to see if gdb will catch the bug),
I keep getting:
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
You can't do that without a process to debug.

If I do a:
(gdb) disassemble __stop_machine
(as described in Documentation),
I see a bit of info.

How do I start, or figure out, a process
to debug? E.g. the bug message
that I wrote down says Pid: 3273;
entering that as (gdb) r 3273
results in a SIGKILL.

regards;

Justin P. Mattock
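
For reference, a rough sketch of how an oops entry such as __stop_machine+0x88/0xe3 breaks down: the text before '+' is the function, the first hex number is the offset of the faulting instruction within that function, and the second is the function's total size. The Pid in the report only names the process that was running at the time (here, halt). vmlinux is the kernel image, not an ordinary program, so "r 3273" just tries to execute it with 3273 as an argument. With a vmlinux built with debug info, "list *(__stop_machine+0x88)" at the (gdb) prompt should map the offset to a source line. The C helper below is hypothetical (parse_trace_loc and struct trace_loc are not kernel or gdb APIs) and only splits such an entry into its parts.

```c
#include <stdio.h>

/* Hypothetical helper type: the parts of "symbol+0xOFF/0xSIZE". */
struct trace_loc {
	char symbol[64];	/* function name */
	unsigned long offset;	/* offset of the instruction in the function */
	unsigned long size;	/* total size of the function */
};

/* Returns 0 on success, -1 if the text is not in the expected form. */
int parse_trace_loc(const char *text, struct trace_loc *out)
{
	/* %63[^+] reads the symbol up to '+'; %lx accepts an optional 0x prefix. */
	if (sscanf(text, "%63[^+]+%lx/%lx",
		   out->symbol, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}
```

Given "__stop_machine+0x88/0xe3", the helper fills in the symbol __stop_machine, offset 0x88 and size 0xe3; symbol plus offset is the pair to hand to gdb.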