2004-04-01 16:57:34

by Olof Johansson

[permalink] [raw]
Subject: Oops in get_boot_pages at reboot

Hi,

We've started to see Oopses when rebooting some ppc64 systems built with
CONFIG_NUMA after the distribute-early-allocations-across-nodes patch went
in. This happens on the way down at a reboot:

Please stand by while rebooting the system...
md: stopping all md devices.
md: md0 switched to read-only mode.
Oops: Exception in kernel mode, sig: 4 [#1]
SMP NR_CPUS=32 NUMA
NIP: C00000000046D424 XER: 0000000020000000 LR: C0000000000D36CC
REGS: c000000033f8f6b0 TRAP: 0700 Tainted: GF (2.6.5-rc2-ames)
MSR: 8000000000089032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK: c0000000018ff240[1] 'init' THREAD: c000000033f8c000 CPU: 7
GPR00: C0000000000D36CC C000000033F8F930 C000000000605100 00000000000000D0
GPR04: 0000000000000000 C000000033F8FAC0 0000000000000000 0000000000000000
GPR08: C000000033F8FAC0 0000000000001000 C00000003374B680 C000000000603000
GPR12: C0000000078A3BB0 C00000000049E000 0000000000000000 0000000000000400
GPR16: 0000000000000000 0000000000000000 C00000002FEBEB28 C00000002FEBEB18
GPR20: C00000002FEBEB20 C00000002FEBEB08 C00000002FEBEB10 C00000002FEBEB18
GPR24: 000000000000000B 0000000000000000 0000000000000400 C00000003374B680
GPR28: C000000033F8FAC0 C000000033ABFE80 C000000000549008 0000000000000000
NIP [c00000000046d424] .get_boot_pages+0x0/0x1ac
LR [c0000000000d36cc] .__pollwait+0x5c/0x108
Call Trace:
[c0000000000c9984] .pipe_poll+0xc4/0xd0
[c0000000000d3bac] .do_select+0x334/0x400
[c000000000020828] .sys32_select+0x440/0x654
[c000000000020a50] .ppc32_select+0x14/0x28
[c000000000011bdc] .ret_from_syscall_1+0x0/0xa4

So __pollwait() calls __get_free_page(), system_running is 0 so
get_boot_pages is called. Since get_boot_pages is labeled __init, badness
happens.

How about checking against mem_init_done instead of system_running? It
helps against the oops, but there might be some good reason not to do
it. I don't claim to know the intrisic details about the MM. :-)

I'm not sure yet why we don't see this on all kernels at all times. Most
likely seems to be that it's a timing issue with init doing select (it
seems to use it to sleep). Either way, the test of system_running still
seems risky.


Thanks,

Olof

===== mm/page_alloc.c 1.190 vs edited =====
--- 1.190/mm/page_alloc.c Sat Mar 20 05:56:20 2004
+++ edited/mm/page_alloc.c Wed Mar 31 18:22:48 2004
@@ -732,9 +732,10 @@
fastcall unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order)
{
struct page * page;
-
#ifdef CONFIG_NUMA
- if (unlikely(!system_running))
+ extern int mem_init_done;
+
+ if (unlikely(!mem_init_done))
return get_boot_pages(gfp_mask, order);
#endif
page = alloc_pages(gfp_mask, order);





2004-04-01 17:12:20

by Manfred Spraul

[permalink] [raw]
Subject: Re: Oops in get_boot_pages at reboot

Olof Johansson wrote:

>So __pollwait() calls __get_free_page(), system_running is 0 so
>get_boot_pages is called. Since get_boot_pages is labeled __init, badness
>happens.
>
>
I didn't notice that the reboot code resets system_running to 0, sorry.

>How about checking against mem_init_done instead of system_running? It
>helps against the oops, but there might be some good reason not to do
>it.
>
mem_init_done is ppc specific. Is there an equivalent to system_running
that is not set to 0 during reboot?

--
Manfred

2004-04-01 17:37:58

by Olof Johansson

[permalink] [raw]
Subject: Re: Oops in get_boot_pages at reboot

Manfred Spraul wrote:

> mem_init_done is ppc specific. Is there an equivalent to system_running
> that is not set to 0 during reboot?

Doh, my bad. :( I'm not able to spot any other suitable variable myself.

system_running is checked in call_usermodehelper(), so keeping it at 1
across devices_shutdown() might cause problems(?).


-Olof

--
Olof Johansson Office: 4F005/905
Linux on Power Development IBM Systems Group
Email: [email protected] Phone: 512-838-9858
All opinions are my own and not those of IBM

2004-04-01 18:57:18

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in get_boot_pages at reboot

Olof Johansson <[email protected]> wrote:
>
> So __pollwait() calls __get_free_page(), system_running is 0 so
> get_boot_pages is called. Since get_boot_pages is labeled __init, badness
> happens.
>
> How about checking against mem_init_done instead of system_running? It
> helps against the oops, but there might be some good reason not to do
> it. I don't claim to know the intrisic details about the MM. :-)

Do we really need to clear system_running at reboot? I'd always viewed it
as "everything is initialised and usable", and that's generally true at
reboot time.

2004-04-02 01:26:53

by Olof Johansson

[permalink] [raw]
Subject: Re: Oops in get_boot_pages at reboot

On Thu, 1 Apr 2004, Andrew Morton wrote:

> Do we really need to clear system_running at reboot? I'd always viewed it
> as "everything is initialised and usable", and that's generally true at
> reboot time.

Seems like it was added explicitly to avoid user mode helpers from being
invoked during early boot and shutdown/reboot:

http://linux.bkbits.net:8080/linux-2.5/cset@3d5c548dNM0Kqg9aP03NhtxXmntmKQ

Well, how about making system_running a state instead of a flag, and
defining states SYSTEM_BOOTING/RUNNING/SHUTDOWN? See patch below.


-Olof


Olof Johansson Office: 4F005/905
Linux on Power Development IBM Systems Group
Email: [email protected] Phone: 512-838-9858
All opinions are my own and not those of IBM


===== arch/ppc/platforms/pmac_nvram.c 1.12 vs edited =====
--- 1.12/arch/ppc/platforms/pmac_nvram.c Wed Feb 4 23:20:40 2004
+++ edited/arch/ppc/platforms/pmac_nvram.c Thu Apr 1 19:13:02 2004
@@ -154,11 +154,11 @@
struct adb_request req;
DECLARE_COMPLETION(req_complete);

- req.arg = system_running ? &req_complete : NULL;
+ req.arg = system_running == SYSTEM_RUNNING ? &req_complete : NULL;
if (pmu_request(&req, pmu_nvram_complete, 3, PMU_READ_NVRAM,
(addr >> 8) & 0xff, addr & 0xff))
return 0xff;
- if (system_running)
+ if (system_running == SYSTEM_RUNNING)
wait_for_completion(&req_complete);
while (!req.complete)
pmu_poll();
@@ -170,11 +170,11 @@
struct adb_request req;
DECLARE_COMPLETION(req_complete);

- req.arg = system_running ? &req_complete : NULL;
+ req.arg = system_running == SYSTEM_RUNNING ? &req_complete : NULL;
if (pmu_request(&req, pmu_nvram_complete, 4, PMU_WRITE_NVRAM,
(addr >> 8) & 0xff, addr & 0xff, val))
return;
- if (system_running)
+ if (system_running == SYSTEM_RUNNING)
wait_for_completion(&req_complete);
while (!req.complete)
pmu_poll();
===== include/linux/kernel.h 1.47 vs edited =====
--- 1.47/include/linux/kernel.h Wed Mar 10 10:45:49 2004
+++ edited/include/linux/kernel.h Thu Apr 1 19:11:16 2004
@@ -109,9 +109,15 @@
extern void bust_spinlocks(int yes);
extern int oops_in_progress; /* If set, an oops, panic(), BUG() or die() is in progress */
extern int panic_on_oops;
-extern int system_running;
+extern int system_running; /* See values below */
extern int tainted;
extern const char *print_tainted(void);
+
+/* Values used for system_running */
+#define SYSTEM_BOOTING 0
+#define SYSTEM_RUNNING 1
+#define SYSTEM_SHUTDOWN 2
+
#define TAINT_PROPRIETARY_MODULE (1<<0)
#define TAINT_FORCED_MODULE (1<<1)
#define TAINT_UNSAFE_SMP (1<<2)
===== init/main.c 1.127 vs edited =====
--- 1.127/init/main.c Tue Mar 16 04:10:35 2004
+++ edited/init/main.c Thu Apr 1 19:11:17 2004
@@ -613,7 +613,7 @@
*/
free_initmem();
unlock_kernel();
- system_running = 1;
+ system_running = SYSTEM_RUNNING;

if (sys_open("/dev/console", O_RDWR, 0) < 0)
printk("Warning: unable to open an initial console.\n");
===== kernel/kmod.c 1.36 vs edited =====
--- 1.36/kernel/kmod.c Wed Feb 25 04:31:13 2004
+++ edited/kernel/kmod.c Thu Apr 1 19:11:17 2004
@@ -249,7 +249,7 @@
};
DECLARE_WORK(work, __call_usermodehelper, &sub_info);

- if (!system_running)
+ if (system_running != SYSTEM_RUNNING)
return -EBUSY;

if (path[0] == '\0')
===== kernel/printk.c 1.35 vs edited =====
--- 1.35/kernel/printk.c Mon Mar 8 18:57:46 2004
+++ edited/kernel/printk.c Thu Apr 1 19:11:17 2004
@@ -522,7 +522,8 @@
log_level_unknown = 1;
}

- if (!cpu_online(smp_processor_id()) && !system_running) {
+ if (!cpu_online(smp_processor_id()) &&
+ system_running != SYSTEM_RUNNING) {
/*
* Some console drivers may assume that per-cpu resources have
* been allocated. So don't allow them to be called by this
===== kernel/sched.c 1.255 vs edited =====
--- 1.255/kernel/sched.c Thu Mar 18 22:54:57 2004
+++ edited/kernel/sched.c Thu Apr 1 19:11:17 2004
@@ -2982,7 +2982,8 @@
#if defined(in_atomic)
static unsigned long prev_jiffy; /* ratelimiting */

- if ((in_atomic() || irqs_disabled()) && system_running) {
+ if ((in_atomic() || irqs_disabled()) &&
+ system_running == SYSTEM_RUNNING) {
if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
return;
prev_jiffy = jiffies;
===== kernel/sys.c 1.73 vs edited =====
--- 1.73/kernel/sys.c Mon Feb 23 13:46:54 2004
+++ edited/kernel/sys.c Thu Apr 1 19:11:17 2004
@@ -436,7 +436,7 @@
switch (cmd) {
case LINUX_REBOOT_CMD_RESTART:
notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
- system_running = 0;
+ system_running = SYSTEM_SHUTDOWN;
device_shutdown();
printk(KERN_EMERG "Restarting system.\n");
machine_restart(NULL);
@@ -452,7 +452,7 @@

case LINUX_REBOOT_CMD_HALT:
notifier_call_chain(&reboot_notifier_list, SYS_HALT, NULL);
- system_running = 0;
+ system_running = SYSTEM_SHUTDOWN;
device_shutdown();
printk(KERN_EMERG "System halted.\n");
machine_halt();
@@ -462,7 +462,7 @@

case LINUX_REBOOT_CMD_POWER_OFF:
notifier_call_chain(&reboot_notifier_list, SYS_POWER_OFF, NULL);
- system_running = 0;
+ system_running = SYSTEM_SHUTDOWN;
device_shutdown();
printk(KERN_EMERG "Power down.\n");
machine_power_off();
@@ -478,7 +478,7 @@
buffer[sizeof(buffer) - 1] = '\0';

notifier_call_chain(&reboot_notifier_list, SYS_RESTART, buffer);
- system_running = 0;
+ system_running = SYSTEM_SHUTDOWN;
device_shutdown();
printk(KERN_EMERG "Restarting system with command '%s'.\n", buffer);
machine_restart(buffer);
===== mm/page_alloc.c 1.197 vs edited =====
--- 1.197/mm/page_alloc.c Tue Mar 16 20:10:10 2004
+++ edited/mm/page_alloc.c Thu Apr 1 19:11:18 2004
@@ -734,7 +734,7 @@
struct page * page;

#ifdef CONFIG_NUMA
- if (unlikely(!system_running))
+ if (unlikely(system_running == SYSTEM_BOOTING))
return get_boot_pages(gfp_mask, order);
#endif
page = alloc_pages(gfp_mask, order);

2004-04-02 01:42:28

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in get_boot_pages at reboot

Olof Johansson <[email protected]> wrote:
>
> Well, how about making system_running a state instead of a flag, and
> defining states SYSTEM_BOOTING/RUNNING/SHUTDOWN?

Eminently sane, of course. I'll rename it to system_state, to reflect the
new reality and to catch unconverted users.