LinuxLists.cc - OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

2006-12-18 12:25:58

Subject: OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Hi.

I got this oops while suspending:
[ 309.366557] Disabling non-boot CPUs ...
[ 309.386563] CPU 1 is now offline
[ 309.387625] CPU1 is down
[ 309.387704] Stopping tasks ... done.
[ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
[ 310.456669] SMP
[ 310.456814] last sysfs file:
/devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
[ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
[ 310.457259] CPU: 0
[ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
[ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
[ 310.457478] EIP is at shrink_slab+0x9e/0x169
[ 310.457548] eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000
[ 310.457623] esi: 00000000 edi: c18fe500 ebp: f7b3fe3c esp: f7b3fe08
[ 310.457696] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
[ 310.457772] Process swsusp (pid: 3243, ti=f7b3e000 task=f756f030
task.ti=f7b3e000)
[ 310.457845] Stack: c175d8a0 c175daa0 c175db00 00000000 00000000 c053cec0
000045ec 000000d0
[ 310.458286] 00000000 00000000 00001179 00001179 00000000 f7b3fe94
c0151445 00000001
[ 310.458723] f7b3fe64 00001df1 00000002 00000001 00000001 00038000
00000c79 0000117b
[ 310.459199] Call Trace:
[ 310.459334] [<c0103f1b>] show_trace_log_lvl+0x1a/0x30
[ 310.459450] [<c0103fd6>] show_stack_log_lvl+0xa5/0xca
[ 310.459562] [<c01041ce>] show_registers+0x1d3/0x2b8
[ 310.459673] [<c01043d4>] die+0x121/0x243
[ 310.459781] [<c010456c>] do_trap+0x76/0x9c
[ 310.459892] [<c0104bd8>] do_divide_error+0x94/0x9e
[ 310.460001] [<c038a7e4>] error_code+0x7c/0x84
[ 310.460113] [<c0151445>] shrink_all_memory+0x211/0x2eb
[ 310.460225] [<c01418c1>] swsusp_shrink_memory+0x187/0x196
[ 310.460335] [<c0141a07>] prepare_processes+0x35/0xc8
[ 310.460446] [<c0141cce>] pm_suspend_disk+0xd/0x16f
[ 310.460558] [<c0140c87>] enter_state+0x129/0x19b
[ 310.460668] [<c0140d9c>] state_store+0xa3/0xac
[ 310.460777] [<c0198ab0>] subsys_attr_store+0x20/0x25
[ 310.460889] [<c0198b9f>] sysfs_write_file+0x97/0xd8
[ 310.460998] [<c0165262>] vfs_write+0x8b/0x149
[ 310.461108] [<c01658cb>] sys_write+0x3d/0x64
[ 310.461216] [<c0102fe4>] syscall_call+0x7/0xb
[ 310.461328] =======================
[ 310.461397] Code: 31 c0 ff 17 89 c3 8b 45 e4 31 d2 f7 77 0c f7 e3 89 45 d8 89
55 dc 89 d1 89 c6 31 d2 85 c9 74 09 89 c8 31 d2 f7 75 f0 89 c1 89 f0 <f7> 75 f0
89 ca 89 45 d8 89 55 dc 8b 45 d8 03 47 10 89 47 10 85
[ 310.464079] EIP: [<c0150c9a>] shrink_slab+0x9e/0x169 SS:ESP 0068:f7b3fe08
[ 310.464228]

swsusp script is something like this:
echo platform > /sys/power/disk
echo disk > /sys/power/state

regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-12-18 15:44:31

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Hi,

On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> Hi.
>
> I got this oops while suspending:
> [ 309.366557] Disabling non-boot CPUs ...
> [ 309.386563] CPU 1 is now offline
> [ 309.387625] CPU1 is down
> [ 309.387704] Stopping tasks ... done.
> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> [ 310.456669] SMP
> [ 310.456814] last sysfs file:
> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> [ 310.457259] CPU: 0
> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> [ 310.457478] EIP is at shrink_slab+0x9e/0x169

Looks like we have a problem with slab shrinking here.

Could you please use gdb to check what exactly is at shrink_slab+0x9e?

> [ 310.457548] eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000
> [ 310.457623] esi: 00000000 edi: c18fe500 ebp: f7b3fe3c esp: f7b3fe08
> [ 310.457696] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
> [ 310.457772] Process swsusp (pid: 3243, ti=f7b3e000 task=f756f030
> task.ti=f7b3e000)
> [ 310.457845] Stack: c175d8a0 c175daa0 c175db00 00000000 00000000 c053cec0
> 000045ec 000000d0
> [ 310.458286] 00000000 00000000 00001179 00001179 00000000 f7b3fe94
> c0151445 00000001
> [ 310.458723] f7b3fe64 00001df1 00000002 00000001 00000001 00038000
> 00000c79 0000117b
> [ 310.459199] Call Trace:
> [ 310.459334] [<c0103f1b>] show_trace_log_lvl+0x1a/0x30
> [ 310.459450] [<c0103fd6>] show_stack_log_lvl+0xa5/0xca
> [ 310.459562] [<c01041ce>] show_registers+0x1d3/0x2b8
> [ 310.459673] [<c01043d4>] die+0x121/0x243
> [ 310.459781] [<c010456c>] do_trap+0x76/0x9c
> [ 310.459892] [<c0104bd8>] do_divide_error+0x94/0x9e
> [ 310.460001] [<c038a7e4>] error_code+0x7c/0x84
> [ 310.460113] [<c0151445>] shrink_all_memory+0x211/0x2eb
> [ 310.460225] [<c01418c1>] swsusp_shrink_memory+0x187/0x196
> [ 310.460335] [<c0141a07>] prepare_processes+0x35/0xc8
> [ 310.460446] [<c0141cce>] pm_suspend_disk+0xd/0x16f
> [ 310.460558] [<c0140c87>] enter_state+0x129/0x19b
> [ 310.460668] [<c0140d9c>] state_store+0xa3/0xac
> [ 310.460777] [<c0198ab0>] subsys_attr_store+0x20/0x25
> [ 310.460889] [<c0198b9f>] sysfs_write_file+0x97/0xd8
> [ 310.460998] [<c0165262>] vfs_write+0x8b/0x149
> [ 310.461108] [<c01658cb>] sys_write+0x3d/0x64
> [ 310.461216] [<c0102fe4>] syscall_call+0x7/0xb
> [ 310.461328] =======================
> [ 310.461397] Code: 31 c0 ff 17 89 c3 8b 45 e4 31 d2 f7 77 0c f7 e3 89 45 d8 89
> 55 dc 89 d1 89 c6 31 d2 85 c9 74 09 89 c8 31 d2 f7 75 f0 89 c1 89 f0 <f7> 75 f0
> 89 ca 89 45 d8 89 55 dc 8b 45 d8 03 47 10 89 47 10 85
> [ 310.464079] EIP: [<c0150c9a>] shrink_slab+0x9e/0x169 SS:ESP 0068:f7b3fe08
> [ 310.464228]
>
> swsusp script is something like this:
> echo platform > /sys/power/disk
> echo disk > /sys/power/state
>
> regards,

Greetings,
Rafael

--
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King

2006-12-18 17:03:07

by Jiri Slaby

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Rafael J. Wysocki wrote:
> Hi,
>
> On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
>> Hi.
>>
>> I got this oops while suspending:
>> [ 309.366557] Disabling non-boot CPUs ...
>> [ 309.386563] CPU 1 is now offline
>> [ 309.387625] CPU1 is down
>> [ 309.387704] Stopping tasks ... done.
>> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
>> [ 310.456669] SMP
>> [ 310.456814] last sysfs file:
>> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
>> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
>> [ 310.457259] CPU: 0
>> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
>> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
>> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
>
> Looks like we have a problem with slab shrinking here.
>
> Could you please use gdb to check what exactly is at shrink_slab+0x9e?

Sure, but not till Friday, sorry (I am away).

>> [ 310.457548] eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000
>> [ 310.457623] esi: 00000000 edi: c18fe500 ebp: f7b3fe3c esp: f7b3fe08
>> [ 310.457696] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
>> [ 310.457772] Process swsusp (pid: 3243, ti=f7b3e000 task=f756f030
>> task.ti=f7b3e000)
>> [ 310.457845] Stack: c175d8a0 c175daa0 c175db00 00000000 00000000 c053cec0
>> 000045ec 000000d0
>> [ 310.458286] 00000000 00000000 00001179 00001179 00000000 f7b3fe94
>> c0151445 00000001
>> [ 310.458723] f7b3fe64 00001df1 00000002 00000001 00000001 00038000
>> 00000c79 0000117b
>> [ 310.459199] Call Trace:
>> [ 310.459334] [<c0103f1b>] show_trace_log_lvl+0x1a/0x30
>> [ 310.459450] [<c0103fd6>] show_stack_log_lvl+0xa5/0xca
>> [ 310.459562] [<c01041ce>] show_registers+0x1d3/0x2b8
>> [ 310.459673] [<c01043d4>] die+0x121/0x243
>> [ 310.459781] [<c010456c>] do_trap+0x76/0x9c
>> [ 310.459892] [<c0104bd8>] do_divide_error+0x94/0x9e
>> [ 310.460001] [<c038a7e4>] error_code+0x7c/0x84
>> [ 310.460113] [<c0151445>] shrink_all_memory+0x211/0x2eb
>> [ 310.460225] [<c01418c1>] swsusp_shrink_memory+0x187/0x196
>> [ 310.460335] [<c0141a07>] prepare_processes+0x35/0xc8
>> [ 310.460446] [<c0141cce>] pm_suspend_disk+0xd/0x16f
>> [ 310.460558] [<c0140c87>] enter_state+0x129/0x19b
>> [ 310.460668] [<c0140d9c>] state_store+0xa3/0xac
>> [ 310.460777] [<c0198ab0>] subsys_attr_store+0x20/0x25
>> [ 310.460889] [<c0198b9f>] sysfs_write_file+0x97/0xd8
>> [ 310.460998] [<c0165262>] vfs_write+0x8b/0x149
>> [ 310.461108] [<c01658cb>] sys_write+0x3d/0x64
>> [ 310.461216] [<c0102fe4>] syscall_call+0x7/0xb
>> [ 310.461328] =======================
>> [ 310.461397] Code: 31 c0 ff 17 89 c3 8b 45 e4 31 d2 f7 77 0c f7 e3 89 45 d8 89
>> 55 dc 89 d1 89 c6 31 d2 85 c9 74 09 89 c8 31 d2 f7 75 f0 89 c1 89 f0 <f7> 75 f0
>> 89 ca 89 45 d8 89 55 dc 8b 45 d8 03 47 10 89 47 10 85
>> [ 310.464079] EIP: [<c0150c9a>] shrink_slab+0x9e/0x169 SS:ESP 0068:f7b3fe08
>> [ 310.464228]
>>
>> swsusp script is something like this:
>> echo platform > /sys/power/disk
>> echo disk > /sys/power/state

regards,
--
http://www.fi.muni.cz/~xslaby/ Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-12-18 20:59:55

by Andrew Morton

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Mon, 18 Dec 2006 18:02:20 +0100
Jiri Slaby <[email protected]> wrote:

> Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> >> Hi.
> >>
> >> I got this oops while suspending:
> >> [ 309.366557] Disabling non-boot CPUs ...
> >> [ 309.386563] CPU 1 is now offline
> >> [ 309.387625] CPU1 is down
> >> [ 309.387704] Stopping tasks ... done.
> >> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> >> [ 310.456669] SMP
> >> [ 310.456814] last sysfs file:
> >> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> >> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> >> [ 310.457259] CPU: 0
> >> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> >> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> >> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
> >
> > Looks like we have a problem with slab shrinking here.
> >
> > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
>
> Sure, but not till Friday, sorry (I am away).

I think there's only one divide in there which can do this, so...

--- a/mm/vmscan.c~shrink_slab-handle-bad-shrinkers
+++ a/mm/vmscan.c
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/init.h>
#include <linux/highmem.h>
+#include <linux/kallsyms.h>
#include <linux/vmstat.h>
#include <linux/file.h>
#include <linux/writeback.h>
@@ -190,7 +191,13 @@ unsigned long shrink_slab(unsigned long
unsigned long total_scan;
unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);

- delta = (4 * scanned) / shrinker->seeks;
+ if (!shrinker->seeks) {
+ print_symbol("shrinker %s has zero seeks\n",
+ (unsigned long)shrinker->shrinker);
+ delta = (4 * scanned) / DEFAULT_SEEKS;
+ } else {
+ delta = (4 * scanned) / shrinker->seeks;
+ }
delta *= max_pass;
do_div(delta, lru_pages + 1);
shrinker->nr += delta;
_

A quick grep shows that all set_shrinker() callers are doing the right
thing, so something kooky has happened.

2006-12-18 22:36:33

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Monday, 18 December 2006 18:02, Jiri Slaby wrote:
> Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> >> Hi.
> >>
> >> I got this oops while suspending:
> >> [ 309.366557] Disabling non-boot CPUs ...
> >> [ 309.386563] CPU 1 is now offline
> >> [ 309.387625] CPU1 is down
> >> [ 309.387704] Stopping tasks ... done.
> >> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> >> [ 310.456669] SMP
> >> [ 310.456814] last sysfs file:
> >> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> >> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> >> [ 310.457259] CPU: 0
> >> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> >> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> >> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
> >
> > Looks like we have a problem with slab shrinking here.
> >
> > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
>
> Sure, but not till Friday, sorry (I am away).

I reproduced this on one box, but then it turned out that EIP was at line 195
of mm/vmscan.c where there was

do_div(delta, lru_pages + 1);

Well, I have no idea how this can lead to a divide error (lru_pages is
unsigned).

I'm unable to reproduce this on another i386 box, so it seems to be somewhat
configuration specific.

Does 2.6.20-rc1 work for you?

Rafael

--
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King

2006-12-18 22:44:41

by Nigel Cunningham

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Hi.

On Mon, 2006-12-18 at 23:38 +0100, Rafael J. Wysocki wrote:
> On Monday, 18 December 2006 18:02, Jiri Slaby wrote:
> > Rafael J. Wysocki wrote:
> > > Hi,
> > >
> > > On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> > >> Hi.
> > >>
> > >> I got this oops while suspending:
> > >> [ 309.366557] Disabling non-boot CPUs ...
> > >> [ 309.386563] CPU 1 is now offline
> > >> [ 309.387625] CPU1 is down
> > >> [ 309.387704] Stopping tasks ... done.
> > >> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> > >> [ 310.456669] SMP
> > >> [ 310.456814] last sysfs file:
> > >> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> > >> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> > >> [ 310.457259] CPU: 0
> > >> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> > >> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> > >> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
> > >
> > > Looks like we have a problem with slab shrinking here.
> > >
> > > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
> >
> > Sure, but not till Friday, sorry (I am away).
>
> I reproduced this on one box, but then it turned out that EIP was at line 195
> of mm/vmscan.c where there was
>
> do_div(delta, lru_pages + 1);
>
> Well, I have no idea how this can lead to a divide error (lru_pages is
> unsigned).
>
> I'm unable to reproduce this on another i386 box, so it seems to be somewhat
> configuration specific.
>
> Does 2.6.20-rc1 work for you?

I have a patch in -mm that reduces lru_pages by what shrink_all_zones
returns. Could shrink_all_zones perhaps be returning incorrect values
such that lru_pages ends up becoming -1?

Regards,

Nigel

2006-12-18 23:07:10

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Hi,

On Monday, 18 December 2006 23:44, Nigel Cunningham wrote:
> Hi.
>
> On Mon, 2006-12-18 at 23:38 +0100, Rafael J. Wysocki wrote:
> > On Monday, 18 December 2006 18:02, Jiri Slaby wrote:
> > > Rafael J. Wysocki wrote:
> > > > Hi,
> > > >
> > > > On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> > > >> Hi.
> > > >>
> > > >> I got this oops while suspending:
> > > >> [ 309.366557] Disabling non-boot CPUs ...
> > > >> [ 309.386563] CPU 1 is now offline
> > > >> [ 309.387625] CPU1 is down
> > > >> [ 309.387704] Stopping tasks ... done.
> > > >> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> > > >> [ 310.456669] SMP
> > > >> [ 310.456814] last sysfs file:
> > > >> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> > > >> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> > > >> [ 310.457259] CPU: 0
> > > >> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> > > >> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> > > >> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
> > > >
> > > > Looks like we have a problem with slab shrinking here.
> > > >
> > > > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
> > >
> > > Sure, but not till Friday, sorry (I am away).
> >
> > I reproduced this on one box, but then it turned out that EIP was at line 195
> > of mm/vmscan.c where there was
> >
> > do_div(delta, lru_pages + 1);
> >
> > Well, I have no idea how this can lead to a divide error (lru_pages is
> > unsigned).
> >
> > I'm unable to reproduce this on another i386 box, so it seems to be somewhat
> > configuration specific.
> >
> > Does 2.6.20-rc1 work for you?
>
> I have a patch in -mm that reduces lru_pages by what shrink_all_zones
> returns. Could shrink_all_zones perhaps be returning incorrect values
> such that lru_pages ends up becoming -1?

I don't think so, but look at the appended patch. ;-)

Greetings,
Rafael

---
Fix a (really bad) typo in shrink_all_memory().

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.20-rc1-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc1-mm1.orig/mm/vmscan.c
+++ linux-2.6.20-rc1-mm1/mm/vmscan.c
@@ -1569,7 +1569,7 @@ unsigned long shrink_all_memory(unsigned
sc.swap_cluster_max = nr_pages - ret;
freed = shrink_all_zones(nr_to_scan, prio, pass, &sc);
ret += freed;
- lru_pages =- freed;
+ lru_pages -= freed;
nr_to_scan = nr_pages - ret;
if (ret >= nr_pages)
goto out;

2006-12-18 23:16:38

by Nigel Cunningham

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

Hi.

On Tue, 2006-12-19 at 00:09 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> On Monday, 18 December 2006 23:44, Nigel Cunningham wrote:
> > Hi.
> >
> > On Mon, 2006-12-18 at 23:38 +0100, Rafael J. Wysocki wrote:
> > > On Monday, 18 December 2006 18:02, Jiri Slaby wrote:
> > > > Rafael J. Wysocki wrote:
> > > > > Hi,
> > > > >
> > > > > On Monday, 18 December 2006 12:20, Jiri Slaby wrote:
> > > > >> Hi.
> > > > >>
> > > > >> I got this oops while suspending:
> > > > >> [ 309.366557] Disabling non-boot CPUs ...
> > > > >> [ 309.386563] CPU 1 is now offline
> > > > >> [ 309.387625] CPU1 is down
> > > > >> [ 309.387704] Stopping tasks ... done.
> > > > >> [ 310.030991] Shrinking memory... -<0>divide error: 0000 [#1]
> > > > >> [ 310.456669] SMP
> > > > >> [ 310.456814] last sysfs file:
> > > > >> /devices/pci0000:00/0000:00:1e.0/0000:02:08.0/eth0/statistics/collisions
> > > > >> [ 310.456919] Modules linked in: eth1394 floppy ohci1394 ide_cd ieee1394 cdrom
> > > > >> [ 310.457259] CPU: 0
> > > > >> [ 310.457260] EIP: 0060:[<c0150c9a>] Not tainted VLI
> > > > >> [ 310.457261] EFLAGS: 00210246 (2.6.20-rc1-mm1 #207)
> > > > >> [ 310.457478] EIP is at shrink_slab+0x9e/0x169
> > > > >
> > > > > Looks like we have a problem with slab shrinking here.
> > > > >
> > > > > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
> > > >
> > > > Sure, but not till Friday, sorry (I am away).
> > >
> > > I reproduced this on one box, but then it turned out that EIP was at line 195
> > > of mm/vmscan.c where there was
> > >
> > > do_div(delta, lru_pages + 1);
> > >
> > > Well, I have no idea how this can lead to a divide error (lru_pages is
> > > unsigned).
> > >
> > > I'm unable to reproduce this on another i386 box, so it seems to be somewhat
> > > configuration specific.
> > >
> > > Does 2.6.20-rc1 work for you?
> >
> > I have a patch in -mm that reduces lru_pages by what shrink_all_zones
> > returns. Could shrink_all_zones perhaps be returning incorrect values
> > such that lru_pages ends up becoming -1?
>
> I don't think so, but look at the appended patch. ;-)
>
> Greetings,
> Rafael
>
>
> ---
> Fix a (really bad) typo in shrink_all_memory().
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6.20-rc1-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.20-rc1-mm1.orig/mm/vmscan.c
> +++ linux-2.6.20-rc1-mm1/mm/vmscan.c
> @@ -1569,7 +1569,7 @@ unsigned long shrink_all_memory(unsigned
> sc.swap_cluster_max = nr_pages - ret;
> freed = shrink_all_zones(nr_to_scan, prio, pass, &sc);
> ret += freed;
> - lru_pages =- freed;
> + lru_pages -= freed;
> nr_to_scan = nr_pages - ret;
> if (ret >= nr_pages)
> goto out;

Heh, yeah.

Definitely acked! :)

Nigel

2006-12-18 23:17:26

by Andrew Morton

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Mon, 18 Dec 2006 23:38:23 +0100
"Rafael J. Wysocki" <[email protected]> wrote:

> > > Looks like we have a problem with slab shrinking here.
> > >
> > > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
> >
> > Sure, but not till Friday, sorry (I am away).
>
> I reproduced this on one box, but then it turned out that EIP was at line 195
> of mm/vmscan.c where there was
>
> do_div(delta, lru_pages + 1);

That implies that we passed it lru_pages=-1.

Presumably the logic in
vmscanc-account-for-memory-already-freed-in-seeking-to.patch caused that.

> Well, I have no idea how this can lead to a divide error (lru_pages is
> unsigned).
>
> I'm unable to reproduce this on another i386 box, so it seems to be somewhat
> configuration specific.
>

There is one wart in shrink_all_memory() and I think we should fix that in
2.6.20.

Please check the below. I'll drop
vmscanc-account-for-memory-already-freed-in-seeking-to.patch. It has other
stuff in it which we might still need. But altering sc->swap_cluster_max
in that manner looks odd.

From: Andrew Morton <[email protected]>

At the end of shrink_all_memory() we forget to recalculate lru_pages: it can
be zero.

Fix that up, and add a helper function for this operation too.

Also, recalculate lru_pages each time around the inner loop to get the
balancing correct.

Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/vmscan.c | 33 ++++++++++++++++-----------------
1 files changed, 16 insertions(+), 17 deletions(-)

diff -puN mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling mm/vmscan.c
--- a/mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling
+++ a/mm/vmscan.c
@@ -1484,6 +1484,16 @@ static unsigned long shrink_all_zones(un
return ret;
}

+static unsigned long count_lru_pages(void)
+{
+ struct zone *zone;
+ unsigned long ret = 0;
+
+ for_each_zone(zone);
+ ret += zone->nr_active + zone->nr_inactive;
+ return ret;
+}
+
/*
* Try to free `nr_pages' of memory, system-wide, and return the number of
* freed pages.
@@ -1498,7 +1508,6 @@ unsigned long shrink_all_memory(unsigned
unsigned long ret = 0;
int pass;
struct reclaim_state reclaim_state;
- struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_swap = 0,
@@ -1509,10 +1518,7 @@ unsigned long shrink_all_memory(unsigned

current->reclaim_state = &reclaim_state;

- lru_pages = 0;
- for_each_zone(zone)
- lru_pages += zone->nr_active + zone->nr_inactive;
-
+ lru_pages = count_lru_pages();
nr_slab = global_page_state(NR_SLAB_RECLAIMABLE);
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
@@ -1539,13 +1545,6 @@ unsigned long shrink_all_memory(unsigned
for (pass = 0; pass < 5; pass++) {
int prio;

- /* Needed for shrinking slab caches later on */
- if (!lru_pages)
- for_each_zone(zone) {
- lru_pages += zone->nr_active;
- lru_pages += zone->nr_inactive;
- }
-
/* Force reclaiming mapped pages in the passes #3 and #4 */
if (pass > 2) {
sc.may_swap = 1;
@@ -1561,7 +1560,8 @@ unsigned long shrink_all_memory(unsigned
goto out;

reclaim_state.reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, sc.gfp_mask, lru_pages);
+ shrink_slab(sc.nr_scanned, sc.gfp_mask,
+ count_lru_pages());
ret += reclaim_state.reclaimed_slab;
if (ret >= nr_pages)
goto out;
@@ -1569,20 +1569,19 @@ unsigned long shrink_all_memory(unsigned
if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
congestion_wait(WRITE, HZ / 10);
}
-
- lru_pages = 0;
}

/*
* If ret = 0, we could not shrink LRUs, but there may be something
* in slab caches
*/
- if (!ret)
+ if (!ret) {
do {
reclaim_state.reclaimed_slab = 0;
- shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+ shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages());
ret += reclaim_state.reclaimed_slab;
} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
+ }

out:
current->reclaim_state = NULL;
_

2006-12-19 00:50:29

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Tuesday, 19 December 2006 00:17, Andrew Morton wrote:
> On Mon, 18 Dec 2006 23:38:23 +0100
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > > > Looks like we have a problem with slab shrinking here.
> > > >
> > > > Could you please use gdb to check what exactly is at shrink_slab+0x9e?
> > >
> > > Sure, but not till Friday, sorry (I am away).
> >
> > I reproduced this on one box, but then it turned out that EIP was at line 195
> > of mm/vmscan.c where there was
> >
> > do_div(delta, lru_pages + 1);
>
> That implies that we passed it lru_pages=-1.
>
> Presumably the logic in
> vmscanc-account-for-memory-already-freed-in-seeking-to.patch caused that.
>
> > Well, I have no idea how this can lead to a divide error (lru_pages is
> > unsigned).
> >
> > I'm unable to reproduce this on another i386 box, so it seems to be somewhat
> > configuration specific.
> >
>
> There is one wart in shrink_all_memory() and I think we should fix that in
> 2.6.20.
>
> Please check the below.

Fine by me.

> I'll drop vmscanc-account-for-memory-already-freed-in-seeking-to.patch. It
> has other stuff in it which we might still need. But altering
> sc->swap_cluster_max in that manner looks odd.

Agreed.

Greetings,
Rafael

2006-12-19 01:18:40

by David Rientjes

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Mon, 18 Dec 2006, Andrew Morton wrote:

> diff -puN mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling mm/vmscan.c
> --- a/mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling
> +++ a/mm/vmscan.c
> @@ -1484,6 +1484,16 @@ static unsigned long shrink_all_zones(un
> return ret;
> }
>
> +static unsigned long count_lru_pages(void)
> +{
> + struct zone *zone;
> + unsigned long ret = 0;
> +
> + for_each_zone(zone);
> + ret += zone->nr_active + zone->nr_inactive;
> + return ret;
> +}
> +
> /*
> * Try to free `nr_pages' of memory, system-wide, and return the number of
> * freed pages.

There's an extra semicolon there that results in only the final zone being
used.

David

2006-12-19 01:28:52

by Andrew Morton

[permalink] [raw]

Subject: Re: [linux-pm] OOPS: divide error while s2dsk (2.6.20-rc1-mm1)

On Mon, 18 Dec 2006 17:18:12 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Mon, 18 Dec 2006, Andrew Morton wrote:
>
> > diff -puN mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling mm/vmscan.c
> > --- a/mm/vmscan.c~shrink_all_memory-fix-lru_pages-handling
> > +++ a/mm/vmscan.c
> > @@ -1484,6 +1484,16 @@ static unsigned long shrink_all_zones(un
> > return ret;
> > }
> >
> > +static unsigned long count_lru_pages(void)
> > +{
> > + struct zone *zone;
> > + unsigned long ret = 0;
> > +
> > + for_each_zone(zone);
> > + ret += zone->nr_active + zone->nr_inactive;
> > + return ret;
> > +}
> > +
> > /*
> > * Try to free `nr_pages' of memory, system-wide, and return the number of
> > * freed pages.
>
> There's an extra semicolon there

Sigh. coding-while-diseased.

> that results in only the final zone being
> used.
>

Actually it'll go oops. Fixed, thanks.