2006-03-01 17:02:12

by J M Cerqueira Esteves

Subject: oom-killer: gfp_mask=0xd1 with 2.6.12 on EM64T

Greetings

On a dual EM64T Xeon with 4GB of RAM, I am getting apparently "innocent"
processes killed by oom-killer with gfp_mask=0xd1 (with all or almost
all swap space still available).

This happens when running a couple of Gaussian and other computational
chemistry software processes each using ~ 800MB-1GB of RAM. Sometimes
oom-killer kills one or two of those processes, with the kernel messages
shown below. Unfortunately I don't have a simpler recipe to induce this
behavior... (it may even be triggered only by some Gaussian runs with a
particular set of input parameters; it doesn't happen always).

I haven't tried 2.6.15 kernels yet, but according to recent reports in
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175173
even those may still have oom-killer problems (like this?).

Since I'm not yet familiar with the meaning of much of the data output
in the following kernel messages, could someone suggest some appropriate
course of action to troubleshoot this? Any recommended kernel
versions/patches/settings?


oom-killer: gfp_mask=0xd1
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 62, high 186, batch 31
cpu 0 cold: low 0, high 62, batch 31
cpu 1 hot: low 62, high 186, batch 31
cpu 1 cold: low 0, high 62, batch 31
cpu 2 hot: low 62, high 186, batch 31
cpu 2 cold: low 0, high 62, batch 31
cpu 3 hot: low 62, high 186, batch 31
cpu 3 cold: low 0, high 62, batch 31
HighMem per-cpu: empty
Free pages: 13436kB (0kB HighMem)
Active:396041 inactive:586624 dirty:180807 writeback:0 unstable:0
free:3359 slab:22149 mapped:256439 pagetables:1997
DMA free:24kB min:28kB low:32kB high:40kB active:0kB inactive:0kB
present:16384kB pages_scanned:2 all_unreclaimable? yes
lowmem_reserve[]: 0 4848 4848
Normal free:13412kB min:8892kB low:11112kB high:13336kB active:1584168kB
inactive:2346492kB present:4964352kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 24kB
Normal: 1*4kB 4*8kB 6*16kB 113*32kB 1*64kB 5*128kB 3*256kB 0*512kB
0*1024kB 0*2048kB 2*4096kB = 13412kB
HighMem: empty
Swap cache: add 10, delete 10, find 3/6, race 0+0
Free swap = 7036300kB
Total swap = 7036304kB
Out of Memory: Killed process 9308 (l804.exe).
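
As a reading aid for the dump above: each per-zone buddy line ("DMA: 0*4kB 1*8kB ...") lists how many free blocks of each size the zone holds, and the trailing total is the sum of count times size. A minimal Python sketch to check that arithmetic follows; the helper name buddy_free_kb is made up for illustration, and the two strings are the DMA and Normal lines from this report with the mail line wrap undone.

```python
import re

def buddy_free_kb(line):
    """Sum the 'count*sizekB' terms of one buddy-allocator line."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)

# Lines copied from the report above (wrapped lines rejoined).
dma = ("DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB "
       "0*1024kB 0*2048kB 0*4096kB = 24kB")
normal = ("Normal: 1*4kB 4*8kB 6*16kB 113*32kB 1*64kB 5*128kB 3*256kB "
          "0*512kB 0*1024kB 0*2048kB 2*4096kB = 13412kB")

print(buddy_free_kb(dma))     # 24
print(buddy_free_kb(normal))  # 13412
```

Both sums agree with the totals the kernel printed, so the DMA zone really has only 24kB free, in two small blocks.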


I also got similar messages running the same software
after setting /proc/sys/vm/overcommit_memory to "2".

Some machine details:
Tyan Tiger i7525 (S2672) motherboard;
2 Intel Xeon 3.2 GHz CPUs (2MB cache each);
4GB of ECC RAM;
Radeon X300 VGA card.

This is running Ubuntu 5.10 (Breezy) for x86_64 systems, with a 2.6.12
kernel (Ubuntu linux-source-2.6.12-10.28) recompiled
with small configuration differences from the default Ubuntu one
(no K8 NUMA, processor family Intel EM64T, SMP support,
CONFIG_SCHED_SMT, some unneeded hardware support suppressed, ...).

Feel free to request any additional data which could be helpful (kernel
configuration, hardware details, ...).

Best regards and thanks in advance

J Esteves
--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-02 09:18:57

by Andrew Morton

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.12 on EM64T

J M Cerqueira Esteves wrote:
>
> On a dual EM64T Xeon with 4GB of RAM, I am getting apparently "innocent"
> processes killed by oom-killer with gfp_mask=0xd1 (with all or almost
> all swap space still available).
>

That's quite an old kernel. If this is the notorious bio-uses-GFP_DMA bug
then I'd have expected this kernel to be useless from day one. Did you
install it recently?

> I haven't tried 2.6.15 kernels yet, but according to recent reports in
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175173
> even those may still have oom-killer problems (like this?).

Yes, I expect it's the same still-unfixed bug.

If you're feeling keen you could add this patch which would confirm it:

--- devel/mm/oom_kill.c~a 2006-03-02 01:16:17.000000000 -0800
+++ devel-akpm/mm/oom_kill.c 2006-03-02 01:16:32.000000000 -0800
@@ -258,6 +258,8 @@ void out_of_memory(unsigned int __nocast
 	struct mm_struct *mm = NULL;
 	task_t * p;

+	dump_stack();
+
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process();
_


And if it's that bug then I'm afraid you'll have to sit tight until 2.6.16.
We shouldn't release 2.6.16 until this thing is fixed.

2006-03-03 15:50:19

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.12 on EM64T

Andrew Morton wrote:
> That's quite an old kernel. If this is the notorious bio-uses-GFP_DMA bug
> then I'd have expected this kernel to be useless from day one. Did you
> install it recently?

On this dual Xeon, yes. I had no problems before with 2.6.12 and the
same "heavy" software on dual Opteron and dual dual-core Opteron
machines, and this is my first installation on an EM64T.
At first everything seemed OK with 2.6.12 here too, but after a
couple of days we started getting some of those oom killings when
running some Gaussian jobs. In at least two cases the system froze
completely.

> If you're feeling keen you could add this patch which would confirm it:

Added it and already got output for a similar "killing". Since I'm not
sure what could be most relevant among those messages, I refrained from
attaching them all here, and instead put them at
http://jmce.artenumerica.org/tmp/linux-2.6.12-oom_killings/EM64T-kern.log

> And if it's that bug then I'm afraid you'll have to sit tight until 2.6.16.
> We shouldn't release 2.6.16 until this thing is fixed.

Do those call traces suggest that uncorrected bug you mention?
(And if yes, is there any known way to mitigate the problem? Could it
depend on BIOS settings?)
I'll also be able to try a 2.6.15 kernel (possibly with any suggested
patches) later today...

Thanks again and best regards

J Esteves
--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-04 15:57:33

by J M Cerqueira Esteves

Subject: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Hi again

Still on the same dual EM64T machine with a Tyan Tiger i7525 (S2672)
motherboard and 4 GB RAM for which I reported 2.6.12 oom killings a few
days ago:

I upgraded to Ubuntu Dapper and installed its latest 2.6.15 kernel,
which incorporates 2.6.15.4. Started with the original "binary"
linux-image-2.6.15-16-amd64-xeon package,
and got a few oom killings even without running the same large test
programs as before. Then recompiled the kernel with
CONFIG_PREEMPT_NONE, CONFIG_SCHED_SMT, no CONFIG_PREEMPT_BKL,
and the dump_stack() call suggested by Andrew Morton for
mm/oom_kill.c [in out_of_memory()].

Repeated tests with Gaussian... and got oom-killer events similar to
those found with 2.6.12. At
http://jmce.artenumerica.org/en/tmp/linux-2.6.15-oom_killings/kern.log
are the kernel messages from the killing of two Gaussian runs;
I just show below the beginning, until the first killing.

Any suggestions on patches or some pre-2.6.16 version I should try?


Call Trace:<ffffffff8015efcb>{out_of_memory+23}
<ffffffff80130465>{__wake_up+56}
<ffffffff80161177>{__alloc_pages+572}
<ffffffff8017fc25>{bio_copy_user+219}
<ffffffff801debbf>{blk_rq_map_user+133} <ffffffff801e1b61>{sg_io+351}
<ffffffff801e1ff8>{scsi_cmd_ioctl+494}
<ffffffff80130465>{__wake_up+56}
<ffffffff80265aac>{sock_def_readable+52}
<ffffffff802c5d68>{unix_dgram_sendmsg+1085}
<ffffffff88077e35>{:sd_mod:sd_ioctl+371}
<ffffffff801e0058>{blkdev_driver_ioctl+93}
<ffffffff801e0726>{blkdev_ioctl+1613}
<ffffffff8018ce76>{do_select+1137}
<ffffffff8026321e>{sys_sendto+251} <ffffffff8018c941>{__pollwait+0}
<ffffffff801813d2>{block_ioctl+27} <ffffffff8018c091>{do_ioctl+33}
<ffffffff8018c36c>{vfs_ioctl+643} <ffffffff8018c3e0>{sys_ioctl+91}
<ffffffff8010fa46>{system_call+126}
oom-killer: gfp_mask=0xd1, order=0
Mem-info:
DMA per-cpu:
cpu 0 hot: low 0, high 0, batch 1 used:0
cpu 0 cold: low 0, high 0, batch 1 used:0
cpu 1 hot: low 0, high 0, batch 1 used:0
cpu 1 cold: low 0, high 0, batch 1 used:0
cpu 2 hot: low 0, high 0, batch 1 used:0
cpu 2 cold: low 0, high 0, batch 1 used:0
cpu 3 hot: low 0, high 0, batch 1 used:0
cpu 3 cold: low 0, high 0, batch 1 used:0
DMA32 per-cpu:
cpu 0 hot: low 0, high 186, batch 31 used:151
cpu 0 cold: low 0, high 62, batch 15 used:50
cpu 1 hot: low 0, high 186, batch 31 used:165
cpu 1 cold: low 0, high 62, batch 15 used:56
cpu 2 hot: low 0, high 186, batch 31 used:5
cpu 2 cold: low 0, high 62, batch 15 used:54
cpu 3 hot: low 0, high 186, batch 31 used:12
cpu 3 cold: low 0, high 62, batch 15 used:2
Normal per-cpu:
cpu 0 hot: low 0, high 186, batch 31 used:118
cpu 0 cold: low 0, high 62, batch 15 used:56
cpu 1 hot: low 0, high 186, batch 31 used:82
cpu 1 cold: low 0, high 62, batch 15 used:59
cpu 2 hot: low 0, high 186, batch 31 used:30
cpu 2 cold: low 0, high 62, batch 15 used:53
cpu 3 hot: low 0, high 186, batch 31 used:10
cpu 3 cold: low 0, high 62, batch 15 used:14
HighMem per-cpu: empty
Free pages: 14924kB (0kB HighMem)
Active:491535 inactive:494210 dirty:148758 writeback:0 unstable:0
free:3731 slab:13610 mapped:483530 pagetables:1658
DMA free:20kB min:24kB low:28kB high:36kB active:0kB inactive:0kB
present:12464kB pages_scanned:2 all_unreclaimable? yes
lowmem_reserve[]: 0 3255 4013 4013
DMA32 free:12752kB min:6564kB low:8204kB high:9844kB active:1475208kB
inactive:1723432kB present:3333792kB pages_scanned:66 all_unreclaimable? no
lowmem_reserve[]: 0 0 757 757
Normal free:2152kB min:1528kB low:1908kB high:2292kB active:490804kB
inactive:253408kB present:775680kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 20kB
DMA32: 28*4kB 27*8kB 20*16kB 20*32kB 33*64kB 2*128kB 0*256kB 0*512kB
1*1024kB 0*2048kB 2*4096kB = 12872kB
Normal: 16*4kB 9*8kB 14*16kB 0*32kB 6*64kB 1*128kB 1*256kB 0*512kB
1*1024kB 0*2048kB 0*4096kB = 2152kB
HighMem: empty
Swap cache: add 79, delete 77, find 7/11, race 0+0
Free swap = 3517888kB
Total swap = 3518152kB
Free swap: 3517888kB
1245184 pages of RAM
232833 reserved pages
219286 pages shared
2 pages swap cached
Out of Memory: Killed process 13792 (l502.exe).
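
A quick way to see which zone is actually exhausted in a dump like the one above is to compare each zone's "free" value with its "min" watermark. The following Python sketch does that; the function name zones_below_min is hypothetical, zones with present:0kB (HighMem here) are skipped, and the sample lines are the zone lines from this report with the mail wrapping undone and the tail of each line trimmed.

```python
import re

ZONE = re.compile(r"^(\w+) free:(\d+)kB min:(\d+)kB.*?present:(\d+)kB",
                  re.MULTILINE)

def zones_below_min(report):
    """Return names of populated zones whose free memory is below 'min'."""
    return [name for name, free, wmark, present in ZONE.findall(report)
            if int(present) > 0 and int(free) < int(wmark)]

# Zone lines from the 2.6.15 report above, one zone per line.
report = """\
DMA free:20kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:12464kB
DMA32 free:12752kB min:6564kB low:8204kB high:9844kB active:1475208kB inactive:1723432kB present:3333792kB
Normal free:2152kB min:1528kB low:1908kB high:2292kB active:490804kB inactive:253408kB present:775680kB
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB
"""

print(zones_below_min(report))  # ['DMA']
```

Only the small DMA zone is below its minimum, while DMA32 and Normal still have headroom and swap is almost entirely free. That fits an allocation restricted to the DMA zone failing even though the machine as a whole is not short of memory.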


Best regards
J Esteves

--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-05 00:14:11

by Andrew Morton

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.12 on EM64T

J M Cerqueira Esteves wrote:
>

argh. Please always do reply-to-all. I almost missed this one.

> Andrew Morton wrote:
> > That's quite an old kernel. If this is the notorious bio-uses-GFP_DMA bug
> > then I'd have expected this kernel to be useless from day one. Did you
> > install it recently?
>
> On this dual Xeon, yes. I had no problems before with 2.6.12 and the
> same "heavy" software on dual Opteron and dual dual-core Opteron
> machines, and this is my first installation on an EM64T.
> At first everything seemed OK with 2.6.12 here too, but after a
> couple of days we started getting some of those oom killings when
> running some Gaussian jobs. In at least two cases the system froze
> completely.
>
> > If you're feeling keen you could add this patch which would confirm it:
>
> Added it and already got output for a similar "killing". Since I'm not
> sure what could be most relevant among those messages, I refrained from
> attaching them all here, and instead put them at
> http://jmce.artenumerica.org/tmp/linux-2.6.12-oom_killings/EM64T-kern.log

Those x86_64 backtraces are quite hard to follow. They get much better if
you enable CONFIG_FRAME_POINTER, and that makes very little difference to
code quality.

> > And if it's that bug then I'm afraid you'll have to sit tight until 2.6.16.
> > We shouldn't release 2.6.16 until this thing is fixed.
>
> Do those call traces suggest that uncorrected bug you mention?

It's hard to say what happened there. I _think_ it went oom in
get_sectorsize()'s GFP_KERNEL|GFP_DMA allocation. (Jens, do we really need
GFP_DMA in there?)

But that's only a 512-byte allocation. Something else must have used up
all the DMA zone.
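
For readers wondering how gfp_mask=0xd1 maps to "GFP_KERNEL|GFP_DMA": the mask is a bit field. The Python sketch below decodes it using flag values as I recall them from include/linux/gfp.h in 2.6.12-era kernels; treat the table as an assumption, since it is partial and these values changed in later kernels, so check the header of the tree actually in use. The name decode_gfp is made up for illustration.

```python
# Assumed bit values (2.6.12-era include/linux/gfp.h); partial table.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

def decode_gfp(mask):
    """Split a gfp_mask into flag names; unknown bits are shown as hex."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mask & bit]
    leftover = mask & ~sum(GFP_BITS)
    if leftover:
        names.append(hex(leftover))
    return names

print(decode_gfp(0xd1))
# ['__GFP_DMA', '__GFP_WAIT', '__GFP_IO', '__GFP_FS']
```

Under that assumption, __GFP_WAIT|__GFP_IO|__GFP_FS is GFP_KERNEL (0xd0), and the remaining 0x01 bit is the DMA-zone restriction.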

2006-03-05 00:16:57

by Andrew Morton

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

J M Cerqueira Esteves wrote:
>
> Still on the same dual EM64T machine with a Tyan Tiger i7525 (S2672)
> motherboard and 4 GB RAM for which I reported 2.6.12 oom killings a few
> days ago:
>
> I upgraded to Ubuntu Dapper and installed its latest 2.6.15 kernel,
> which incorporates 2.6.15.4. Started with the original "binary"
> linux-image-2.6.15-16-amd64-xeon package,
> and got a few oom killings even without running the same large test
> programs as before. Then recompiled the kernel with
> CONFIG_PREEMPT_NONE, CONFIG_SCHED_SMT, no CONFIG_PREEMPT_BKL,
> and the dump_stack() call suggested by Andrew Morton for
> mm/oom_kill.c [in out_of_memory()].
>
> Repeated tests with Gaussian... and got oom-killer events similar to
> those found with 2.6.12. At
> http://jmce.artenumerica.org/en/tmp/linux-2.6.15-oom_killings/kern.log
> are the kernel messages from the killing of two Gaussian runs;
> I just show below the beginning, until the first killing.
>
> Any suggestions on patches or some pre-2.6.16 version I should try?
>
>
> Call Trace:<ffffffff8015efcb>{out_of_memory+23}
> <ffffffff80130465>{__wake_up+56}
> <ffffffff80161177>{__alloc_pages+572}
> <ffffffff8017fc25>{bio_copy_user+219}
> <ffffffff801debbf>{blk_rq_map_user+133} <ffffffff801e1b61>{sg_io+351}
> <ffffffff801e1ff8>{scsi_cmd_ioctl+494}
> <ffffffff80130465>{__wake_up+56}
> <ffffffff80265aac>{sock_def_readable+52}
> <ffffffff802c5d68>{unix_dgram_sendmsg+1085}
> <ffffffff88077e35>{:sd_mod:sd_ioctl+371}
> <ffffffff801e0058>{blkdev_driver_ioctl+93}
> <ffffffff801e0726>{blkdev_ioctl+1613}
> <ffffffff8018ce76>{do_select+1137}
> <ffffffff8026321e>{sys_sendto+251} <ffffffff8018c941>{__pollwait+0}
> <ffffffff801813d2>{block_ioctl+27} <ffffffff8018c091>{do_ioctl+33}
> <ffffffff8018c36c>{vfs_ioctl+643} <ffffffff8018c3e0>{sys_ioctl+91}
> <ffffffff8010fa46>{system_call+126}
> oom-killer: gfp_mask=0xd1, order=0

Yup, that looks like the same bug.
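
To make the old-style x86_64 backtrace quoted above easier to scan, the symbol names can be pulled out with a regular expression. This is only a Python sketch: the pattern is written for the "<address>{symbol+offset}" format seen in this thread, the name frame_symbols is hypothetical, and without CONFIG_FRAME_POINTER the list can contain stale entries (such as __wake_up) that were not really part of the call chain.

```python
import re

# Matches entries such as <ffffffff8017fc25>{bio_copy_user+219}
FRAME = re.compile(r"<[0-9a-f]{16}>\{([^}+]+)\+\d+\}")

def frame_symbols(trace):
    """Return the symbol names in the order they appear (innermost first)."""
    return FRAME.findall(trace)

# First lines of the trace quoted above.
trace = """\
Call Trace:<ffffffff8015efcb>{out_of_memory+23}
<ffffffff80130465>{__wake_up+56}
<ffffffff80161177>{__alloc_pages+572}
<ffffffff8017fc25>{bio_copy_user+219}
<ffffffff801debbf>{blk_rq_map_user+133} <ffffffff801e1b61>{sg_io+351}
<ffffffff801e1ff8>{scsi_cmd_ioctl+494}
"""

names = frame_symbols(trace)
print(names)
print("bio_copy_user" in names and "out_of_memory" in names)  # True
```

Read from the bottom up, the extracted names show an SG_IO ioctl going through blk_rq_map_user and bio_copy_user into __alloc_pages, which then ends in out_of_memory.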

We have a candidate fix at
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
Could you test that? (and don't alter the Cc: list!). The patch is
against 2.6.16-rc5.

Thanks.

2006-03-05 02:54:05

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Andrew Morton wrote:
> We have a candidate fix at
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
> Could you test that? (and don't alter the Cc: list!). The patch is
> against 2.6.16-rc5.

Thanks! I'll test it in a few hours, after a short "barbaric" test
inspired (perhaps naively) by those call traces: running with a 2.6.15.4
without SCSI cd-rom support (with multiple Gaussian processes, no
oom-killings until now (3 hours)).

Best regards
J Esteves



2006-03-05 10:08:03

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

J M Cerqueira Esteves wrote:
> I'll test it in a few hours, after a short "barbaric" test
> inspired (perhaps naively) by those call traces: running with a 2.6.15.4
> without SCSI cd-rom support (with multiple Gaussian processes, no
> oom-killings until now (3 hours)).

And still running after 10 hours, but now I increased the load adding
another Gaussian run (still not requiring swap) and oom-killer
manifested itself again, although no killings were reported:

[35948.126969] Call Trace:<ffffffff8015efcb>{out_of_memory+23}
<ffffffff80161177>{__alloc_pages+572}
[35948.127018] <ffffffff8017fc25>{bio_copy_user+219}
<ffffffff801debbf>{blk_rq_map_user+133}
[35948.127073] <ffffffff801e1b61>{sg_io+351}
<ffffffff801e1ff8>{scsi_cmd_ioctl+494}
[35948.127135] <ffffffff80130465>{__wake_up+56}
<ffffffff80265aac>{sock_def_readable+52}
[35948.127162] <ffffffff802c5d68>{unix_dgram_sendmsg+1085}
<ffffffff88077e35>{:sd_mod:sd_ioctl+371}
[35948.127231] <ffffffff801e0058>{blkdev_driver_ioctl+93}
<ffffffff801e0726>{blkdev_ioctl+1613}
[35948.127277] <ffffffff8018ce76>{do_select+1137}
<ffffffff8026321e>{sys_sendto+251}
[35948.127334] <ffffffff8018c941>{__pollwait+0}
<ffffffff801813d2>{block_ioctl+27}
[35948.127367] <ffffffff8018c091>{do_ioctl+33}
<ffffffff8018c36c>{vfs_ioctl+643}
[35948.127383] <ffffffff8018c3e0>{sys_ioctl+91}
<ffffffff8010fa46>{system_call+126}
[35948.127419]
[35948.127453] oom-killer: gfp_mask=0xd1, order=0
[35948.127456] Mem-info:
[35948.127458] DMA per-cpu:
[35948.127461] cpu 0 hot: low 0, high 0, batch 1 used:0
[35948.127464] cpu 0 cold: low 0, high 0, batch 1 used:0
[35948.127468] cpu 1 hot: low 0, high 0, batch 1 used:0
[35948.127471] cpu 1 cold: low 0, high 0, batch 1 used:0
[35948.127474] cpu 2 hot: low 0, high 0, batch 1 used:0
[35948.127478] cpu 2 cold: low 0, high 0, batch 1 used:0
[35948.127481] cpu 3 hot: low 0, high 0, batch 1 used:0
[35948.127484] cpu 3 cold: low 0, high 0, batch 1 used:0
[35948.127487] DMA32 per-cpu:
[35948.127490] cpu 0 hot: low 0, high 186, batch 31 used:173
[35948.127494] cpu 0 cold: low 0, high 62, batch 15 used:55
[35948.127497] cpu 1 hot: low 0, high 186, batch 31 used:64
[35948.127501] cpu 1 cold: low 0, high 62, batch 15 used:10
[35948.127504] cpu 2 hot: low 0, high 186, batch 31 used:148
[35948.127508] cpu 2 cold: low 0, high 62, batch 15 used:5
[35948.127511] cpu 3 hot: low 0, high 186, batch 31 used:157
[35948.127515] cpu 3 cold: low 0, high 62, batch 15 used:11
[35948.127517] Normal per-cpu:
[35948.127521] cpu 0 hot: low 0, high 186, batch 31 used:144
[35948.127524] cpu 0 cold: low 0, high 62, batch 15 used:50
[35948.127528] cpu 1 hot: low 0, high 186, batch 31 used:24
[35948.127531] cpu 1 cold: low 0, high 62, batch 15 used:7
[35948.127535] cpu 2 hot: low 0, high 186, batch 31 used:30
[35948.127538] cpu 2 cold: low 0, high 62, batch 15 used:15
[35948.127541] cpu 3 hot: low 0, high 186, batch 31 used:17
[35948.127545] cpu 3 cold: low 0, high 62, batch 15 used:14
[35948.127548] HighMem per-cpu: empty
[35948.127552] Free pages: 84424kB (0kB HighMem)
[35948.127557] Active:827765 inactive:133975 dirty:283294 writeback:0
unstable:0 free:21106 slab:20548 mapped:432093 pagetables:1441
[35948.127563] DMA free:20kB min:24kB low:28kB high:36kB active:0kB
inactive:0kB present:12464kB pages_scanned:2 all_unreclaimable? yes
[35948.127568] lowmem_reserve[]: 0 3255 4013 4013
[35948.127576] DMA32 free:82028kB min:6564kB low:8204kB high:9844kB
active:2587856kB inactive:511984kB present:3333792kB pages_scanned:0
all_unreclaimable? no
[35948.127580] lowmem_reserve[]: 0 0 757 757
[35948.127587] Normal free:2376kB min:1528kB low:1908kB high:2292kB
active:723204kB inactive:23916kB present:775680kB pages_scanned:0
all_unreclaimable? no
[35948.127592] lowmem_reserve[]: 0 0 0 0
[35948.127598] HighMem free:0kB min:128kB low:128kB high:128kB
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
[35948.127602] lowmem_reserve[]: 0 0 0 0
[35948.127606] DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 20kB
[35948.127619] DMA32: 931*4kB 664*8kB 216*16kB 169*32kB 94*64kB 6*128kB
36*256kB 0*512kB 1*1024kB 19*2048kB 2*4096kB = 82028kB
[35948.127632] Normal: 8*4kB 51*8kB 1*16kB 0*32kB 4*64kB 1*128kB 0*256kB
1*512kB 1*1024kB 0*2048kB 0*4096kB = 2376kB
[35948.127644] HighMem: empty
[35948.127648] Swap cache: add 120, delete 120, find 14/25, race 0+1
[35948.127651] Free swap = 3517904kB
[35948.127654] Total swap = 3518152kB
[35948.127656] Free swap: 3517904kB
[35948.146970] 1245184 pages of RAM
[35948.146974] 232833 reserved pages
[35948.146977] 424256 pages shared
[35948.146980] 0 pages swap cached

I'll now test the x86_64-mm-blk-bounce.patch (with CONFIG_FRAME_POINTER
enabled).

Best regards
J Esteves


--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-06 08:47:35

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Andrew Morton wrote:
> We have a candidate fix at
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
> Could you test that? (and don't alter the Cc: list!). The patch is
> against 2.6.16-rc5.

Testing that kernel now, with good news: the machine has been apparently
stable, running Gaussian processes for the last 20 hours, with no
oom-killer messages.

A new "feature": 36 of these kernel message pairs at boot time:
device-mapper: dm-linear: Device lookup failed
device-mapper: error adding target to table

Many thanks and best regards
J Esteves

--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-06 09:04:36

by Andrew Morton

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

J M Cerqueira Esteves wrote:
>
> Andrew Morton wrote:
> > We have a candidate fix at
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
> > Could you test that? (and don't alter the Cc: list!). The patch is
> > against 2.6.16-rc5.
>
> Testing that kernel now, with good news: the machine has been apparently
> stable, running Gaussian processes for the last 20 hours, with no
> oom-killer messages.

OK, thanks. The first iteration of that patch caused ia64 to go BUG, so we
took the BUG out. We're calling init_emergency_isa_pool() on ia64 which
seems rather silly. So my confidence level in that patch remains low, and
our need for it is high.

> A new "feature": 36 of these kernel message pairs at boot time:
> device-mapper: dm-linear: Device lookup failed
> device-mapper: error adding target to table
>

OK, there were some fairly large DM patches touching on
dm_get_device(). Cc added ;)

2006-03-06 09:19:40

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
> Could you test that? (and don't alter the Cc: list!). The patch is
> against 2.6.16-rc5.

I forgot to mention that the DVD drive was not automatically recognized:

ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0x18F0 irq 14
ata1: dev 0 cfg 49:0f00 82:0218 83:4000 84:4000 85:0218 86:0000 87:4000
88:041f
ata1: dev 0 ATAPI, max UDMA/66
ata1: dev 0 configured for UDMA/33
scsi0 : ata_piix
ata1(0): WARNING: ATAPI is disabled, device ignored.

Is this still as described in
http://www.thinkwiki.org/wiki/Problems_with_SATA_and_Linux
under "DVD drive not recognized"? Perhaps I'll be able to do some tests
on that later, too.

Best regards
J Esteves
--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-06 09:30:40

by Andrew Morton

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

J M Cerqueira Esteves wrote:
>
> Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
> > Could you test that? (and don't alter the Cc: list!). The patch is
> > against 2.6.16-rc5.
>
> I forgot to mention that the DVD drive was not automatically recognized:
>
> ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0x18F0 irq 14
> ata1: dev 0 cfg 49:0f00 82:0218 83:4000 84:4000 85:0218 86:0000 87:4000
> 88:041f
> ata1: dev 0 ATAPI, max UDMA/66
> ata1: dev 0 configured for UDMA/33
> scsi0 : ata_piix
> ata1(0): WARNING: ATAPI is disabled, device ignored.
>
> Is this still as described in
> http://www.thinkwiki.org/wiki/Problems_with_SATA_and_Linux
> under "DVD drive not recognized"? Perhaps I'll be able to do some tests
> on that later, too.
>

I've not been following the saga of atapi-versus-libata at all closely.
Booting with libata.atapi_enabled=1 might make things work. I think Randy
should know what happened here?

You were testing 2.6.16-rc5, yes? What did you expect to see and what were
you seeing in earlier kernels (which versions?) (IOW: what did we break
this time?)

2006-03-06 10:45:52

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Andrew Morton wrote:
> I've not been following the saga of atapi-versus-libata at all closely.
> Booting with libata.atapi_enabled=1 might make things work. I think Randy
> should know what happened here?
>
> You were testing 2.6.16-rc5, yes? What did you expect to see and what were
> you seeing in earlier kernels (which versions?) (IOW: what did we break
> this time?)

Yes, this was with 2.6.16-rc5 with the suggested patch.

I haven't tried libata.atapi_enabled=1 yet (I'll do it on the first
reboot after this set of tests with Gaussian).


Under both 2.6.12 (as supplied with Ubuntu Breezy) and 2.6.15 (as
supplied with the current Ubuntu Dapper, incorporating 2.6.15.4) we had:

ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0x18F0 irq 14
ata1: dev 0 cfg 49:0f00 82:0218 83:4000 84:4000 85:0218 86:0000 87:4000
88:041f
ata1: dev 0 ATAPI, max UDMA/66
ata1: dev 0 configured for UDMA/33
scsi0 : ata_piix
isa bounce pool size: 16 pages
Vendor: ASUS Model: DRW-1608P2S Rev: 1.37
Type: CD-ROM ANSI SCSI revision: 05

But we didn't use the drive much until now (mostly just for Linux
installation, without CD reading problems) so I have no additional data
on possible issues with previous kernels...

Best regards
J Esteves
--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce



2006-03-06 15:55:26

by Randy Dunlap

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

On Mon, 06 Mar 2006 10:45:46 +0000 J M Cerqueira Esteves wrote:

> Andrew Morton wrote:
> > I've not been following the saga of atapi-versus-libata at all closely.
> > Booting with libata.atapi_enabled=1 might make things work. I think Randy
> > should know what happened here?
> >
> > You were testing 2.6.16-rc5, yes? What did you expect to see and what were
> > you seeing in earlier kernels (which versions?) (IOW: what did we break
> > this time?)
>
> Yes, this was with 2.6.16-rc5 with the suggested patch.
>
> I haven't tried libata.atapi_enabled=1 yet (I'll do it on the first
> reboot after this set of tests with Gaussian).

Yes, that should be all you need to do in current kernels.
Maybe Ubuntu already has that enabled for you. :)


> Under both 2.6.12 (as supplied with Ubuntu Breezy) and 2.6.15 (as
> supplied with the current Ubuntu Dapper, incorporating 2.6.15.4) we had:
>
> ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0x18F0 irq 14
> ata1: dev 0 cfg 49:0f00 82:0218 83:4000 84:4000 85:0218 86:0000 87:4000
> 88:041f
> ata1: dev 0 ATAPI, max UDMA/66
> ata1: dev 0 configured for UDMA/33
> scsi0 : ata_piix
> isa bounce pool size: 16 pages
> Vendor: ASUS Model: DRW-1608P2S Rev: 1.37
> Type: CD-ROM ANSI SCSI revision: 05
>
> But we didn't use the drive much until now (mostly just for Linux
> installation, without CD reading problems) so I have no additional data
> on possible issues with previous kernels...
>
> Best regards
> J Esteves
> --
> +351 939838775 Skype:jmcerqueira http://del.icio.us/jmce
>


---
~Randy

2006-03-06 18:01:57

by Junichi Nomura

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

Andrew Morton wrote:
> J M Cerqueira Esteves wrote:
>>>We have a candidate fix at
>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
>>> Could you test that? (and don't alter the Cc: list!). The patch is
>>>against 2.6.16-rc5.
>
>>A new "feature": 36 of these kernel message pairs at boot time:
>> device-mapper: dm-linear: Device lookup failed
>> device-mapper: error adding target to table
>
> OK, there were some fairly large DM patches touching on
> dm_get_device(). Cc added ;)

Thanks Andrew for Cc-ing.

Sorry, but I don't think my bd_claim patches affect this problem,
as the patches are neither bug fixes nor included in
2.6.16-rc5-mm2 yet.

So if the problem persists, I would suggest consulting
[email protected] about the problem.

If it's possible to do some testing on the system,
I think the following are worth trying:
- Checking if the problem occurs with plain 2.6.15
(not the one from distributor).
- Checking how the device-mapper devices are configured.
(e.g. comparing the output of "dmsetup table" command
with the one on the original kernel)
- Checking what lookup failed (printk below will show them).
[It's better if dm shows this information from the first time..]
Then checking whether the failed devices exist in the system
or initrds, whether they are mounted or used by md.

--- linux-2.6.16-rc5-mm2.tmp/drivers/md/dm-linear.c 2006-03-03 15:42:32.000000000 -0500
+++ linux-2.6.16-rc5-mm2/drivers/md/dm-linear.c 2006-03-06 10:17:16.000000000 -0500
@@ -47,6 +47,7 @@ static int linear_ctr(struct dm_target *

 	if (dm_get_device(ti, argv[0], lc->start, ti->len,
 			  dm_table_get_mode(ti->table), &lc->dev)) {
+		printk("dm-linear: failed to lookup %s\n", argv[0]);
 		ti->error = "dm-linear: Device lookup failed";
 		goto bad;
 	}

Thanks,
--
Jun'ichi Nomura, NEC Solutions (America), Inc.

2006-03-17 09:48:06

by J M Cerqueira Esteves

Subject: Re: oom-killer: gfp_mask=0xd1 with 2.6.15.4 on EM64T [previously 2.6.12]

J M Cerqueira Esteves wrote:
> Andrew Morton wrote:
>>We have a candidate fix at
>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/x86_64-mm-blk-bounce.patch.
>> [...] The patch is against 2.6.16-rc5.
>
> Testing that kernel now, with good news: the machine has been apparently
> stable, running Gaussian processes for the last 20 hours, with no
> oom-killer messages.

... and still using that 2.6.16-rc5 with the suggested patch,
during the last 11 days, always doing a lot of number-crunching with
Gaussian and other programs, we had no more oom-killings or other
noticeable instabilities.

I did take the opportunity to configure the kernel with CONFIG_EDAC,
CONFIG_EDAC_MM_EDAC and CONFIG_EDAC_E752X, and during this period (11
days) got about 20 messages like these:

Mar 7 15:25:08 localhost kernel: [182069.699544] Non-Fatal Error DRAM
Controler
Mar 7 15:25:08 localhost kernel: [182069.699559] EDAC MC0: CE page
0x9c334, offset 0x0, grain 4096, syndrome 0x2510, row 2, channel 1,
label "": e752x CE

always with the same page, offset, grain, syndrome, row, and
channel values. A defective DIMM?
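
One way to confirm that the corrected errors always hit the same location is to group the EDAC lines by page, row and channel. The Python sketch below does this. The names CE and ce_locations are hypothetical, and the sample list simply repeats the single message quoted above three times as a stand-in for the roughly 20 occurrences, so it is not real additional log data. The physical address is computed assuming 4096-byte pages, which matches the reported grain.

```python
import re
from collections import Counter

CE = re.compile(r"CE page (0x[0-9a-f]+), offset (0x[0-9a-f]+), grain (\d+), "
                r"syndrome (0x[0-9a-f]+), row (\d+), channel (\d+)")

def ce_locations(lines):
    """Count corrected errors per (page, row, channel)."""
    counts = Counter()
    for line in lines:
        m = CE.search(line)
        if m:
            page, _offset, _grain, _syndrome, row, channel = m.groups()
            counts[(int(page, 16), int(row), int(channel))] += 1
    return counts

# The one message from this mail (wrapped lines rejoined), repeated as a
# stand-in for the real log.
msg = ("EDAC MC0: CE page 0x9c334, offset 0x0, grain 4096, "
       'syndrome 0x2510, row 2, channel 1, label "": e752x CE')
counts = ce_locations([msg] * 3)

(page, row, channel), hits = counts.most_common(1)[0]
addr = page * 4096  # assumes 4 kB pages
print(hex(addr), row, channel, hits)  # 0x9c334000 2 1 3
```

If the real log also collapses to a single (page, row, channel) entry, the errors point at one spot on one module (row 2, channel 1) rather than at random locations.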

Best regards
J Esteves

--
+351 939838775 Skype:jmcerqueira http://del.icio.us/jmce

