2001-10-31 13:45:16

by Jeff Garzik

[permalink] [raw]
Subject: pre6 oom killer oops

total: used: free: shared: buffers: cached:
Mem: 391372800 302718976 88653824 0 34152448 187908096
Swap: 418996224 827392 418168832
MemTotal: 382200 kB
MemFree: 86576 kB
MemShared: 0 kB
Buffers: 33352 kB
Cached: 182720 kB
SwapCached: 784 kB
Active: 78520 kB
Inactive: 152320 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 382200 kB
LowFree: 86576 kB
SwapTotal: 409176 kB
SwapFree: 408368 kB


Attachments:
dmesg.txt (15.01 kB)
oops.txt (8.01 kB)
sysrqm.txt (804.00 B)
meminfo.txt (522.00 B)
Download all attachments

2001-10-31 13:51:46

by Jeff Garzik

[permalink] [raw]
Subject: Re: pre6 oom killer oops

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 808 96536 27888 181040 0 0 38 159 1102 204 60 9 31
3 0 0 808 96688 27928 181040 0 0 29 0 1100 300 80 19 0
1 0 0 808 96352 27928 181024 0 0 0 0 1070 239 81 19 0
1 0 0 808 96664 27928 181024 0 0 0 0 1063 223 80 20 0
1 0 0 808 96616 27928 181024 0 0 0 8 1077 248 80 20 0
1 0 0 808 96432 27928 181024 0 0 0 0 1073 250 78 22 0
1 0 0 808 96512 27928 181024 0 0 0 0 1059 215 80 20 0
1 0 0 808 96336 27928 181024 0 0 0 24 1052 163 83 17 0
1 0 0 808 96648 27928 181024 0 0 0 48 1069 197 82 18 0
1 0 0 808 96328 27928 181024 0 0 0 0 1039 160 83 17 0
1 0 0 808 96304 27928 181024 0 0 0 0 1058 213 81 19 0
1 0 0 808 96336 27928 181016 0 0 0 0 1062 219 80 20 0
1 0 0 808 96416 27928 181016 0 0 0 0 1134 310 79 21 0
1 0 0 808 96616 27928 181016 0 0 0 0 1148 298 79 20 0
1 0 0 808 96336 27928 181016 0 0 0 0 1162 322 80 20 0
1 0 0 808 96504 27928 181008 0 0 0 0 1143 288 80 19 0
2 0 0 808 95944 27928 181008 0 0 0 8 1140 248 85 14 0
1 0 0 808 96568 27928 181008 0 0 0 8 1040 160 88 12 0
2 0 0 808 96320 27928 181008 0 0 0 8 1031 106 89 11 0
2 0 0 808 96320 27928 181008 0 0 0 0 1030 112 88 12 0
1 0 0 808 94952 27928 181008 0 0 0 0 1032 124 89 11 0
1 0 0 808 94936 27928 181008 0 0 0 8 1033 102 90 10 0
1 0 0 808 95160 27928 181008 0 0 0 0 1028 89 90 10 0
1 0 0 808 96376 27928 181296 0 0 0 8 1035 177 85 15 0
1 0 0 808 69520 28584 202488 0 0 0 0 1028 419 86 14 0
1 0 0 808 45432 30168 224912 0 0 2 16 1040 816 92 8 0
1 0 0 808 30824 31640 238048 0 0 12 8 1044 694 95 5 0
1 0 0 808 15632 32776 252096 0 0 1 0 1027 733 96 4 0
2 0 0 808 9208 31472 265232 0 0 32 0 1062 648 84 15 0
1 0 0 808 16032 30696 260584 0 0 1 0 1028 444 67 32 1
1 0 0 808 6904 30536 259304 0 0 1 8 1031 145 91 9 0
1 0 0 808 7424 30560 259560 0 0 0 8 1031 41 97 3 0
1 0 1 808 13704 30592 259464 0 0 0 3032 1053 42 96 4 0
1 0 0 808 12992 30720 260096 0 0 0 3388 1111 280 52 24 24
1 0 0 808 14344 30752 260288 0 0 0 0 1030 475 71 29 0
2 0 0 808 13408 30832 260416 0 0 2 0 1039 242 86 14 0
1 0 0 808 14912 30440 258952 0 0 0 0 1028 128 91 9 0
1 0 0 808 11352 30472 259144 0 0 0 0 1028 187 88 12 0
1 0 0 808 14608 30512 259232 0 0 0 0 1027 210 86 14 0
1 0 0 808 16320 30536 259720 0 0 0 0 1034 389 74 25 0
1 0 0 808 16128 30632 259824 0 0 0 0 1032 408 68 31 0
2 0 0 808 13312 30680 260024 0 0 1 0 1029 369 76 24 0
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
3 0 0 808 6560 30696 260088 0 0 1 0 1028 126 93 7 0
1 0 0 808 7760 30408 256552 0 0 1 0 1028 36 98 2 0
2 0 0 808 16672 30424 256704 0 0 0 0 1027 133 93 7 0
1 0 0 808 11864 30424 256744 0 0 0 0 1027 99 95 5 0
1 0 0 808 12664 30432 256848 0 0 0 0 1027 83 95 5 0
1 1 0 808 19504 30440 256912 0 0 0 0 1032 151 92 8 0
3 0 0 808 18000 30440 256984 0 0 0 0 1027 145 91 9 0
1 0 0 808 11216 30448 257040 0 0 0 0 1027 136 92 8 0
1 0 0 808 13656 30456 257096 0 0 0 0 1027 102 94 6 0
1 0 0 808 12200 30456 257160 0 0 0 0 1028 158 91 9 0
1 0 0 808 13624 30464 257256 0 0 0 0 1028 126 93 7 0
1 0 0 808 6600 30464 257288 0 0 0 0 1028 80 96 4 0
1 0 0 808 5152 29872 255968 0 0 0 0 1027 99 95 5 0
1 0 0 808 12480 28616 251080 0 0 0 0 1027 16 99 1 0
2 0 0 808 26448 28624 251392 0 0 0 0 1027 147 92 8 0
2 0 0 808 22928 28624 251456 0 0 0 0 1028 164 91 9 0
1 0 0 808 18296 28640 251536 0 0 0 0 1028 139 92 8 0
2 0 0 808 21784 28648 251624 0 0 0 0 1027 157 91 9 0
1 0 0 808 20624 28648 251696 0 0 0 0 1027 166 91 9 0
1 0 0 808 18000 28656 251768 0 0 0 0 1027 131 93 7 0
1 0 0 808 19920 28656 251840 0 0 0 0 1027 177 90 10 0
1 0 0 808 19168 28656 251904 0 0 0 0 1027 172 91 9 0
2 0 0 808 25480 28656 251976 0 0 0 0 1027 169 91 9 0
1 0 0 808 20200 28672 252064 0 0 1 0 1028 138 91 8 1
1 0 0 808 19344 28680 252168 0 0 1 0 1028 54 97 3 0
2 0 0 808 24088 28800 253800 0 0 0 0 1030 129 87 13 1
1 0 0 808 16728 28840 254176 0 0 1 0 1029 91 92 8 0
1 0 0 808 23864 28856 254560 0 0 0 0 1027 103 91 9 0
1 1 0 808 23544 28864 254728 0 0 1 0 1031 155 88 12 0
2 0 0 808 9624 28872 254808 0 0 1 0 1028 130 90 10 0
2 0 0 808 22912 28888 254928 0 0 0 0 1027 77 96 4 0
2 0 0 808 22912 28904 255256 0 0 0 0 1028 133 89 11 0
1 0 0 808 27544 28536 251608 0 0 0 0 1032 80 96 4 0
2 0 0 808 91952 30104 187432 0 0 1 0 1047 1098 65 34 1
1 0 1 808 89032 30192 187512 0 0 8 244 1103 476 61 24 14
1 0 1 808 89744 30240 188344 0 0 0 4391 1110 108 86 11 2
1 0 0 808 85720 30264 188536 0 0 0 243 1066 100 90 10 0
1 0 0 808 78528 30280 188792 0 0 1 120 1057 71 96 4 0
1 0 1 808 86880 30312 189608 0 0 2 213 1029 81 93 7 1
1 0 0 808 81424 30336 189608 0 0 0 67 1058 450 57 22 21
2 0 0 808 89184 30384 190296 0 0 0 293 1062 44 91 9 0
2 1 1 808 86416 30416 190648 0 0 0 171 1049 349 84 16 0
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 808 79536 30424 190760 0 0 0 0 1035 109 96 4 0
1 0 0 808 78008 30432 190896 0 0 0 64 1038 82 97 3 0
1 0 0 808 84400 30432 191104 0 0 0 52 1038 91 97 3 0
1 0 0 808 69352 30432 191120 0 0 0 61 1039 81 97 3 0
1 0 0 808 77088 30448 191360 0 0 0 78 1031 82 97 3 0
2 0 0 808 86048 30464 192064 0 0 0 0 1035 84 94 6 0
1 0 0 808 77464 30480 192096 0 0 0 203 1063 457 64 24 11
2 0 0 808 85800 30568 192672 0 0 1 183 1059 116 89 11 0
1 0 0 808 80568 30584 192776 0 0 0 0 1033 124 93 7 0
1 0 0 808 77840 30600 192880 0 0 0 62 1042 126 93 7 0
1 0 0 808 79936 30632 193000 0 0 1 52 1041 105 94 6 0
1 0 0 808 76880 30688 193880 0 0 20 142 1049 85 93 6 1
2 0 0 808 84016 30760 194432 0 0 17 0 1030 394 75 23 2
2 0 0 808 84152 30800 194768 0 0 0 263 1066 313 80 20 0
1 0 0 808 83144 30840 195488 0 0 0 111 1059 401 82 18 0
1 0 1 808 83056 30848 195864 0 0 0 203 1043 434 79 21 0
1 0 0 808 75984 30864 195936 0 0 0 90 1049 497 62 26 12
2 0 0 808 81424 30896 196248 0 0 0 119 1058 281 59 21 21
2 0 0 808 81424 30920 196664 0 0 0 142 1056 344 71 15 14
1 0 0 808 79424 30928 196760 0 0 0 0 1028 445 78 22 0
1 0 0 808 81672 30960 197024 0 0 0 140 1055 429 61 23 17
1 0 0 808 114576 31176 163768 0 0 1 8 1039 386 77 23 0
2 0 0 808 112984 31224 163864 0 0 2 0 1031 421 68 31 1
1 0 0 808 100728 31224 163896 0 0 0 0 1027 62 94 6 0
1 0 0 808 96984 31224 163896 0 0 0 0 1027 32 98 2 0
1 0 0 808 101112 31232 163904 0 0 0 0 1027 45 97 3 0
1 0 0 808 95768 31232 163904 0 0 0 0 1027 33 98 2 0
1 0 0 808 88960 31232 163904 0 0 0 0 1027 15 99 1 0
2 0 0 808 95888 31232 163904 0 0 0 2 1029 39 98 2 0
1 0 0 808 81376 31232 163904 0 0 0 7 1034 21 99 1 0
1 0 0 808 81720 31232 163904 0 0 0 0 1027 13 99 1 0
1 0 0 808 77920 31232 163904 0 0 0 0 1027 11 99 1 0
1 0 0 808 73288 31248 163912 0 0 3 0 1028 13 98 1 1
2 0 0 808 67408 31248 163912 0 0 0 0 1027 14 99 1 0
1 0 0 808 107184 31248 163912 0 0 0 0 1028 55 96 4 0
2 1 0 808 112496 31248 163928 0 0 0 0 1027 49 97 3 0
1 0 0 808 90632 31264 163976 0 0 4 0 1028 43 96 3 1
1 0 0 808 75424 31264 163976 0 0 0 0 1027 10 98 2 0
1 0 0 808 69848 31264 163976 0 0 0 0 1027 4 99 1 0
1 0 0 808 69848 31264 163976 0 0 0 0 1027 4 100 0 0
1 0 0 808 65464 31264 163976 0 0 0 0 1027 4 96 4 0
1 0 0 808 65696 31264 163976 0 0 0 0 1027 11 97 3 0
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 808 63672 31264 163976 0 0 0 0 1028 13 100 0 0
1 0 0 808 95744 31280 164392 0 0 0 0 1027 89 92 8 0
1 0 0 808 94544 31280 164392 0 0 0 0 1027 11 99 1 0
1 0 0 808 98472 31280 164392 0 0 0 0 1027 33 98 2 0
2 0 0 808 111552 31280 164392 0 0 0 0 1027 18 99 1 0
1 0 0 808 101904 31280 164392 0 0 0 0 1027 39 97 3 0
1 0 0 808 98656 31280 164440 0 0 0 0 1027 47 97 3 0
1 0 0 808 92456 31280 164440 0 0 0 0 1028 14 99 1 0
1 0 0 808 84640 31280 164440 0 0 0 0 1027 12 99 1 0
1 0 0 808 94816 31288 164440 0 0 0 0 1027 29 98 2 0
1 0 0 808 103008 31288 164456 0 0 0 0 1027 34 97 3 0
2 0 0 808 112952 31296 164464 0 0 0 0 1027 59 95 5 0
1 0 0 808 98552 31304 164488 0 0 0 0 1027 65 95 5 0
1 0 0 808 92896 31304 164488 0 0 0 0 1027 13 99 1 0
1 0 0 808 88040 31304 164488 0 0 0 0 1028 11 99 1 0
2 0 0 808 100520 31312 164488 0 0 0 0 1027 29 98 2 0
1 0 0 808 103320 31312 164488 0 0 0 0 1029 28 98 2 0
1 0 0 808 100328 31312 164544 0 0 0 0 1027 56 96 4 0
1 0 0 808 100896 31320 164648 0 0 0 0 1027 30 98 2 0
1 0 0 808 102568 31320 164680 0 0 0 0 1041 117 94 6 0
2 0 0 808 99160 31320 164704 0 0 0 0 1065 108 98 2 0
1 0 0 808 104864 31328 164872 0 0 0 0 1035 58 96 4 0
1 0 0 808 100576 31336 164920 0 0 0 0 1028 54 97 3 0
1 0 0 808 101512 31344 165080 0 0 0 0 1028 30 98 2 0


Attachments:
vmstat.txt (12.34 kB)

2001-10-31 14:12:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: pre6 oom killer oops

further comments #2:

when rebooting, there was some disk corruption in the ext2 filesystem.

It is my guess that this is to the large number of buffers in the vmstat
output, which I believe are dirty buffers that never got written out

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-31 14:18:57

by Alexander Viro

[permalink] [raw]
Subject: Re: pre6 oom killer oops



On Wed, 31 Oct 2001, Jeff Garzik wrote:

> further comments #2:
>
> when rebooting, there was some disk corruption in the ext2 filesystem.
>
> It is my guess that this is to the large number of buffers in the vmstat
> output, which I believe are dirty buffers that never got written out

Judging by your log it's not an OOM - page table corruption got caught by
do_wp_page(), which means that handle_mm_fault() fails (surprise, surprise),
which kills the process.

Looks like a massive memory corruption - later it fscked you in pte_alloc()
and then it screwed buffer cache lists.

2001-10-31 14:27:29

by Jeff Garzik

[permalink] [raw]
Subject: Re: pre6 oom killer oops

Alexander Viro wrote:
>
> On Wed, 31 Oct 2001, Jeff Garzik wrote:
>
> > further comments #2:
> >
> > when rebooting, there was some disk corruption in the ext2 filesystem.
> >
> > It is my guess that this is to the large number of buffers in the vmstat
> > output, which I believe are dirty buffers that never got written out
>
> Judging by your log it's not an OOM - page table corruption got caught by
> do_wp_page(), which means that handle_mm_fault() fails (surprise, surprise),
> which kills the process.
>
> Looks like a massive memory corruption - later it fscked you in pte_alloc()
> and then it screwed buffer cache lists.

I'm reinstalling now, with a bad blocks check, to make sure random disk
crap isn't affecting things. Disk is good AFAIK. Vaguely recent ATA-33
drive. I'll switch to the other alpha to see if I can see similar
symptoms.

Unfortunately I don't know of a good alpha memory tester like
memtest86. SRM firmware tests memory ok, but that probably doesn't mean
much.

Final comment before leaving this machine. Restarting the rpm-rebuilder
(post reboot and fsck), I still see a very large number of buffs. Since
there is zero swapped, this may or may not be normal. The cache value
appears sane, FWIW.


procs memory swap io
system cpu
r b w swpd free buff cache si so bi bo in cs us
sy id
2 0 0 0 35024 13016 267256 0 0 3 491 1046 136 84
12 4
1 0 0 0 106776 14864 191984 0 0 380 0 1903 696 13
20 67
3 0 0 0 85448 16424 210216 0 0 239 11 1066 1305 58
25 17
1 0 0 0 83624 16560 210712 0 0 107 21 1068 550 54
25 21
1 0 0 0 81952 16600 211224 0 0 3 0 1030 96
93 6 0
1 0 0 0 78720 16704 212080 0 0 144 11 1045 118 79
12 9
2 0 0 0 79608 16728 212400 0 0 30 0 1037 108
86 8 5
1 0 0 0 73288 16792 212736 0 0 35 11 1039 84
91 5 4
1 0 0 0 71784 16792 212736 0 0 0 11 1038 25
99 1 0
1 0 0 0 81560 16944 214840 0 0 446 0 1049 259 70
19 11
1 0 0 0 76920 16984 215480 0 0 4 0 1031 487 75
24 1
1 0 0 0 79384 17016 215600 0 0 0 0 1029 23
94 6 0
0 1 1 0 79432 17056 215976 0 0 12 1396 1132 188 48
12 40
0 1 1 0 79384 17056 216008 0 0 11 4542 1107 14
0 2 98
1 0 0 0 74400 17128 216984 0 0 331 217 1090 269 25
10 65
2 0 0 0 70456 17136 217064 0 0 0 288 1030 119
96 4 0
1 0 0 0 72200 17152 217176 0 0 0 0 1050 96
97 3 0
1 0 0 0 67232 17152 217264 0 0 0 107 1041 93
98 2 0
1 0 0 0 70072 17160 217416 0 0 0 0 1029 91
97 3 0
1 0 0 0 68272 17168 217496 0 0 0 75 1035 122
96 4 0

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-31 15:31:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: pre6 oom killer oops


On Wed, 31 Oct 2001, Jeff Garzik wrote:
>
> 2.4.14-pre6 on alpha. three oops attached, an oops right after oom
> killer kicked in, and two hits on this BUG: kernel BUG at buffer.c:498!

This isn't the oom killer - look more closely at your oops:

do_wp_page: bogus page at address 2000002dbc0 (page 0xfffffc173ab13b00)
VM: killing process cc1(29719)

Note the "do_wp_page" message _before_ the VM decided to kill the process.
And the first oops seems to be more of the same: big corruption in the
page tables.

Looks like the oom killer kicked in under extreme memory load, but the
extreme memory load caused something else too.

Linus

2001-10-31 15:46:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: pre6 oom killer oops


On Wed, 31 Oct 2001, Jeff Garzik wrote:
>
> I'm reinstalling now, with a bad blocks check, to make sure random disk
> crap isn't affecting things. Disk is good AFAIK. Vaguely recent ATA-33
> drive. I'll switch to the other alpha to see if I can see similar
> symptoms.

Bad disks won't cause the kind of page table corruption you saw - even if
the disk is total cr*p, and you have a swap partition on it, we never
actually put metadata in swap, only the actual user pages (some other
systems will page out the page directories too, Linux won't).

To me it actually looks like something corrupted a _big_ chunk of memory.
Which is not indicative of memory trouble either (sure, random one-bit
corruption can do just about anything, including causing a big memcpy to
go to the wrong place, but ..).

One thing that does things like this is missed TLB invalidates. Especially
on SMP. If you unmap and free a page, and user space still has a TLB entry
for it, you can have the kernel re-use the page at the same time user
space writes "garbage" to it.

This is, of course, a lot easier to trigger with MMU contexts and the
like.



Something like that is especially likely, as one of the values in your
oops is

swap_free: Bad swap file entry 2e66656464747368

where that value looks suspiciously like a likely string from a compiler:

"hstddef."

So I'd look very closely at the TLB invalidate code, or possibly try to
see if we might re-use a page for some other reason too early.

Linus