2010-04-04 22:49:33

by Hans-Peter Jansen

Subject: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer

[Sorry for the cross post, but I don't know where to start to tackle this
issue]

Hi,

while attempting to move to a current kernel, I ran into an issue where a
simple du on a reasonably big xfs tree invokes the oom killer:

Apr 4 23:24:53 tyrex kernel: [ 418.913223] XFS mounting filesystem sdd1
Apr 4 23:24:54 tyrex kernel: [ 419.774606] Ending clean XFS mount for filesystem: sdd1
Apr 4 23:26:02 tyrex kernel: [ 488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
Apr 4 23:26:02 tyrex kernel: [ 488.160798] du cpuset=/ mems_allowed=0
Apr 4 23:26:02 tyrex kernel: [ 488.160800] Pid: 6397, comm: du Tainted: G W 2.6.34-rc3-13-vanilla #1
Apr 4 23:26:02 tyrex kernel: [ 488.160802] Call Trace:
Apr 4 23:26:02 tyrex kernel: [ 488.160808] [<c02becc7>] dump_header+0x67/0x1a0
Apr 4 23:26:02 tyrex kernel: [ 488.160811] [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
Apr 4 23:26:02 tyrex kernel: [ 488.160813] [<c02bee59>] oom_kill_process+0x59/0x160
Apr 4 23:26:02 tyrex kernel: [ 488.160815] [<c02bf43e>] __out_of_memory+0x4e/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160817] [<c02bf502>] out_of_memory+0x52/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160819] [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
Apr 4 23:26:02 tyrex kernel: [ 488.160822] [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
Apr 4 23:26:02 tyrex kernel: [ 488.160825] [<c02ea4a9>] cache_grow+0x249/0x2e0
Apr 4 23:26:02 tyrex kernel: [ 488.160838] [<c02ea748>] cache_alloc_refill+0x208/0x240
Apr 4 23:26:02 tyrex kernel: [ 488.160840] [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.160868] [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160888] [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160905] [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160923] [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160941] [<f861c8b2>] xfs_iget_cache_miss+0x32/0x200 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160960] [<f8630fd9>] ? xfs_perag_get+0x29/0xa0 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160977] [<f861cbac>] xfs_iget+0x12c/0x1b0 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.160997] [<f863c9c5>] xfs_lookup+0xc5/0x110 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.161017] [<f86467d1>] xfs_vn_lookup+0x51/0xa0 [xfs]
Apr 4 23:26:02 tyrex kernel: [ 488.161021] [<c02ffef7>] do_lookup+0xa7/0x1c0
Apr 4 23:26:02 tyrex kernel: [ 488.161023] [<c030089b>] link_path_walk+0x1db/0x850
Apr 4 23:26:02 tyrex kernel: [ 488.161026] [<c024a870>] ? capable+0x10/0x40
Apr 4 23:26:02 tyrex kernel: [ 488.161028] [<c038969e>] ? security_inode_permission+0x1e/0x20
Apr 4 23:26:02 tyrex kernel: [ 488.161030] [<c0301294>] path_walk+0x54/0xc0
Apr 4 23:26:02 tyrex kernel: [ 488.161032] [<c0301431>] do_path_lookup+0x51/0x90
Apr 4 23:26:02 tyrex kernel: [ 488.161034] [<c0301f8a>] user_path_at+0x4a/0x80
Apr 4 23:26:02 tyrex kernel: [ 488.161037] [<c02f9748>] ? cp_new_stat64+0xe8/0x100
Apr 4 23:26:02 tyrex kernel: [ 488.161039] [<c02f9bb5>] vfs_fstatat+0x35/0x70
Apr 4 23:26:02 tyrex kernel: [ 488.161041] [<c02f9c0d>] sys_fstatat64+0x1d/0x40
Apr 4 23:26:02 tyrex kernel: [ 488.161044] [<c0202d8c>] sysenter_do_call+0x12/0x22
Apr 4 23:26:02 tyrex kernel: [ 488.161045] Mem-Info:
Apr 4 23:26:02 tyrex kernel: [ 488.161046] DMA per-cpu:
Apr 4 23:26:02 tyrex kernel: [ 488.161047] CPU 0: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161048] CPU 1: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161050] CPU 2: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161051] CPU 3: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161052] CPU 4: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161053] CPU 5: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161054] CPU 6: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161056] CPU 7: hi: 0, btch: 1 usd: 0
Apr 4 23:26:02 tyrex kernel: [ 488.161057] Normal per-cpu:
Apr 4 23:26:02 tyrex kernel: [ 488.161058] CPU 0: hi: 186, btch: 31 usd: 173
Apr 4 23:26:02 tyrex kernel: [ 488.161059] CPU 1: hi: 186, btch: 31 usd: 18
Apr 4 23:26:02 tyrex kernel: [ 488.161060] CPU 2: hi: 186, btch: 31 usd: 41
Apr 4 23:26:02 tyrex kernel: [ 488.161061] CPU 3: hi: 186, btch: 31 usd: 197
Apr 4 23:26:02 tyrex kernel: [ 488.161062] CPU 4: hi: 186, btch: 31 usd: 175
Apr 4 23:26:02 tyrex kernel: [ 488.161063] CPU 5: hi: 186, btch: 31 usd: 157
Apr 4 23:26:02 tyrex kernel: [ 488.161065] CPU 6: hi: 186, btch: 31 usd: 182
Apr 4 23:26:02 tyrex kernel: [ 488.161066] CPU 7: hi: 186, btch: 31 usd: 173
Apr 4 23:26:02 tyrex kernel: [ 488.161067] HighMem per-cpu:
Apr 4 23:26:02 tyrex kernel: [ 488.161068] CPU 0: hi: 186, btch: 31 usd: 24
Apr 4 23:26:02 tyrex kernel: [ 488.161069] CPU 1: hi: 186, btch: 31 usd: 33
Apr 4 23:26:02 tyrex kernel: [ 488.161070] CPU 2: hi: 186, btch: 31 usd: 55
Apr 4 23:26:02 tyrex kernel: [ 488.161071] CPU 3: hi: 186, btch: 31 usd: 172
Apr 4 23:26:02 tyrex kernel: [ 488.161072] CPU 4: hi: 186, btch: 31 usd: 177
Apr 4 23:26:02 tyrex kernel: [ 488.161073] CPU 5: hi: 186, btch: 31 usd: 100
Apr 4 23:26:02 tyrex kernel: [ 488.161075] CPU 6: hi: 186, btch: 31 usd: 38
Apr 4 23:26:02 tyrex kernel: [ 488.161076] CPU 7: hi: 186, btch: 31 usd: 115
Apr 4 23:26:02 tyrex kernel: [ 488.161078] active_anon:98428 inactive_anon:24627 isolated_anon:0
Apr 4 23:26:02 tyrex kernel: [ 488.161079] active_file:4141 inactive_file:25571 isolated_file:0
Apr 4 23:26:02 tyrex kernel: [ 488.161080] unevictable:0 dirty:2 writeback:0 unstable:0
Apr 4 23:26:02 tyrex kernel: [ 488.161080] free:403317 slab_reclaimable:203863 slab_unreclaimable:4837
Apr 4 23:26:02 tyrex kernel: [ 488.161081] mapped:7418 shmem:137 pagetables:625 bounce:0
Apr 4 23:26:02 tyrex kernel: [ 488.161085]
DMA free:3520kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:6032kB
slab_unreclaimable:36kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
Apr 4 23:26:02 tyrex kernel: [ 488.161089] lowmem_reserve[]: 0 865 3023 3023
Apr 4 23:26:02 tyrex kernel: [ 488.161093]
Normal free:3580kB min:3728kB low:4660kB high:5592kB active_anon:0kB inactive_anon:0kB
active_file:360kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:809420kB
slab_unreclaimable:19312kB kernel_stack:3304kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:24 all_unreclaimable? no
Apr 4 23:26:02 tyrex kernel: [ 488.161097] lowmem_reserve[]: 0 0 17267 17267
Apr 4 23:26:02 tyrex kernel: [ 488.161101]
HighMem free:1606168kB min:512kB low:2836kB high:5164kB active_anon:393712kB inactive_anon:98508kB
active_file:16228kB inactive_file:102020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:2210252kB mlocked:0kB dirty:8kB writeback:0kB mapped:29672kB shmem:548kB slab_reclaimable:0kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:2500kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
Apr 4 23:26:02 tyrex kernel: [ 488.161105] lowmem_reserve[]: 0 0 0 0
Apr 4 23:26:02 tyrex kernel: [ 488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
Apr 4 23:26:02 tyrex kernel: [ 488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
Apr 4 23:26:02 tyrex kernel: [ 488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
Apr 4 23:26:02 tyrex kernel: [ 488.161122] 29947 total pagecache pages
Apr 4 23:26:02 tyrex kernel: [ 488.161123] 0 pages in swap cache
Apr 4 23:26:02 tyrex kernel: [ 488.161124] Swap cache stats: add 0, delete 0, find 0/0
Apr 4 23:26:02 tyrex kernel: [ 488.161125] Free swap = 2104476kB
Apr 4 23:26:02 tyrex kernel: [ 488.161126] Total swap = 2104476kB
Apr 4 23:26:02 tyrex kernel: [ 488.165523] 784224 pages RAM
Apr 4 23:26:02 tyrex kernel: [ 488.165524] 556914 pages HighMem
Apr 4 23:26:02 tyrex kernel: [ 488.165525] 12060 pages reserved
Apr 4 23:26:02 tyrex kernel: [ 488.165526] 82604 pages shared
Apr 4 23:26:02 tyrex kernel: [ 488.165527] 328045 pages non-shared
Apr 4 23:26:02 tyrex kernel: [ 488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
Apr 4 23:26:02 tyrex kernel: [ 488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
[...]

vmstat 1, when it occurs:
1 0 0 1722516 4 926532 0 0 13800 0 2598 6479 2 14 84 0
3 0 0 1735540 4 927392 0 0 10128 0 2144 5434 0 8 92 0
1 0 0 1733944 4 927868 0 0 10416 92 4465 8264 1 7 93 0
0 0 0 1733448 4 928024 0 0 9220 0 2081 3570 1 8 91 0
0 0 0 1734160 4 927412 0 0 7404 0 1912 3014 0 6 90 4
1 0 0 1733072 4 927208 0 0 9512 0 2227 3591 1 8 89 2
1 1 0 1728356 4 928256 0 0 4620 0 1334 2087 0 2 92 5
2 1 0 1926208 4 928832 0 0 3848 712 1869 3515 0 5 84 11
1 0 0 2181520 4 849064 0 0 2736 3983 3154 308011 2 14 71 13
0 0 0 2181028 4 847560 0 0 140 109 300 484 2 0 98 0
4 0 0 2561212 4 467272 0 0 4 0 383 233 0 6 94 0
0 0 0 2995016 4 33560 0 0 44 0 421 238 0 9 91 0
0 0 0 2995024 4 33584 0 0 120 0 187 251 0 0 100 0
0 0 0 2994932 4 33636 0 0 0 0 34 63 0 0 100 0
1 0 0 2995212 4 33548 0 0 8 0 72 94 0 0 100 0
0 0 0 2995368 4 33528 0 0 0 0 39 73 0 0 100 0
0 0 0 2995384 4 33348 0 0 0 0 96 116 0 0 100 0
0 0 0 2995556 4 33256 0 0 40 0 139 184 0 0 100 0
0 0 0 2995800 4 33248 0 0 68 0 178 193 0 0 100 0

The oom killer was triggered around the line showing 308011 context switches.
Needless to say, this event renders the system dysfunctional.
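The spike can be picked out of a longer vmstat capture mechanically; here is a minimal sketch (mine, not part of the original report), assuming the default vmstat column order in which "cs" is the 12th field:

```python
# Scan vmstat-style lines and flag context-switch spikes.
# Assumes default vmstat columns: r b swpd free buff cache si so bi bo in cs us sy id wa
def cs_spikes(lines, threshold=100_000):
    spikes = []
    for line in lines:
        cols = line.split()
        cs = int(cols[11])      # "cs" column of default vmstat output
        if cs > threshold:
            spikes.append(cs)
    return spikes

sample = [
    "1 0 0 1733072 4 927208 0 0 9512 0 2227 3591 1 8 89 2",
    "1 0 0 2181520 4 849064 0 0 2736 3983 3154 308011 2 14 71 13",
    "0 0 0 2181028 4 847560 0 0 140 109 300 484 2 0 98 0",
]
print(cs_spikes(sample))   # flags the line on which the oom killer fired
```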

Sorry for the insanely long lines, but wrapping them to a sane width results
in an unparseable mess. This kernel is the closest to upstream that I was able
to produce; it was built in the Novell build service and is available here:
http://download.opensuse.org/repositories/home:/frispete:/kernel-head/openSUSE_11.1/
As you can see, userspace is openSUSE 11.1 and the arch is i586; it is a vanilla
build of rc3 with Linus' git tree applied up to 5e11611a5d22252f3f9c169a3c9377eac0c32033.
I have also reproduced this issue with 2.6.33.1. Usually I run the openSUSE
customized kernels; e.g. kernel-pae-2.6.27.45-0.1.1 does not show this issue.

boot.msg: ftp://urpla.net/boot.msg
config: ftp://urpla.net/vanilla

I do use some unusual vm settings, as documented in boot.msg, but they should
not cause this issue, should they?

Hardware: Xeon X3460, Asus P7F-E motherboard, Intel 3420 PCH chipset, 12 GB Mem

In case you wonder why I ran this test with a 4 GB memory setup: that was
the easiest way to demonstrate the problem without resorting to tainted and
heavily customized kernels. And for compatibility reasons (customer support),
I have to stay on i586 for the moment.

Of course, the du command succeeds with the distribution kernel with exactly
the same (userspace and sysctl) settings:
1,1T /backup/daily.0

Anybody with an idea what's going on here?

TIA,
Pete


2010-04-05 00:49:28

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer

On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> [Sorry for the cross post, but I don't know where to start to tackle this
> issue]
>
> Hi,
>
> on an attempt to get to a current kernel, I suffer from an issue, where a
> simple du on a reasonably big xfs tree leads to invoking the oom killer:

How big is the directory tree (how many inodes, etc)?

> Apr 4 23:24:53 tyrex kernel: [ 418.913223] XFS mounting filesystem sdd1
> Apr 4 23:24:54 tyrex kernel: [ 419.774606] Ending clean XFS mount for filesystem: sdd1
> Apr 4 23:26:02 tyrex kernel: [ 488.160795] du invoked oom-killer: gfp_mask=0x802d0, order=0, oom_adj=0
> Apr 4 23:26:02 tyrex kernel: [ 488.160798] du cpuset=/ mems_allowed=0
> Apr 4 23:26:02 tyrex kernel: [ 488.160800] Pid: 6397, comm: du Tainted: G W 2.6.34-rc3-13-vanilla #1
> Apr 4 23:26:02 tyrex kernel: [ 488.160802] Call Trace:
> Apr 4 23:26:02 tyrex kernel: [ 488.160808] [<c02becc7>] dump_header+0x67/0x1a0
> Apr 4 23:26:02 tyrex kernel: [ 488.160811] [<c03cf1a7>] ? ___ratelimit+0x77/0xe0
> Apr 4 23:26:02 tyrex kernel: [ 488.160813] [<c02bee59>] oom_kill_process+0x59/0x160
> Apr 4 23:26:02 tyrex kernel: [ 488.160815] [<c02bf43e>] __out_of_memory+0x4e/0xc0
> Apr 4 23:26:02 tyrex kernel: [ 488.160817] [<c02bf502>] out_of_memory+0x52/0xc0
> Apr 4 23:26:02 tyrex kernel: [ 488.160819] [<c02c20f4>] __alloc_pages_slowpath+0x444/0x4c0
> Apr 4 23:26:02 tyrex kernel: [ 488.160822] [<c02c22c2>] __alloc_pages_nodemask+0x152/0x160
> Apr 4 23:26:02 tyrex kernel: [ 488.160825] [<c02ea4a9>] cache_grow+0x249/0x2e0
> Apr 4 23:26:02 tyrex kernel: [ 488.160838] [<c02ea748>] cache_alloc_refill+0x208/0x240
> Apr 4 23:26:02 tyrex kernel: [ 488.160840] [<c02eab19>] kmem_cache_alloc+0xb9/0xc0
> Apr 4 23:26:02 tyrex kernel: [ 488.160868] [<f86375dd>] ? xfs_trans_brelse+0xfd/0x150 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160888] [<f863d547>] kmem_zone_alloc+0x77/0xb0 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160905] [<f860a043>] ? xfs_da_state_free+0x53/0x60 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160923] [<f861c796>] xfs_inode_alloc+0x26/0x110 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160941] [<f861c8b2>] xfs_iget_cache_miss+0x32/0x200 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160960] [<f8630fd9>] ? xfs_perag_get+0x29/0xa0 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160977] [<f861cbac>] xfs_iget+0x12c/0x1b0 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.160997] [<f863c9c5>] xfs_lookup+0xc5/0x110 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.161017] [<f86467d1>] xfs_vn_lookup+0x51/0xa0 [xfs]
> Apr 4 23:26:02 tyrex kernel: [ 488.161021] [<c02ffef7>] do_lookup+0xa7/0x1c0
> Apr 4 23:26:02 tyrex kernel: [ 488.161023] [<c030089b>] link_path_walk+0x1db/0x850
> Apr 4 23:26:02 tyrex kernel: [ 488.161026] [<c024a870>] ? capable+0x10/0x40
> Apr 4 23:26:02 tyrex kernel: [ 488.161028] [<c038969e>] ? security_inode_permission+0x1e/0x20
> Apr 4 23:26:02 tyrex kernel: [ 488.161030] [<c0301294>] path_walk+0x54/0xc0
> Apr 4 23:26:02 tyrex kernel: [ 488.161032] [<c0301431>] do_path_lookup+0x51/0x90
> Apr 4 23:26:02 tyrex kernel: [ 488.161034] [<c0301f8a>] user_path_at+0x4a/0x80
> Apr 4 23:26:02 tyrex kernel: [ 488.161037] [<c02f9748>] ? cp_new_stat64+0xe8/0x100
> Apr 4 23:26:02 tyrex kernel: [ 488.161039] [<c02f9bb5>] vfs_fstatat+0x35/0x70
> Apr 4 23:26:02 tyrex kernel: [ 488.161041] [<c02f9c0d>] sys_fstatat64+0x1d/0x40
> Apr 4 23:26:02 tyrex kernel: [ 488.161044] [<c0202d8c>] sysenter_do_call+0x12/0x22
> Apr 4 23:26:02 tyrex kernel: [ 488.161045] Mem-Info:
> Apr 4 23:26:02 tyrex kernel: [ 488.161046] DMA per-cpu:
> Apr 4 23:26:02 tyrex kernel: [ 488.161047] CPU 0: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161048] CPU 1: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161050] CPU 2: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161051] CPU 3: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161052] CPU 4: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161053] CPU 5: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161054] CPU 6: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161056] CPU 7: hi: 0, btch: 1 usd: 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161057] Normal per-cpu:
> Apr 4 23:26:02 tyrex kernel: [ 488.161058] CPU 0: hi: 186, btch: 31 usd: 173
> Apr 4 23:26:02 tyrex kernel: [ 488.161059] CPU 1: hi: 186, btch: 31 usd: 18
> Apr 4 23:26:02 tyrex kernel: [ 488.161060] CPU 2: hi: 186, btch: 31 usd: 41
> Apr 4 23:26:02 tyrex kernel: [ 488.161061] CPU 3: hi: 186, btch: 31 usd: 197
> Apr 4 23:26:02 tyrex kernel: [ 488.161062] CPU 4: hi: 186, btch: 31 usd: 175
> Apr 4 23:26:02 tyrex kernel: [ 488.161063] CPU 5: hi: 186, btch: 31 usd: 157
> Apr 4 23:26:02 tyrex kernel: [ 488.161065] CPU 6: hi: 186, btch: 31 usd: 182
> Apr 4 23:26:02 tyrex kernel: [ 488.161066] CPU 7: hi: 186, btch: 31 usd: 173
> Apr 4 23:26:02 tyrex kernel: [ 488.161067] HighMem per-cpu:
> Apr 4 23:26:02 tyrex kernel: [ 488.161068] CPU 0: hi: 186, btch: 31 usd: 24
> Apr 4 23:26:02 tyrex kernel: [ 488.161069] CPU 1: hi: 186, btch: 31 usd: 33
> Apr 4 23:26:02 tyrex kernel: [ 488.161070] CPU 2: hi: 186, btch: 31 usd: 55
> Apr 4 23:26:02 tyrex kernel: [ 488.161071] CPU 3: hi: 186, btch: 31 usd: 172
> Apr 4 23:26:02 tyrex kernel: [ 488.161072] CPU 4: hi: 186, btch: 31 usd: 177
> Apr 4 23:26:02 tyrex kernel: [ 488.161073] CPU 5: hi: 186, btch: 31 usd: 100
> Apr 4 23:26:02 tyrex kernel: [ 488.161075] CPU 6: hi: 186, btch: 31 usd: 38
> Apr 4 23:26:02 tyrex kernel: [ 488.161076] CPU 7: hi: 186, btch: 31 usd: 115
> Apr 4 23:26:02 tyrex kernel: [ 488.161078] active_anon:98428 inactive_anon:24627 isolated_anon:0
> Apr 4 23:26:02 tyrex kernel: [ 488.161079] active_file:4141 inactive_file:25571 isolated_file:0
> Apr 4 23:26:02 tyrex kernel: [ 488.161080] unevictable:0 dirty:2 writeback:0 unstable:0
> Apr 4 23:26:02 tyrex kernel: [ 488.161080] free:403317 slab_reclaimable:203863 slab_unreclaimable:4837
> Apr 4 23:26:02 tyrex kernel: [ 488.161081] mapped:7418 shmem:137 pagetables:625 bounce:0
> Apr 4 23:26:02 tyrex kernel: [ 488.161085]
> DMA free:3520kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:6032kB
> slab_unreclaimable:36kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? yes
> Apr 4 23:26:02 tyrex kernel: [ 488.161089] lowmem_reserve[]: 0 865 3023 3023
> Apr 4 23:26:02 tyrex kernel: [ 488.161093]
> Normal free:3580kB min:3728kB low:4660kB high:5592kB active_anon:0kB inactive_anon:0kB
> active_file:360kB inactive_file:264kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:809420kB
> slab_unreclaimable:19312kB kernel_stack:3304kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:24 all_unreclaimable? no
> Apr 4 23:26:02 tyrex kernel: [ 488.161097] lowmem_reserve[]: 0 0 17267 17267
> Apr 4 23:26:02 tyrex kernel: [ 488.161101]
> HighMem free:1606168kB min:512kB low:2836kB high:5164kB active_anon:393712kB inactive_anon:98508kB
> active_file:16228kB inactive_file:102020kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:2210252kB mlocked:0kB dirty:8kB writeback:0kB mapped:29672kB shmem:548kB slab_reclaimable:0kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:2500kB unstable:0kB bounce:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? no
> Apr 4 23:26:02 tyrex kernel: [ 488.161105] lowmem_reserve[]: 0 0 0 0
> Apr 4 23:26:02 tyrex kernel: [ 488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> Apr 4 23:26:02 tyrex kernel: [ 488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> Apr 4 23:26:02 tyrex kernel: [ 488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> Apr 4 23:26:02 tyrex kernel: [ 488.161122] 29947 total pagecache pages
> Apr 4 23:26:02 tyrex kernel: [ 488.161123] 0 pages in swap cache
> Apr 4 23:26:02 tyrex kernel: [ 488.161124] Swap cache stats: add 0, delete 0, find 0/0
> Apr 4 23:26:02 tyrex kernel: [ 488.161125] Free swap = 2104476kB
> Apr 4 23:26:02 tyrex kernel: [ 488.161126] Total swap = 2104476kB
> Apr 4 23:26:02 tyrex kernel: [ 488.165523] 784224 pages RAM
> Apr 4 23:26:02 tyrex kernel: [ 488.165524] 556914 pages HighMem
> Apr 4 23:26:02 tyrex kernel: [ 488.165525] 12060 pages reserved
> Apr 4 23:26:02 tyrex kernel: [ 488.165526] 82604 pages shared
> Apr 4 23:26:02 tyrex kernel: [ 488.165527] 328045 pages non-shared
> Apr 4 23:26:02 tyrex kernel: [ 488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> Apr 4 23:26:02 tyrex kernel: [ 488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> [...]

Oh, this is a highmem box. You ran out of low memory, I think, which
is where all the inodes are cached. Seems like a VM problem or a
highmem/lowmem split config problem to me, not anything to do with
XFS...

Cheers,

Dave.
--
Dave Chinner
[email protected]
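[Dave's diagnosis can be sanity-checked against the zone numbers in the first OOM report: the Normal (lowmem) zone is almost entirely filled with reclaimable slab, and its free memory sits below the min watermark. A quick back-of-envelope check, with the figures copied from the report above:]

```python
# Normal (lowmem) zone figures from the first OOM report, in kB.
normal_present = 885944        # total size of the Normal zone
slab_reclaimable = 809420      # reclaimable slab (mostly inode caches) in that zone
normal_free = 3580             # free memory left in the zone
normal_min = 3728              # min watermark

slab_fraction = slab_reclaimable / normal_present
below_watermark = normal_free < normal_min
print(f"reclaimable slab fills {slab_fraction:.0%} of lowmem")
print("free below min watermark:", below_watermark)
```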

2010-04-05 11:36:01

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer

On Monday 05 April 2010, 02:49:06 Dave Chinner wrote:
> On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> > [Sorry for the cross post, but I don't know where to start to tackle this
> > issue]
> >
> > Hi,
> >
> > on an attempt to get to a current kernel, I suffer from an issue, where a
> > simple du on a reasonably big xfs tree leads to invoking the oom killer:
>
> How big is the directory tree (how many inodes, etc)?

It's a 1.1 TB system backup tree, so let's say: many..
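[One quick way to put a number on the inode count is a directory walk; a sketch, counting directory entries, which approximates the number of inodes du touches when hard links are rare:]

```python
import os

def count_entries(root):
    """Count files and directories under root -- roughly, the inodes du stats."""
    total = 0
    for _, dirs, files in os.walk(root):
        total += len(dirs) + len(files)
    return total

# e.g. count_entries("/backup/daily.0")
```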

> > Apr 4 23:26:02 tyrex kernel: [ 488.161105] lowmem_reserve[]: 0 0 0 0
> > Apr 4 23:26:02 tyrex kernel: [ 488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> > Apr 4 23:26:02 tyrex kernel: [ 488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> > Apr 4 23:26:02 tyrex kernel: [ 488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> > Apr 4 23:26:02 tyrex kernel: [ 488.161122] 29947 total pagecache pages
> > Apr 4 23:26:02 tyrex kernel: [ 488.161123] 0 pages in swap cache
> > Apr 4 23:26:02 tyrex kernel: [ 488.161124] Swap cache stats: add 0, delete 0, find 0/0
> > Apr 4 23:26:02 tyrex kernel: [ 488.161125] Free swap = 2104476kB
> > Apr 4 23:26:02 tyrex kernel: [ 488.161126] Total swap = 2104476kB
> > Apr 4 23:26:02 tyrex kernel: [ 488.165523] 784224 pages RAM
> > Apr 4 23:26:02 tyrex kernel: [ 488.165524] 556914 pages HighMem
> > Apr 4 23:26:02 tyrex kernel: [ 488.165525] 12060 pages reserved
> > Apr 4 23:26:02 tyrex kernel: [ 488.165526] 82604 pages shared
> > Apr 4 23:26:02 tyrex kernel: [ 488.165527] 328045 pages non-shared
> > Apr 4 23:26:02 tyrex kernel: [ 488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> > Apr 4 23:26:02 tyrex kernel: [ 488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> > [...]
>
> Oh, this is a highmem box. You ran out of low memory, I think, which
> is where all the inodes are cached. Seems like a VM problem or a
> highmem/lowmem split config problem to me, not anything to do with
> XFS...

Might be; I don't have a chance to test this on a different FS. Thanks
for the answer anyway, Dave. I hope you don't mind that I keep you
copied on this thread..

The trouble is, I cannot locate the problem from the syslog output. Might
be a "can't see the forest for the trees" syndrome.

Today I repeated the test with 2.6.34-rc3 as a pae build with openSUSE
patches applied, and with vm.swappiness, vm.dirty_ratio and vm.dirty_background_ratio
reset to their kernel defaults.

It behaves exactly the same, so it looks like a generic problem. I ran du -sh
on the huge tree again; this time gkrellmd triggered the oom killer, while the
du process kept going.

Apr 5 13:09:20 tyrex kernel: [ 1747.524375] XFS mounting filesystem sdd1
Apr 5 13:09:21 tyrex kernel: [ 1747.942048] Ending clean XFS mount for filesystem: sdd1
Apr 5 13:10:27 tyrex kernel: [ 1814.288944] oom_kill_process: 3 callbacks suppressed
Apr 5 13:10:27 tyrex kernel: [ 1814.288946] gkrellmd invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
Apr 5 13:10:27 tyrex kernel: [ 1814.288948] gkrellmd cpuset=/ mems_allowed=0
Apr 5 13:10:27 tyrex kernel: [ 1814.288950] Pid: 4019, comm: gkrellmd Not tainted 2.6.34-rc3-13-pae #1
Apr 5 13:10:27 tyrex kernel: [ 1814.288951] Call Trace:
Apr 5 13:10:27 tyrex kernel: [ 1814.288959] [<c0206181>] try_stack_unwind+0x1b1/0x200
Apr 5 13:10:27 tyrex kernel: [ 1814.288962] [<c020507f>] dump_trace+0x3f/0xe0
Apr 5 13:10:27 tyrex kernel: [ 1814.288965] [<c0205cfb>] show_trace_log_lvl+0x4b/0x60
Apr 5 13:10:27 tyrex kernel: [ 1814.288967] [<c0205d28>] show_trace+0x18/0x20
Apr 5 13:10:27 tyrex kernel: [ 1814.288970] [<c05ec570>] dump_stack+0x6d/0x7d
Apr 5 13:10:27 tyrex kernel: [ 1814.288974] [<c02c758a>] dump_header+0x6a/0x1b0
Apr 5 13:10:27 tyrex kernel: [ 1814.288976] [<c02c772a>] oom_kill_process+0x5a/0x160
Apr 5 13:10:27 tyrex kernel: [ 1814.288979] [<c02c7bc6>] __out_of_memory+0x56/0xc0
Apr 5 13:10:27 tyrex kernel: [ 1814.288981] [<c02c7ca7>] out_of_memory+0x77/0x1b0
Apr 5 13:10:27 tyrex kernel: [ 1814.288984] [<c02cb07c>] __alloc_pages_slowpath+0x47c/0x4c0
Apr 5 13:10:27 tyrex kernel: [ 1814.288986] [<c02cb214>] __alloc_pages_nodemask+0x154/0x160
Apr 5 13:10:27 tyrex kernel: [ 1814.288989] [<c02f67a5>] alloc_pages_current+0x65/0xc0
Apr 5 13:10:27 tyrex kernel: [ 1814.288993] [<c055ca73>] tcp_sendmsg+0x893/0x990
Apr 5 13:10:27 tyrex kernel: [ 1814.288997] [<c05175f6>] sock_sendmsg+0xc6/0x100
Apr 5 13:10:27 tyrex kernel: [ 1814.289000] [<c0517794>] sys_sendto+0x114/0x160
Apr 5 13:10:27 tyrex kernel: [ 1814.289002] [<c0517817>] sys_send+0x37/0x40
Apr 5 13:10:27 tyrex kernel: [ 1814.289005] [<c05187de>] sys_socketcall+0x22e/0x2f0
Apr 5 13:10:27 tyrex kernel: [ 1814.289007] [<c0202e4c>] sysenter_do_call+0x12/0x22
Apr 5 13:10:27 tyrex kernel: [ 1814.289022] [<ffffe424>] 0xffffe424
Apr 5 13:10:27 tyrex kernel: [ 1814.289023] Mem-Info:
Apr 5 13:10:27 tyrex kernel: [ 1814.289024] Node 0 DMA per-cpu:
Apr 5 13:10:27 tyrex kernel: [ 1814.289025] CPU 0: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289026] CPU 1: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289027] CPU 2: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289028] CPU 3: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289030] CPU 4: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289031] CPU 5: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289032] CPU 6: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289033] CPU 7: hi: 0, btch: 1 usd: 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289033] Node 0 Normal per-cpu:
Apr 5 13:10:27 tyrex kernel: [ 1814.289035] CPU 0: hi: 186, btch: 31 usd: 183
Apr 5 13:10:27 tyrex kernel: [ 1814.289036] CPU 1: hi: 186, btch: 31 usd: 126
Apr 5 13:10:27 tyrex kernel: [ 1814.289037] CPU 2: hi: 186, btch: 31 usd: 50
Apr 5 13:10:27 tyrex kernel: [ 1814.289038] CPU 3: hi: 186, btch: 31 usd: 178
Apr 5 13:10:27 tyrex kernel: [ 1814.289039] CPU 4: hi: 186, btch: 31 usd: 158
Apr 5 13:10:27 tyrex kernel: [ 1814.289040] CPU 5: hi: 186, btch: 31 usd: 181
Apr 5 13:10:27 tyrex kernel: [ 1814.289041] CPU 6: hi: 186, btch: 31 usd: 32
Apr 5 13:10:27 tyrex kernel: [ 1814.289042] CPU 7: hi: 186, btch: 31 usd: 164
Apr 5 13:10:27 tyrex kernel: [ 1814.289043] Node 0 HighMem per-cpu:
Apr 5 13:10:27 tyrex kernel: [ 1814.289044] CPU 0: hi: 186, btch: 31 usd: 153
Apr 5 13:10:27 tyrex kernel: [ 1814.289045] CPU 1: hi: 186, btch: 31 usd: 148
Apr 5 13:10:27 tyrex kernel: [ 1814.289046] CPU 2: hi: 186, btch: 31 usd: 119
Apr 5 13:10:27 tyrex kernel: [ 1814.289047] CPU 3: hi: 186, btch: 31 usd: 173
Apr 5 13:10:27 tyrex kernel: [ 1814.289048] CPU 4: hi: 186, btch: 31 usd: 46
Apr 5 13:10:27 tyrex kernel: [ 1814.289049] CPU 5: hi: 186, btch: 31 usd: 167
Apr 5 13:10:27 tyrex kernel: [ 1814.289050] CPU 6: hi: 186, btch: 31 usd: 57
Apr 5 13:10:27 tyrex kernel: [ 1814.289051] CPU 7: hi: 186, btch: 31 usd: 52
Apr 5 13:10:27 tyrex kernel: [ 1814.289054] active_anon:145037 inactive_anon:14560 isolated_anon:0
Apr 5 13:10:27 tyrex kernel: [ 1814.289055] active_file:4510 inactive_file:30778 isolated_file:0
Apr 5 13:10:27 tyrex kernel: [ 1814.289055] unevictable:0 dirty:3 writeback:0 unstable:0
Apr 5 13:10:27 tyrex kernel: [ 1814.289056] free:2694178 slab_reclaimable:183482 slab_unreclaimable:4574
Apr 5 13:10:27 tyrex kernel: [ 1814.289057] mapped:9362 shmem:137 pagetables:994 bounce:0
Apr 5 13:10:27 tyrex kernel: [ 1814.289058] Node 0 DMA free:3488kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:20kB inactive_file:36
kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4416kB slab_unre
claimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 5 13:10:27 tyrex kernel: [ 1814.289064] lowmem_reserve[]: 0 859 12054 12054
Apr 5 13:10:27 tyrex kernel: [ 1814.289066] Node 0 Normal free:3644kB min:3716kB low:4644kB high:5572kB active_anon:0kB inactive_anon:0kB active_file:848kB inacti
ve_file:928kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:879848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7295
12kB slab_unreclaimable:18280kB kernel_stack:3320kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 5 13:10:27 tyrex kernel: [ 1814.289072] lowmem_reserve[]: 0 0 89561 89561
Apr 5 13:10:27 tyrex kernel: [ 1814.289074] Node 0 HighMem free:10769580kB min:512kB low:12620kB high:24728kB active_anon:580148kB inactive_anon:58240kB active_fi
le:17172kB inactive_file:122148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:11463844kB mlocked:0kB dirty:12kB writeback:0kB mapped:37448kB shme
m:548kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:3976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:96 all_unreclaimable
? no
Apr 5 13:10:27 tyrex kernel: [ 1814.289080] lowmem_reserve[]: 0 0 0 0
Apr 5 13:10:27 tyrex kernel: [ 1814.289081] Node 0 DMA: 0*4kB 2*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3488kB
Apr 5 13:10:27 tyrex kernel: [ 1814.289087] Node 0 Normal: 430*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3768kB
Apr 5 13:10:27 tyrex kernel: [ 1814.289092] Node 0 HighMem: 213*4kB 619*8kB 374*16kB 15*32kB 357*64kB 413*128kB 277*256kB 80*512kB 46*1024kB 22*2048kB 2558*4096kB
= 10769580kB
Apr 5 13:10:27 tyrex kernel: [ 1814.289097] 35438 total pagecache pages
Apr 5 13:10:27 tyrex kernel: [ 1814.289098] 0 pages in swap cache
Apr 5 13:10:27 tyrex kernel: [ 1814.289099] Swap cache stats: add 0, delete 0, find 0/0
Apr 5 13:10:27 tyrex kernel: [ 1814.289100] Free swap = 2104476kB
Apr 5 13:10:27 tyrex kernel: [ 1814.289101] Total swap = 2104476kB
Apr 5 13:10:27 tyrex kernel: [ 1814.317335] 3380720 pages RAM
Apr 5 13:10:27 tyrex kernel: [ 1814.317337] 3154946 pages HighMem
Apr 5 13:10:27 tyrex kernel: [ 1814.317338] 293990 pages reserved
Apr 5 13:10:27 tyrex kernel: [ 1814.317338] 65372 pages shared
Apr 5 13:10:27 tyrex kernel: [ 1814.317339] 359255 pages non-shared

0 0 0 10942040 4 776696 0 0 17512 4 2897 5178 1 7 92 0
0 0 0 10878436 4 833952 0 0 13396 0 2330 4329 1 5 94 0
0 0 0 10834376 4 872328 0 0 15072 32 2703 5578 0 7 92 0
2 0 0 10834000 4 874896 0 0 11228 0 2331 5601 0 4 95 0
0 0 0 10818628 4 875192 0 0 3172 10 773 1588 0 2 98 0
1 0 0 10819240 4 874452 0 0 5428 69 1395 3091 0 4 96 0
0 0 0 10818608 4 873360 0 0 17728 0 3056 10587 1 12 82 5
1 0 0 10831032 4 873860 0 0 7732 0 1954 5411 0 4 93 3
0 0 0 10825456 4 874208 0 0 2884 0 844 1907 1 2 98 0
0 0 0 10831656 4 873720 0 0 11968 0 2228 6897 0 5 95 0
0 0 0 10830760 4 873744 0 0 6424 8 1549 4021 0 3 94 3
2 0 0 10829524 4 873944 0 0 10160 81 2129 5325 0 4 96 0
0 0 0 10798644 4 874248 0 0 8112 16 2161 4348 1 3 96 0
0 0 0 10781788 4 874688 0 0 5340 0 1823 2924 1 2 97 0
0 2 0 10776580 4 875980 0 0 5964 0 1754 2722 1 4 93 1
0 1 0 10776848 4 875440 0 0 5152 8 1543 2313 0 3 87 11
1 1 0 10776704 4 875596 0 0 2416 0 1011 1632 0 5 79 16
3 3 0 10776712 4 875676 0 0 0 4 749 867 0 11 75 14
2 1 0 10879116 4 867544 0 0 7608 2138 3043 154270 7 13 77 3
2 1 0 10874356 4 870636 0 0 2476 0 1204 711732 0 14 79 7
3 1 0 10874248 4 870668 0 0 4 0 1078 1066604 0 14 71 14
2 1 0 10874040 4 870668 0 0 0 0 1055 1067483 0 11 79 11
2 1 0 10873792 4 870660 0 0 104 0 1137 1063509 0 13 73 13
2 1 0 10873608 4 870444 0 0 324 0 1335 1055994 0 12 76 12
3 0 0 11323756 4 421084 0 0 1768 0 1709 424405 0 16 76 7
1 0 0 11558484 4 186672 0 0 10936 121 2212 3308 1 6 93 0
1 0 0 11501320 4 243908 0 0 13020 0 2281 4205 0 6 93 0
1 0 0 11413776 4 330896 0 0 19872 0 3492 6304 1 6 93 0
0 0 0 11328588 4 415920 0 0 19568 0 3482 6139 1 8 91 0
1 0 0 11299456 4 475428 0 0 15056 0 3054 5242 1 5 94 0

The OOM killer was invoked around the vmstat lines showing more than 1,000,000 context switches (the cs column).

It's hard to believe that a current kernel on a current system with 12 GB,
even when using the insane PAE on i586, is not able to cope with a du on a
1.1 TB file tree. Since du is invokable by ordinary users, this creates a
pretty ugly DoS vector for local users.

Updated boot log: ftp://urpla.net/boot-pae.msg

Pete

2010-04-05 23:06:26

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer

On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> On Monday 05 April 2010, 02:49:06 Dave Chinner wrote:
> > On Mon, Apr 05, 2010 at 12:49:17AM +0200, Hans-Peter Jansen wrote:
> > > [Sorry for the cross post, but I don't know where to start to tackle this
> > > issue]
> > >
> > > Hi,
> > >
> > > on an attempt to get to a current kernel, I suffer from an issue, where a
> > > simple du on a reasonably big xfs tree leads to invoking the oom killer:
> >
> > How big is the directory tree (how many inodes, etc)?
>
> It's 1.1 TB system backup tree, let's say: many..

1.1TB isn't big anymore. ;)

> > > Apr 4 23:26:02 tyrex kernel: [ 488.161105] lowmem_reserve[]: 0 0 0 0
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161107] DMA: 18*4kB 53*8kB 31*16kB 20*32kB 14*64kB 8*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3552kB
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161112] Normal: 32*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3704kB
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161117] HighMem: 17*4kB 29*8kB 47*16kB 16*32kB 6*64kB 30*128kB 53*256kB 27*512kB 14*1024kB 7*2048kB 377*4096kB = 1606044kB
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161122] 29947 total pagecache pages
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161123] 0 pages in swap cache
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161124] Swap cache stats: add 0, delete 0, find 0/0
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161125] Free swap = 2104476kB
> > > Apr 4 23:26:02 tyrex kernel: [ 488.161126] Total swap = 2104476kB
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165523] 784224 pages RAM
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165524] 556914 pages HighMem
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165525] 12060 pages reserved
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165526] 82604 pages shared
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165527] 328045 pages non-shared
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165529] Out of memory: kill process 4788 (mysqld-max) score 326208 or a child
> > > Apr 4 23:26:02 tyrex kernel: [ 488.165531] Killed process 4788 (mysqld-max) vsz:1304832kB, anon-rss:121428kB, file-rss:4336kB
> > > [...]
> >
> > Oh, this is a highmem box. You ran out of low memory, I think, which
> > is where all the inodes are cached. Seems like a VM problem or a
> > highmem/lowmem split config problem to me, not anything to do with
> > XFS...
>
> Might be, I don't have a chance to test this on a different FS. Thanks
> for the answer anyway, Dave. I hope, you don't mind, that I keep you
> copied on this thread..
>
> This matter is, I cannot locate the problem from the syslog output. Might
> be a "can't see the forest because all the trees" syndrome.

Well, I have to ask why you are running a 32bit PAE kernel when your
CPU is:

<6>[ 0.085062] CPU0: Intel(R) Xeon(R) CPU X3460 @ 2.80GHz stepping 05

64bit capable. Use a 64 bit kernel and this problem should go away.

> It's hard to believe, that a current kernel on a current system with 12 GB,
> even if using the insane pae on i586 is not able to cope with an du on a
> 1.1 TB file tree. Since du is invokable by users, this creates a pretty
> ugly DOS attack for local users.

Agreed. And FWIW, don't let your filesystems get near ENOSPC on
2.6.34-rc, either....

(i.e. under sustained write load, 2.6.34-rc will hit the OOM killer
on page cache allocation before the filesystem can report ENOSPC to
the user application. Test 224 in the xfsqa suite on a VM w/ 1GB
RAM will trigger this with > 90% reliability....)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-06 14:53:15

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

Hi Dave,

On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > >
> > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > is where all the inodes are cached. Seems like a VM problem or a
> > > highmem/lowmem split config problem to me, not anything to do with
> > > XFS...

With all due respect, I disagree. See below.

> > Might be, I don't have a chance to test this on a different FS. Thanks
> > for the answer anyway, Dave. I hope, you don't mind, that I keep you
> > copied on this thread..
> >
> > This matter is, I cannot locate the problem from the syslog output.
> > Might be a "can't see the forest because all the trees" syndrome.
>
> Well, I have to ask why you are running a 32bit PAE kernel when your
> CPU is:
>
> <6>[ 0.085062] CPU0: Intel(R) Xeon(R) CPU X3460 @ 2.80GHz
> stepping 05
>
> 64bit capable. Use a 64 bit kernel and this problem should go away.

Sure, but for compatibility reasons with a customer setup that I'm fully
responsible for and that we strongly depend on, it is still i586. (And it's
a system that I have full access to only for a few hours on Sundays, which
punishes my family..).

Dave, I really don't want to disappoint you, but a lengthy bisection session
points to:

57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
commit 57817c68229984818fea9e614d6f95249c3fb098
Author: Dave Chinner <[email protected]>
Date: Sun Jan 10 23:51:47 2010 +0000

xfs: reclaim all inodes by background tree walks

We cannot do direct inode reclaim without taking the flush lock to
ensure that we do not reclaim an inode under IO. We check the inode
is clean before doing direct reclaim, but this is not good enough
because the inode flush code marks the inode clean once it has
copied the in-core dirty state to the backing buffer.

It is the flush lock that determines whether the inode is still
under IO, even though it is marked clean, and the inode is still
required at IO completion so we can't reclaim it even though it is
clean in core. Hence the requirement that we need to take the flush
lock even on clean inodes because this guarantees that the inode
writeback IO has completed and it is safe to reclaim the inode.

With delayed write inode flushing, we could end up waiting a long
time on the flush lock even for a clean inode. The background
reclaim already handles this efficiently, so avoid all the problems
by killing the direct reclaim path altogether.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Alex Elder <[email protected]>

:040000 040000 9cada5739037ecd59afb358cf5ed6186b82d5236 8e6b6febccba69bc4cdbfd1886d545c369d64c41 M fs

I will try to prove this by reverting this commit on a 2.6.33.2 build, but
that's going to take another day or so.

> > It's hard to believe, that a current kernel on a current system with 12
> > GB, even if using the insane pae on i586 is not able to cope with an du
> > on a 1.1 TB file tree. Since du is invokable by users, this creates a
> > pretty ugly DOS attack for local users.
>
> Agreed. And FWIW, don't let your filesystems get near ENOSPC on
> 2.6.34-rc, either....
>
> (i.e. under sustained write load, 2.6.34-rc will hit the OOM killer
> on page cache allocation before the filesystem can report ENOSPC to
> the user application. Test 224 in the xfsqa suite on a VM w/ 1GB
> RAM will trigger this with > 90% reliability....)

Hmm, thanks for the warning. Will resort to 2.6.33.2 for now on my servers
and keep an eye on the xfs commit logs...

Cheers && greetings to the orbit ;-),
Pete

For the sake of completeness, here's the revert:

---
commit dfe0d292280ad21c9cf3f240bb415913715d8980
Author: Hans-Peter Jansen <[email protected]>
Date: Tue Apr 6 16:05:47 2010 +0200

Revert "xfs: reclaim all inodes by background tree walks"

This reverts commit 57817c68229984818fea9e614d6f95249c3fb098.

Avoid triggering the oom killer with a simple du on a big xfs tree on i586.

Signed-off-by: Hans-Peter Jansen <[email protected]>

:100644 100644 52e06b4... a76fc01... M fs/xfs/linux-2.6/xfs_super.c

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 52e06b4..a76fc01 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -954,14 +954,16 @@ xfs_fs_destroy_inode(
ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));

/*
- * We always use background reclaim here because even if the
- * inode is clean, it still may be under IO and hence we have
- * to take the flush lock. The background reclaim path handles
- * this more efficiently than we can here, so simply let background
- * reclaim tear down all inodes.
+ * If we have nothing to flush with this inode then complete the
+ * teardown now, otherwise delay the flush operation.
*/
+ if (!xfs_inode_clean(ip)) {
+ xfs_inode_set_reclaim_tag(ip);
+ return;
+ }
+
out_reclaim:
- xfs_inode_set_reclaim_tag(ip);
+ xfs_ireclaim(ip);
}

/*

2010-04-06 23:12:08

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> Hi Dave,
>
> On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > >
> > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > highmem/lowmem split config problem to me, not anything to do with
> > > > XFS...

[snip]

> Dave, I really don't want to disappoint you, but a lengthy bisection session
> points to:
>
> 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> commit 57817c68229984818fea9e614d6f95249c3fb098
> Author: Dave Chinner <[email protected]>
> Date: Sun Jan 10 23:51:47 2010 +0000
>
> xfs: reclaim all inodes by background tree walks

Interesting. I did a fair bit of low-memory testing when I made that
change (admittedly none on a highmem i386 box), and since then I've
done lots of "millions of files" tree creates, traversals and destroys on
limited-memory machines without triggering problems when memory is
completely full of inodes.

Let me try to reproduce this on a small VM and I'll get back to you.

> diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
> index 52e06b4..a76fc01 100644
> --- a/fs/xfs/linux-2.6/xfs_super.c
> +++ b/fs/xfs/linux-2.6/xfs_super.c
> @@ -954,14 +954,16 @@ xfs_fs_destroy_inode(
> ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
>
> /*
> - * We always use background reclaim here because even if the
> - * inode is clean, it still may be under IO and hence we have
> - * to take the flush lock. The background reclaim path handles
> - * this more efficiently than we can here, so simply let background
> - * reclaim tear down all inodes.
> + * If we have nothing to flush with this inode then complete the
> + * teardown now, otherwise delay the flush operation.
> */
> + if (!xfs_inode_clean(ip)) {
> + xfs_inode_set_reclaim_tag(ip);
> + return;
> + }
> +
> out_reclaim:
> - xfs_inode_set_reclaim_tag(ip);
> + xfs_ireclaim(ip);
> }

I don't think that will work as expected in all situations - the
inode clean check there is not completely valid as the XFS inode
locks aren't held, so it can race with other operations that need
to complete before reclaim is done. This was one of the reasons for
pushing reclaim into the background....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-07 01:46:01

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Wed, Apr 07, 2010 at 09:11:44AM +1000, Dave Chinner wrote:
> On Tue, Apr 06, 2010 at 04:52:57PM +0200, Hans-Peter Jansen wrote:
> > Hi Dave,
> >
> > On Tuesday 06 April 2010, 01:06:00 Dave Chinner wrote:
> > > On Mon, Apr 05, 2010 at 01:35:41PM +0200, Hans-Peter Jansen wrote:
> > > > >
> > > > > Oh, this is a highmem box. You ran out of low memory, I think, which
> > > > > is where all the inodes are cached. Seems like a VM problem or a
> > > > > highmem/lowmem split config problem to me, not anything to do with
> > > > > XFS...
>
> [snip]
>
> > Dave, I really don't want to disappoint you, but a lengthy bisection session
> > points to:
> >
> > 57817c68229984818fea9e614d6f95249c3fb098 is the first bad commit
> > commit 57817c68229984818fea9e614d6f95249c3fb098
> > Author: Dave Chinner <[email protected]>
> > Date: Sun Jan 10 23:51:47 2010 +0000
> >
> > xfs: reclaim all inodes by background tree walks
>
> Interesting. I did a fair bit of low memory testing when i made that
> change (admittedly none on a highmem i386 box), and since then I've
> done lots of "millions of files" tree creates, traversals and destroys on
> limited memory machines without triggering problems when memory is
> completely full of inodes.
>
> Let me try to reproduce this on a small VM and I'll get back to you.

OK, if there is page cache pressure (e.g. creating small files or
grepping the resultant tree) or the machine has significant amounts
of memory (e.g. >= 4GB) then I can't reproduce this.

However, if the memory pressure is purely inode cache (creating zero
length files or read-only traversal), then the OOM killer kicks a
while after the slab cache fills memory. This doesn't need highmem;
I used a x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
this. I'll add zero length file tests and traversals to my low
memory testing.

The best way to fix this, I think, is to trigger a shrinker callback
when memory is low to run the background inode reclaim. The problem
is that these inode caches and the reclaim state are per-filesystem,
not global state, and the current shrinker interface only works with
global state.

Hence there are two patches to this fix - the first adds a context
to the shrinker callout, and the second adds the XFS infrastructure
to track the number of reclaimable inodes per filesystem and
register/unregister shrinkers for each filesystem.
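The shape of that interface change can be sketched in userspace C. All names here are illustrative stand-ins, not the actual kernel API: the idea is simply that each filesystem registers a callback plus a private context pointer, and a low-memory pass walks all registered shrinkers instead of relying on global state:

```c
/* Userspace sketch of a shrinker callback carrying per-filesystem
 * context, in contrast to the global-state-only interface described
 * above. All names are illustrative, not the real kernel API. */
#include <stdio.h>

#define MAX_SHRINKERS 8

struct shrinker {
    /* Try to free up to 'nr_to_scan' objects; return how many
     * reclaimable objects remain. 'ctx' is per-filesystem state. */
    long (*shrink)(void *ctx, long nr_to_scan);
    void *ctx;
};

static struct shrinker registered[MAX_SHRINKERS];
static int nr_registered;

void register_shrinker(long (*fn)(void *, long), void *ctx)
{
    registered[nr_registered].shrink = fn;
    registered[nr_registered].ctx = ctx;
    nr_registered++;
}

/* Low-memory pass: ask every registered shrinker to drop objects. */
long shrink_all(long nr_to_scan)
{
    long remaining = 0;
    for (int i = 0; i < nr_registered; i++)
        remaining += registered[i].shrink(registered[i].ctx, nr_to_scan);
    return remaining;
}

/* Per-filesystem reclaim state: a count of reclaimable inodes. */
struct fs_reclaim_state {
    long reclaimable_inodes;
};

static long fs_shrink(void *ctx, long nr_to_scan)
{
    struct fs_reclaim_state *fs = ctx;
    long freed = nr_to_scan < fs->reclaimable_inodes
                     ? nr_to_scan : fs->reclaimable_inodes;
    fs->reclaimable_inodes -= freed;
    return fs->reclaimable_inodes;
}
```

In this sketch, the context pointer is what lets one callback implementation serve many mounted filesystems, each with its own reclaimable-inode count.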

With these patches, my reproducible test case, which locked the
machine up with an OOM panic in a couple of minutes, has been running
for over half an hour. Even with limited testing, I have much more
confidence in this change than in reverting the background inode
reclaim, as the revert reintroduces the race described in my previous mail.

The patches below apply to the xfs-dev tree, which is currently at
34-rc1. If they don't apply, let me know and I'll redo them against
a vanilla kernel tree. Can you test them to see if the problem goes
away? If the problem is fixed, I'll push them for a proper review
cycle...

Cheers,

Dave.
--
Dave Chinner
[email protected]


Attachments:
(No filename) (2.95 kB)
shrinker-context (10.07 kB)
xfs-inode-reclaim-shrinker (6.35 kB)

2010-04-07 22:02:53

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
>
> However, if the memory pressure is purely inode cache (creating zero
> length files or read-only traversal), then the OOM killer kicks a
> while after the slab cache fills memory. This doesn't need highmem;
> I used a x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
> this. I'll add zero length file tests and traversals to my low
> memory testing.

I'm glad that you're able to reproduce it. My initial failure occurred during
a disk-to-disk backup (a simple cp -al & rsync combination).

> The best way to fix this, I think, is to trigger a shrinker callback
> when memory is low to run the background inode reclaim. The problem
> is that these inode caches and the reclaim state are per-filesystem,
> not global state, and the current shrinker interface only works with
> global state.
>
> Hence there are two patches to this fix - the first adds a context
> to the shrinker callout, and the second adds the XFS infrastructure
> to track the number of reclaimable inodes per filesystem and
> register/unregister shrinkers for each filesystem.

I see; the first one will be interesting to get into mainline, given the
number of subsystems that are involved.

> With these patches, my reproducable test case which locked the
> machine up with a OOM panic in a couple of minutes has been running
> for over half an hour. I have much more confidence in this change
> with limited testing than the reverting of the background inode
> reclaim as the revert introduces
>
> The patches below apply to the xfs-dev tree, which is currently at
> 34-rc1. If they don't apply, let me know and I'll redo them against
> a vanilla kernel tree. Can you test them to see if the problem goes
> away? If the problem is fixed, I'll push them for a proper review
> cycle...

Of course, you did the original patch for a reason... Therefore I would love
to test your patches. I've tried to apply them to 2.6.33.2, but after
fixing the same reject as noted below, I'm stuck here:

/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:
In function 'xfs_reclaim_inode_shrink':
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
error: implicit declaration of function 'xfs_perag_get'
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
warning: assignment makes pointer from integer without a cast
/usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807:
error: implicit declaration of function 'xfs_perag_put'

Now I see that the offending functions were renamed, and that they've also
grown a radix_tree structure and locking. How do I handle that?

BTW, your patches do not apply to Linus' current git tree either:
patching file fs/xfs/quota/xfs_qm.c
Hunk #1 succeeded at 72 (offset 3 lines).
Hunk #2 FAILED at 2120.
1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
I'm able to resolve this, but 2.6.34-current does give me some other
trouble that I need to get past (the PS/2 keyboard eventually stops working)..

Anyway, thanks for your great support, Dave. This is much appreciated.

Cheers,
Pete

2010-04-13 01:21:08

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> >
> > However, if the memory pressure is purely inode cache (creating zero
> > length files or read-only traversal), then the OOM killer kicks a
> > while after the slab cache fills memory. This doesn't need highmem;
> > I used a x86_64 kernel on a VM w/ 1GB RAM to reliably reproduce
> > this. I'll add zero length file tests and traversals to my low
> > memory testing.
>
> I'm glad, that you're able to reproduce it. My initial failure was during
> disk to disk backup (with a simple cp -al & rsync combination).
>
> > The best way to fix this, I think, is to trigger a shrinker callback
> > when memory is low to run the background inode reclaim. The problem
> > is that these inode caches and the reclaim state are per-filesystem,
> > not global state, and the current shrinker interface only works with
> > global state.
> >
> > Hence there are two patches to this fix - the first adds a context
> > to the shrinker callout, and the second adds the XFS infrastructure
> > to track the number of reclaimable inodes per filesystem and
> > register/unregister shrinkers for each filesystem.
>
> I see, the first one will be interesting to get into mainline, given the
> number of projects, that are involved.
>
> > With these patches, my reproducable test case which locked the
> > machine up with a OOM panic in a couple of minutes has been running
> > for over half an hour. I have much more confidence in this change
> > with limited testing than the reverting of the background inode
> > reclaim as the revert introduces
> >
> > The patches below apply to the xfs-dev tree, which is currently at
> > 34-rc1. If they don't apply, let me know and I'll redo them against
> > a vanilla kernel tree. Can you test them to see if the problem goes
> > away? If the problem is fixed, I'll push them for a proper review
> > cycle...
>
> Of course, you did the original patch for a reason... Therefor I would love
> to test your patches. I've tried to apply them to 2.6.33.2, but after
> fixing the same reject as noted below, I'm stuck here:
>
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:
> In function 'xfs_reclaim_inode_shrink':
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> error: implicit declaration of function 'xfs_perag_get'
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:805:
> warning: assignment makes pointer from integer without a cast
> /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/linux-2.6/xfs_sync.c:807:
> error: implicit declaration of function 'xfs_perag_put'
>
> Now I see, that there happened a rename of the offending functions, but also
> they've grown a radix_tree structure and locking. How do I handle that?

With difficulty. I'd need to backport it to match the .33 code,
which may or may not be trivial...

> BTW, your patches do not apply to Linus' current git tree either:
> patching file fs/xfs/quota/xfs_qm.c
> Hunk #1 succeeded at 72 (offset 3 lines).
> Hunk #2 FAILED at 2120.
> 1 out of 2 hunks FAILED -- saving rejects to file fs/xfs/quota/xfs_qm.c.rej
> I'm able to resolve this, but 2.6.34-current does give me some other
> trouble, that I need to get by (PS2 keyboard stops working eventually)..

Yeah, there's another patch in my xfs-dev tree that changes that.
I'll rebase it on a clean linux tree before I post it again.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 08:50:57

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Tuesday 13 April 2010, 00:32:41 Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> > On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> >
> > > The patches below apply to the xfs-dev tree, which is currently at
> > > 34-rc1. If they don't apply, let me know and I'll redo them against
> > > a vanilla kernel tree. Can you test them to see if the problem goes
> > > away? If the problem is fixed, I'll push them for a proper review
> > > cycle...
> >
> > Of course, you did the original patch for a reason... Therefor I would
> > love to test your patches. I've tried to apply them to 2.6.33.2, but
> > after fixing the same reject as noted below, I'm stuck here:
> >
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c: In function 'xfs_reclaim_inode_shrink':
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:805: error: implicit declaration of function
> > 'xfs_perag_get'
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:805: warning: assignment makes pointer from integer
> > without a cast
> > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> >ux-2.6/xfs_sync.c:807: error: implicit declaration of function
> > 'xfs_perag_put'
> >
> > Now I see, that there happened a rename of the offending functions, but
> > also they've grown a radix_tree structure and locking. How do I handle
> > that?
>
> With difficulty. I'd need to backport it to match the .33 code,
> which may or may not be trivial...

Dave, may I kindly ask you to briefly elaborate on the worst consequences
of just reverting this hunk, as I did before?

> > BTW, your patches do not apply to Linus' current git tree either:
> > patching file fs/xfs/quota/xfs_qm.c
> > Hunk #1 succeeded at 72 (offset 3 lines).
> > Hunk #2 FAILED at 2120.
> > 1 out of 2 hunks FAILED -- saving rejects to file
> > fs/xfs/quota/xfs_qm.c.rej I'm able to resolve this, but 2.6.34-current
> > does give me some other trouble, that I need to get by (PS2 keyboard
> > stops working eventually)..
>
> Yeah, there's another patch in my xfs-dev tree that changes that.
> I'll rebase it on a clean linux tree before I post it again.

I've briefly tested this with a codebase somewhere between -rc3 and -rc4,
and it survived the du test, but it suffered from some strange network
dropouts, which aren't funny on an NFS server...

Will retest your patches after openSUSE-current has caught up with -rc4.
Hopefully, the most blatant stability issues are fixed by then.

Cheers,
Pete

2010-04-13 09:18:39

by Dave Chinner

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 00:32:41 Dave Chinner wrote:
> > On Thu, Apr 08, 2010 at 12:02:20AM +0200, Hans-Peter Jansen wrote:
> > > On Wednesday 07 April 2010, 03:45:33 Dave Chinner wrote:
> > >
> > > > The patches below apply to the xfs-dev tree, which is currently at
> > > > 34-rc1. If they don't apply, let me know and I'll redo them against
> > > > a vanilla kernel tree. Can you test them to see if the problem goes
> > > > away? If the problem is fixed, I'll push them for a proper review
> > > > cycle...
> > >
> > > Of course, you did the original patch for a reason... Therefor I would
> > > love to test your patches. I've tried to apply them to 2.6.33.2, but
> > > after fixing the same reject as noted below, I'm stuck here:
> > >
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> > >ux-2.6/xfs_sync.c: In function 'xfs_reclaim_inode_shrink':
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> > >ux-2.6/xfs_sync.c:805: error: implicit declaration of function
> > > 'xfs_perag_get'
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> > >ux-2.6/xfs_sync.c:805: warning: assignment makes pointer from integer
> > > without a cast
> > > /usr/src/packages/BUILD/kernel-default-2.6.33.2/linux-2.6.33/fs/xfs/lin
> > >ux-2.6/xfs_sync.c:807: error: implicit declaration of function
> > > 'xfs_perag_put'
> > >
> > > Now I see, that there happened a rename of the offending functions, but
> > > also they've grown a radix_tree structure and locking. How do I handle
> > > that?
> >
> > With difficulty. I'd need to backport it to match the .33 code,
> > which may or may not be trivial...
>
> Dave, may I ask you kindly for briefly elaborating on the worst consequences
> of just reverting this hunk, as I've done before?

Well, given that it is the new shrinker code generating the errors,
reverting/removing that hunk will render the patch useless :0

I'll get you a working 2.6.33 patch tomorrow - it's dinner time
now....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-13 09:42:53

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> >
> > Dave, may I ask you kindly for briefly elaborating on the worst
> > consequences of just reverting this hunk, as I've done before?
>
> Well, given that is the new shrinker code generating the warnings,
> reverting/removing that hunk will render the patch useless :0

Excuse me, I didn't express myself well. I'm after the consequences of
applying the revert that I posted a few messages above.

> I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> now....

Cool, thanks.

Have a nice dinner,
Pete


2010-04-24 16:44:33

by Hans-Peter Jansen

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Tuesday 13 April 2010, 11:42:33 Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> > On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> > > Dave, may I ask you kindly for briefly elaborating on the worst
> > > consequences of just reverting this hunk, as I've done before?
> >
> > Well, given that is the new shrinker code generating the warnings,
> > reverting/removing that hunk will render the patch useless :0
>
> Excuse me, I didn't express myself well. I'm after the consequences of
> applying the revert, that I posted a few messages above.
>
> > I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> > now....
>
> Cool, thanks.

Obviously, and not totally unexpectedly, really fixing this is going to take
more time.

FYI, 2.6.33.2 is still affected by this issue.

Greg, you might search for a server using XFS filesystems and an i586
kernel >= 2.6.33 (2.6.32.11 of SLE11-SP1 will serve as well), log in as an
ordinary user, do a "du" on /usr, and wait for the other users to start screaming...

BTW, all affected kernels available from
http://download.opensuse.org/repositories/home:/frispete have the
offending patch reverted (see subject) and run fine for me (in this
respect).

Will you guys pass by another round of stable fixes without doing anything
on this issue?

Dave, this is why I'm kindly asking you: what might be the worst
consequences if we just do the revert for now (at least for 2.6.33), until
you and Nick come to a final decision on how to solve this issue in the
future?

Just a brief note would be fine.

Cheers,
Pete

2010-04-24 21:23:39

by Emmanuel Florac

Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:

> Greg, you might search for a server using xfs filesystems and and a
> i586 kernel >= 2.6.33, (2.6.32.11 of SLE11-SP1 will serve as well),
> log in as an ordinary user, do a "du" on /usr, and wait for the other
> users screaming...

I did precisely that and didn't notice anything special (du on a kernel
source tree): kernel 2.6.32.11, deadline scheduler, 7-drive RAID-6
array, 8 GB RAM.

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <[email protected]>
| +33 1 78 94 84 02
------------------------------------------------------------------------

2010-04-24 22:30:55

by Hans-Peter Jansen

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
> > Greg, you might search for a server using xfs filesystems and an
> > i586 kernel >= 2.6.33 (the 2.6.32.11 of SLE11-SP1 will serve as well),
> > log in as an ordinary user, do a "du" on /usr, and wait for the other
> > users to start screaming...
>
> I did precisely that, and didn't notice anything special (du on kernel
> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
> array, 8GB RAM.

I guess you're not on this specific openSUSE git version of 2.6.32.11 (i.e.
the one prepared for SP1 of SLE11), which, as usual, carries a lot of stuff
from later kernels. The offending patch was included in linux-2.6.33
between -rc4 and -rc5:

Committer:   Alex Elder <[email protected]>
Author:      Dave Chinner <[email protected]>
Author date: 11.01.10 00:51
Parent:      xfs: Avoid inodes in reclaim when flushing from inode cache
Child:       xfs: Remove inode iolock held check during allocation
Branch:      master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
Branch:      2.6.33.1 (Linux 2.6.33)
Follows:     v2.6.33-rc4 (Linux 2.6.33-rc4)
Precedes:    v2.6.33-rc5 (Linux 2.6.33-rc5)

Cheers,
Pete

2010-04-24 22:40:17

by Justin P. Mattock

[permalink] [raw]
Subject: Re: [opensuse-kernel] Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On 04/24/2010 03:30 PM, Hans-Peter Jansen wrote:
> On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
>> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
>>> Greg, you might search for a server using xfs filesystems and an
>>> i586 kernel >= 2.6.33 (the 2.6.32.11 of SLE11-SP1 will serve as well),
>>> log in as an ordinary user, do a "du" on /usr, and wait for the other
>>> users to start screaming...
>>
>> I did precisely that, and didn't notice anything special (du on kernel
>> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
>> array, 8GB RAM.
>
> I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g.
> the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff
> from later kernels. The offending patch was included in linux-2.6.33
> between -rc4 and -rc5:
>
> Committer:   Alex Elder <[email protected]>
> Author:      Dave Chinner <[email protected]>
> Author date: 11.01.10 00:51
> Parent:      xfs: Avoid inodes in reclaim when flushing from inode cache
> Child:       xfs: Remove inode iolock held check during allocation
> Branch:      master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
> Branch:      2.6.33.1 (Linux 2.6.33)
> Follows:     v2.6.33-rc4 (Linux 2.6.33-rc4)
> Precedes:    v2.6.33-rc5 (Linux 2.6.33-rc5)
>
> Cheers,
> Pete


Is this bisectable?
From what I remember with 2.6.33, looking at the bug reports I don't
recall any issue in regards to firmware-related stuff for radeon
(but I could be wrong).

Keep in mind, I don't have your card, but I do have the X1600,
which has had no issues so far (running the latest HEAD).
Does changing the .config work for you?
(in regards to what the thread I posted had mentioned)

As for the openSUSE SP1 of SLE11.. glad there is
an option to load the latest kernel.

Justin P. Mattock

2010-04-24 22:41:33

by Justin P. Mattock

[permalink] [raw]
Subject: Re: [opensuse-kernel] Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On 04/24/2010 03:30 PM, Hans-Peter Jansen wrote:
> On Saturday 24 April 2010, 23:23:20 Emmanuel Florac wrote:
>> On Sat, 24 Apr 2010 18:44:22 +0200, you wrote:
>>> Greg, you might search for a server using xfs filesystems and an
>>> i586 kernel >= 2.6.33 (the 2.6.32.11 of SLE11-SP1 will serve as well),
>>> log in as an ordinary user, do a "du" on /usr, and wait for the other
>>> users to start screaming...
>>
>> I did precisely that, and didn't notice anything special (du on kernel
>> source tree) kernel 2.6.32.11, deadline scheduler, 7 drives RAID-6
>> array, 8GB RAM.
>
> I guess, you're not on this specific openSUSE git version of 2.6.32.11 (e.g.
> the preparation for SP1 of SLE11), which, as usual, carries a lot of stuff
> from later kernels. The offending patch was included in linux-2.6.33
> between -rc4 and -rc5:
>
> Committer:   Alex Elder <[email protected]>
> Author:      Dave Chinner <[email protected]>
> Author date: 11.01.10 00:51
> Parent:      xfs: Avoid inodes in reclaim when flushing from inode cache
> Child:       xfs: Remove inode iolock held check during allocation
> Branch:      master origin (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/...)
> Branch:      2.6.33.1 (Linux 2.6.33)
> Follows:     v2.6.33-rc4 (Linux 2.6.33-rc4)
> Precedes:    v2.6.33-rc5 (Linux 2.6.33-rc5)
>
> Cheers,
> Pete

sh*t.. my bad just replied to the wrong
thread..

I'm tired.. please ignore this.

Justin P. Mattock

2010-04-25 13:05:13

by Emmanuel Florac

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sun, 25 Apr 2010 00:30:40 +0200, you wrote:

> I guess, you're not on this specific openSUSE git version of
> 2.6.32.11 (e.g. the preparation for SP1 of SLE11), which, as usual,
> carries a lot of stuff from later kernels.

No, I always use pristine unpatched kernel.org releases, no SELinux, no
nothing. Just another confirmation that I should keep going this way :)

--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <[email protected]>
| +33 1 78 94 84 02
------------------------------------------------------------------------

2010-04-25 16:27:06

by Greg KH

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> FYI, 2.6.33.2 is still affected by this issue.

Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as there are in
2.6.32.12-rc2).

thanks,

greg k-h

2010-04-25 16:57:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > FYI, 2.6.33.2 is still affected by this issue.
>
> Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as are in
> 2.6.32.12-rc2.)

Yes. It's not even in mainline yet as Nick doesn't like the trivial
core VM fix required to solve this in a clean way.

2010-04-25 18:17:55

by Greg KH

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sun, Apr 25, 2010 at 12:57:35PM -0400, Christoph Hellwig wrote:
> On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> > On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > > FYI, 2.6.33.2 is still affected by this issue.
> >
> > Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as are in
> > 2.6.32.12-rc2.)
>
> Yes. It's not even in mainline yet as Nick doesn't like the trivial
> core VM fix required to solve this in a clean way.

Hm, Nick, why? This seems like a real problem, easily reproduced. Is
it solved some other way in Linus's tree that we could backport to the
-stable series?

thanks,

greg k-h

2010-04-26 00:32:26

by Dave Chinner

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> On Tuesday 13 April 2010, 11:42:33 Hans-Peter Jansen wrote:
> > On Tuesday 13 April 2010, 11:18:23 Dave Chinner wrote:
> > > On Tue, Apr 13, 2010 at 10:50:35AM +0200, Hans-Peter Jansen wrote:
> > > > Dave, may I ask you kindly for briefly elaborating on the worst
> > > > consequences of just reverting this hunk, as I've done before?
> > >
> > > Well, given that it is the new shrinker code generating the warnings,
> > > reverting/removing that hunk will render the patch useless :0
> >
> > Excuse me, I didn't express myself well. I'm after the consequences of
> > applying the revert, that I posted a few messages above.
> >
> > > I'll get you a working 2.6.33 patch tomorrow - it's dinner time
> > > now....
> >
> > Cool, thanks.
>
> Obviously, and not totally unexpectedly, really fixing this is going to take
> more time.

The problem is that the fix I did has been rejected by the upstream
VM guys, and the stable rules are that fixes have to be in mainline
before they can be put in a stable release. So, until we get a fix
in mainline, it can't be fixed in the -stable kernels.

> FYI, 2.6.33.2 is still affected by this issue.
>
> Greg, you might search for a server using xfs filesystems and an i586
> kernel >= 2.6.33 (the 2.6.32.11 of SLE11-SP1 will serve as well), log in as an
> ordinary user, do a "du" on /usr, and wait for the other users to start screaming...

Yet there's only been one report of the problem. While that doesn't
make it any less serious, I don't think the problem you're reporting
is as widespread as you are making it out to be. We'll get the fix
done and upstream, and then it will go back to the stable kernel.

You could always apply the *tested* patches I posted that fix
the problem, as....

> BTW, all affected kernels available from
> http://download.opensuse.org/repositories/home:/frispete have the
> offending patch reverted (see subject) and run fine for me (in this
> respect).

... you seem to be capable of doing so.

> Will you guys let another round of stable fixes pass without doing anything
> about this issue?

If the process of getting the fix upstream takes longer than another
stable release cycle, then yes. I'm sorry, but I can't control the
process, and if someone takes a week to NACK a fix, then you're just
going to have to wait longer. Feel free to run the fix in the
meantime - testing it, even if it was NACKed, will still help us,
because if it fixes your problem we know that we are fixing the
_right problem_.

If you can't live with this, then you shouldn't be running the
latest and greatest kernels in your production environment....

> Dave, this is why I'm kindly asking you: what might be the worst
> consequences if we just do the revert for now (at least for 2.6.33), until
> you and Nick come to a final decision on how to solve this issue in the
> future?

I've already told you - you could be reintroducing all the really
hard to reproduce inode reclaim problems (oops, hangs, panics,
potentially even fs corruption) that the patch in question was part
of the fix for. You're running code that changes reclaim in very
subtle ways and has not been tested upstream in any way - if it
breaks you get to keep all the broken pieces to yourself...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-26 00:36:48

by Dave Chinner

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > FYI, 2.6.33.2 is still affected by this issue.
>
> Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as are in
> 2.6.32.12-rc2.)

The fix is not in Linus' kernel yet, Greg. So once that is done,
I'll have to backport the fix back to those stable kernels as well.
It's not a trivial fix, so it will miss this round of stable
releases....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-04-27 00:02:18

by Greg KH

[permalink] [raw]
Subject: Re: 2.6.34-rc3: simple du (on a big xfs tree) triggers oom killer [bisected: 57817c68229984818fea9e614d6f95249c3fb098]

On Mon, Apr 26, 2010 at 10:36:37AM +1000, Dave Chinner wrote:
> On Sun, Apr 25, 2010 at 09:27:40AM -0700, Greg KH wrote:
> > On Sat, Apr 24, 2010 at 06:44:22PM +0200, Hans-Peter Jansen wrote:
> > > FYI, 2.6.33.2 is still affected by this issue.
> >
> > Is 2.6.33.3-rc2 affected? A lot of xfs patches are in there (as are in
> > 2.6.32.12-rc2.)
>
> The fix is not in Linus' kernel yet, Greg. So once that is done,
> I'll have to backport the fix back to those stable kernels as well.
> It's not a trivial fix, so it will miss this round of stable
> releases....

Ok, that's fine, just checking. I'm in no rush, I have plenty of other
patches to queue up for the next stable releases :)

thanks,

greg k-h