2005-03-15 20:48:42

by Noah Meyerhans

[permalink] [raw]
Subject: OOM problems with 2.6.11-rc4

Hello. We have a server, currently running 2.6.11-rc4, that is
experiencing similar OOM problems to those described at
http://groups-beta.google.com/group/fa.linux.kernel/msg/9633559fea029f6e
and discussed further by several developers here (the summary is at
http://www.kerneltraffic.org/kernel-traffic/kt20050212_296.html#6). We
are running 2.6.11-rc4 because it contains the patches that Andrea
mentioned in the kerneltraffic link. The problem was present in 2.6.10
as well. We can try newer 2.6 kernels if it helps.

The machine in question is a dual Xeon system with 2 GB of RAM, 3.5 GB
of swap, and several TB of NFS exported filesystems. One notable point
is that this machine has been running in overcommit mode 2
(/proc/sys/vm/overcommit_memory = 2) and the OOM killer is still being
triggered, which is allegedly not supposed to be possible according to
the kerneltraffic.org document above. We had been running in overcommit
mode 0 until about a month ago, and experienced similar OOM problems
then as well.
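
(For reference, overcommit mode 2 caps the committed address space at
swap plus overcommit_ratio percent of RAM; with the default ratio of 50
that works out to roughly 3.5 GB + 1 GB, i.e. about 4.5 GB on this box,
which is well above anything this workload actually commits.)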

The problem can be somewhat reliably triggered by running our backup
software on a particular filesystem. The backup software attempts to
keep the entire file list in memory, and this filesystem contains
several million files, so lots of memory is being allocated.

The server experienced these problems today and we captured the kernel
output, which is included below. Note that this machine has not used
very much swap at all, and we've never observed it completely running
out of swap.

Note that in this kernel output, the last memory dump is from the magic
SysRq key. By the time we've reached this point, the machine is
unresponsive and our next action is to trigger a sync+reboot via the
SysRq key.

File content:
057 slab:220275 mapped:12395 pagetables:118
DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:696kB present:16384kB pages_scanned:1203 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:3744kB min:3756kB low:4692kB high:5632kB active:0kB inactive:368kB present:901120kB pages_scanned:683 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:896kB min:512kB low:640kB high:768kB active:50076kB inactive:1121156kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 1*4kB 2*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Normal: 0*4kB 10*8kB 1*16kB 2*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3744kB
HighMem: 82*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 896kB
Swap cache: add 2582, delete 2011, find 276/524, race 0+0
Free swap = 3610572kB
Total swap = 3615236kB
Out of Memory: Killed process 1188 (exim).
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

HighMem per-cpu:

cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

Free pages: 9196kB (1856kB HighMem)
Active:12382 inactive:280459 dirty:214 writeback:0 unstable:0 free:2299 slab:220221 mapped:12256 pagetables:122
DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:736kB present:16384kB pages_scanned:5706 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:3752kB min:3756kB low:4692kB high:5632kB active:0kB inactive:368kB present:901120kB pages_scanned:6943 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:1856kB min:512kB low:640kB high:768kB active:49528kB inactive:1120732kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 3*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Normal: 0*4kB 11*8kB 1*16kB 2*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3752kB
HighMem: 204*4kB 36*8kB 9*16kB 3*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1856kB
Swap cache: add 2582, delete 2011, find 276/524, race 0+0
Free swap = 3610572kB
Total swap = 3615236kB
Out of Memory: Killed process 17905 (terad).
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1

Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

Free pages: 21804kB (14464kB HighMem)
Active:9243 inactive:280452 dirty:214 writeback:0 unstable:0 free:5451 slab:220222 mapped:9110 pagetables:115
DMA free:3588kB min:68kB low:84kB high:100kB active:28kB inactive:708kB present:16384kB pages_scanned:5739 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:3752kB min:3756kB low:4692kB high:5632kB active:0kB inactive:368kB present:901120kB pages_scanned:6943 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:14464kB min:512kB low:640kB high:768kB active:36944kB inactive:1120732kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 3*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Normal: 0*4kB 11*8kB 1*16kB 2*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3752kB
HighMem: 1824*4kB 572*8kB 122*16kB 4*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 14464kB
Swap cache: add 2582, delete 2047, find 276/524, race 0+0
Free swap = 3612564kB
Total swap = 3615236kB
Out of Memory: Killed process 19442 (terad).

SysRq : HELP : loglevel0-8 reBoot tErm kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

Free pages: 21380kB (14016kB HighMem)
Active:9400 inactive:280265 dirty:35 writeback:0 unstable:0 free:5345 slab:220381 mapped:9238 pagetables:139
DMA free:3604kB min:68kB low:84kB high:100kB active:80kB inactive:48kB present:16384kB pages_scanned:5968 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:3760kB min:3756kB low:4692kB high:5632kB active:60kB inactive:276kB present:901120kB pages_scanned:2019134 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:14016kB min:512kB low:640kB high:768kB active:37460kB inactive:1120736kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 5*4kB 2*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3604kB
Normal: 0*4kB 12*8kB 1*16kB 2*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3760kB
HighMem: 1712*4kB 572*8kB 122*16kB 4*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 14016kB
Swap cache: add 2582, delete 2047, find 276/524, race 0+0
Free swap = 3612564kB
Total swap = 3615236kB
Free swap: 3612564kB
524160 pages of RAM
294768 pages of HIGHMEM
5496 reserved pages
22159 pages shared
535 pages swap cached

--
Noah Meyerhans System Administrator
MIT Computer Science and Artificial Intelligence Laboratory


2005-03-15 21:59:02

by Sean

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Tue, March 15, 2005 3:44 pm, Noah Meyerhans said:
> Hello. We have a server, currently running 2.6.11-rc4, that is
> experiencing similar OOM problems to those described at
> http://groups-beta.google.com/group/fa.linux.kernel/msg/9633559fea029f6e
> and discussed further by several developers here (the summary is at
> http://www.kerneltraffic.org/kernel-traffic/kt20050212_296.html#6) We
> are running 2.6.11-rc4 because it contains the patches that Andrea
> mentioned in the kerneltraffic link. The problem was present in 2.6.10
> as well. We can try newer 2.6 kernels if it helps.
>
> The machine in question is a dual Xeon system with 2 GB of RAM, 3.5 GB
> of swap, and several TB of NFS exported filesystems. One notable point
> is that this machine has been running in overcommit mode 2
> (/proc/sys/vm/overcommit_memory = 2) and the OOM killer is still being
> triggered, which is allegedly not supposed to be possible according to
> the kerneltraffic.org document above. We had been running in overcommit
> mode 0 until about a month ago, and experienced similar OOM problems
> then as well.

We're seeing this on our dual Xeon box too, with 4 GB of RAM and 2GB of
swap (no NFS) using stock RHEL 4 kernel. The only thing that seems to
keep it from happening is setting /proc/sys/vm/vfs_cache_pressure to
10000.

Sean


2005-03-15 22:15:46

by Lee Revell

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Tue, 2005-03-15 at 16:56 -0500, Sean wrote:
> On Tue, March 15, 2005 3:44 pm, Noah Meyerhans said:
> > The machine in question is a dual Xeon system with 2 GB of RAM, 3.5 GB
> > of swap, and several TB of NFS exported filesystems. One notable point
> > is that this machine has been running in overcommit mode 2
> > (/proc/sys/vm/overcommit_memory = 2) and the OOM killer is still being
> > triggered, which is allegedly not supposed to be possible according to
> > the kerneltraffic.org document above. We had been running in overcommit
> > mode 0 until about a month ago, and experienced similar OOM problems
> > then as well.
>
> We're seeing this on our dual Xeon box too, with 4 GB of RAM and 2GB of
> swap (no NFS) using stock RHEL 4 kernel. The only thing that seems to
> keep it from happening is setting /proc/sys/vm/vfs_cache_pressure to
> 10000.

I suspect I hit this too on a smaller (UP) machine with 512MB RAM/512MB
swap while stress testing RT stuff with dbench and massively parallel
makes. The OOM seemed to trigger way before the machine filled up swap.
I dismissed it at the time, but maybe there's something there.

Lee

2005-03-15 23:53:06

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Noah Meyerhans <[email protected]> wrote:
>
> Active:12382 inactive:280459 dirty:214 writeback:0 unstable:0 free:2299 slab:220221 mapped:12256 pagetables:122

Vast amounts of slab - presumably inode and dentries.

What sort of local filesystems are in use?

Can you take a copy of /proc/slabinfo when the backup has run for a while,
and send it?

It's useful to run `watch -n1 cat /proc/meminfo' to see what the various
caches are doing during the operation.

Also, run slabtop if you have it. Or bloatmeter
(http://www.zip.com.au/~akpm/linux/patches/stuff/bloatmon and
http://www.zip.com.au/~akpm/linux/patches/stuff/bloatmeter). The thing to
watch for here is the internal fragmentation of the slab caches:

dentry_cache: 76505KB 82373KB 92.87

93% is good. Sometimes it gets much worse - very regular directory
patterns can trigger high fragmentation levels.
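
If you don't have either of those handy, a quick-and-dirty equivalent is
easy enough - the sketch below is untested and assumes the 2.6
/proc/slabinfo layout of "name active_objs num_objs objsize ...":

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512], name[64];
	unsigned long active, num, objsize;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f) {
		perror("/proc/slabinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* skip the version line and the "# name ..." header */
		if (line[0] == '#' || !strncmp(line, "slabinfo", 8))
			continue;
		if (sscanf(line, "%63s %lu %lu %lu",
			   name, &active, &num, &objsize) != 4 || num == 0)
			continue;
		/* used-KB total-KB percent-in-use, bloatmeter-style */
		printf("%-20s %8luKB %8luKB %6.2f\n", name,
		       active * objsize / 1024, num * objsize / 1024,
		       100.0 * active / num);
	}
	fclose(f);
	return 0;
}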

Does increasing /proc/sys/vm/vfs_cache_pressure help? If you're watching
/proc/meminfo you should be able to observe the effect of that upon the
Slab: figure.

2005-03-16 00:31:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Tue, Mar 15, 2005 at 03:44:13PM -0500, Noah Meyerhans wrote:
> Hello. We have a server, currently running 2.6.11-rc4, that is
> experiencing similar OOM problems to those described at
> http://groups-beta.google.com/group/fa.linux.kernel/msg/9633559fea029f6e
> and discussed further by several developers here (the summary is at
> http://www.kerneltraffic.org/kernel-traffic/kt20050212_296.html#6) We
> are running 2.6.11-rc4 because it contains the patches that Andrea
> mentioned in the kerneltraffic link. The problem was present in 2.6.10
> as well. We can try newer 2.6 kernels if it helps.

Thanks for testing the new code, but unfortunately the problem you're
facing is a different one. It's still definitely another VM bug though.

While looking into your bug I identified for sure a bug in how the VM
sets all_unreclaimable: the VM is setting all_unreclaimable on the
normal zone without any care about the progress we're making at freeing
the slab. Once all_unreclaimable is set, it's pretty much too late to
avoid going OOM. all_unreclaimable truly means OOM, so we must be
extremely careful when we set it (for sure the slab progress must be
taken into account).

We also want kswapd to help us in freeing the slab in the background
instead of erroneously giving it up if some slab cache is still
freeable.

Once all_unreclaimable is set, then shrink_caches will stop calling
shrink_zone for anything but the lowest prio, and this will lead to
sc.nr_scanned to be small, and this will lead to shrink_slab to get a
small parameter too.


In short I think we can start by trying this fix (which has some risk,
since now it might become harder to detect an oom condition, but I don't
see many other ways to take the slab progress into account without major
changes). Perhaps another way would be to check for
total_reclaimed < SWAP_CLUSTER_MAX, but the one I used in the patch is
much safer for your purposes (even if less safe in terms of not running
into livelocks).

Beware, this is absolutely untested and it may not be enough. Perhaps there
are more bugs in the same area (shrink_slab itself seems needlessly
complicated, and the different methods return inconsistent values:
dcache returns a percentage of the free entries, while dquot also
returns the allocated in-use entries, which makes the whole API
look unreliable).

Signed-off-by: Andrea Arcangeli <[email protected]>

--- x/mm/vmscan.c.~1~ 2005-03-14 05:02:17.000000000 +0100
+++ x/mm/vmscan.c 2005-03-16 01:28:16.000000000 +0100
@@ -1074,8 +1074,9 @@ scan:
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
continue;
- if (zone->pages_scanned >= (zone->nr_active +
- zone->nr_inactive) * 4)
+ if (!reclaim_state->reclaimed_slab &&
+ zone->pages_scanned >= (zone->nr_active +
+ zone->nr_inactive) * 4)
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and

This below is an untested attempt at bringing dquot a bit more in line
with the API, to make the whole thing a bit more consistent, though I
doubt you're using quotas, so it's only the above one that's going to be
interesting for you to test.

Signed-off-by: Andrea Arcangeli <[email protected]>

--- x/fs/dquot.c.~1~ 2005-03-08 01:02:13.000000000 +0100
+++ x/fs/dquot.c 2005-03-16 01:18:19.000000000 +0100
@@ -510,7 +510,7 @@ static int shrink_dqcache_memory(int nr,
spin_lock(&dq_list_lock);
if (nr)
prune_dqcache(nr);
- ret = dqstats.allocated_dquots;
+ ret = (dqstats.free_dquots / 100) * sysctl_vfs_cache_pressure;
spin_unlock(&dq_list_lock);
return ret;
}

Let us know if this helps in any way or not. Thanks!

2005-03-16 11:12:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Wed, Mar 16, 2005 at 01:31:34AM +0100, Andrea Arcangeli wrote:
> In short I think we can start by trying this fix (which has some risk,
> since now it might become harder to detect an oom condition, but I don't

Some testing shows that oom conditions are still detected fine (I
expected this but I wasn't completely sure until I tested it ;). Now the
main question is if this is enough to fix your problem or if there are
more hidden bugs in the same area.

2005-03-16 12:05:14

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Andrea Arcangeli <[email protected]> wrote:
>
> the VM is setting all_unreclaimable on the
> normal zone without any care about the progress we're making at freeing
> the slab.

Urgh, I didn't notice that all_unreclaimable is set.

> Beware this absolutely untested and it may not be enough. Perhaps there
> are more bugs in the same area (the shrink_slab itself seems overkill
> complicated for no good reason and different methods returns random
> stuff, dcache returns a percentage of the free entries, dquot instead
> returns the allocated inuse entries too which makes the whole API
> looking unreliable).

No, the two functions are equivalent for the default value of
vfs_cache_pressure (100) - it's not a percentage. It's just that we forgot
about the quota cache when adding the tunable. And mbcache, come to that.
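
To spell out the arithmetic (standalone toy below, not kernel code): the
(count / 100) * sysctl_vfs_cache_pressure scaling is a no-op at the
default of 100, and larger values just inflate the number of objects the
shrinker reports as reclaimable - which is why setting the sysctl to
10000 makes the VM lean so much harder on these caches:

#include <stdio.h>

/* the scaling the dcache shrinker applies; the dquot and mbcache fixes
 * in this thread apply the same formula */
static unsigned long scaled(unsigned long count, unsigned long pressure)
{
	return (count / 100) * pressure;
}

int main(void)
{
	unsigned long free_entries = 500000;	/* arbitrary example count */

	printf("pressure   100: %lu\n", scaled(free_entries, 100));
	printf("pressure 10000: %lu\n", scaled(free_entries, 10000));
	return 0;
}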

> Signed-off-by: Andrea Arcangeli <[email protected]>
>
> --- x/mm/vmscan.c.~1~ 2005-03-14 05:02:17.000000000 +0100
> +++ x/mm/vmscan.c 2005-03-16 01:28:16.000000000 +0100
> @@ -1074,8 +1074,9 @@ scan:
> total_scanned += sc.nr_scanned;
> if (zone->all_unreclaimable)
> continue;
> - if (zone->pages_scanned >= (zone->nr_active +
> - zone->nr_inactive) * 4)

A change we made a while back effectively doubles the rate at which
pages_scanned gets incremented here (we now account for the active list as
well as the inactive list). So this should be *8 to make it more
equivalent to the old code. Not that this is likely to make much
difference.


> + if (!reclaim_state->reclaimed_slab &&
> + zone->pages_scanned >= (zone->nr_active +
> + zone->nr_inactive) * 4)
> zone->all_unreclaimable = 1;

That might not change anything because we clear ->all_unreclaimable in
free_page_bulk(). Although that is behind the per-cpu-pages, so there will
be some lag. And this change will cause us to not bale out of reclaim..

Still, I think it would make more sense to return a success indication from
shrink_slab() if we actually freed any slab objects. That will prevent us
from incorrectly going all_unreclaimable if all we happen to be doing is
increasing slab internal fragmentation.

We could do that kludgily by re-polling the shrinker but it would be better
to return a second value from all the shrinkers.

> --- x/fs/dquot.c.~1~ 2005-03-08 01:02:13.000000000 +0100
> +++ x/fs/dquot.c 2005-03-16 01:18:19.000000000 +0100
> @@ -510,7 +510,7 @@ static int shrink_dqcache_memory(int nr,
> spin_lock(&dq_list_lock);
> if (nr)
> prune_dqcache(nr);
> - ret = dqstats.allocated_dquots;
> + ret = (dqstats.free_dquots / 100) * sysctl_vfs_cache_pressure;
> spin_unlock(&dq_list_lock);
> return ret;
> }

yup.

2005-03-16 12:17:55

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Andrew Morton <[email protected]> wrote:
>
> Still, I think it would make more sense to return a success indication from
> shrink_slab() if we actually freed any slab objects. That will prevent us
> from incorrectly going all_unreclaimable if all we happen to be doing is
> increasing slab internal fragmentation.
>
> We could do that kludgily by re-polling the shrinker but it would be better
> to return a second value from all the shrinkers.

This is the kludgy version.

--- 25/mm/vmscan.c~vmscan-notice-slab-shrinking 2005-03-16 04:12:49.000000000 -0800
+++ 25-akpm/mm/vmscan.c 2005-03-16 04:14:02.000000000 -0800
@@ -180,17 +180,20 @@ EXPORT_SYMBOL(remove_shrinker);
* `lru_pages' represents the number of on-LRU pages in all the zones which
* are eligible for the caller's allocation attempt. It is used for balancing
* slab reclaim versus page reclaim.
+ *
+ * Returns the number of slab objects which we shrunk.
*/
static int shrink_slab(unsigned long scanned, unsigned int gfp_mask,
unsigned long lru_pages)
{
struct shrinker *shrinker;
+ int ret = 0;

if (scanned == 0)
scanned = SWAP_CLUSTER_MAX;

if (!down_read_trylock(&shrinker_rwsem))
- return 0;
+ return 1; /* Assume we'll be able to shrink next time */

list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
@@ -209,10 +212,14 @@ static int shrink_slab(unsigned long sca
while (total_scan >= SHRINK_BATCH) {
long this_scan = SHRINK_BATCH;
int shrink_ret;
+ int nr_before;

+ nr_before = (*shrinker->shrinker)(0, gfp_mask);
shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
if (shrink_ret == -1)
break;
+ if (shrink_ret < nr_before)
+ ret += nr_before - shrink_ret;
mod_page_state(slabs_scanned, this_scan);
total_scan -= this_scan;

@@ -222,7 +229,7 @@ static int shrink_slab(unsigned long sca
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
- return 0;
+ return ret;
}

/* Called without lock on whether page is mapped, so answer is unstable */
@@ -1077,6 +1084,7 @@ scan:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
+ int nr_slab;

if (zone->present_pages == 0)
continue;
@@ -1098,14 +1106,15 @@ scan:
sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+ nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
+ lru_pages);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_reclaimed += sc.nr_reclaimed;
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
continue;
- if (zone->pages_scanned >= (zone->nr_active +
- zone->nr_inactive) * 4)
+ if (nr_slab == 0 && zone->pages_scanned >=
+ (zone->nr_active + zone->nr_inactive) * 4)
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and
_

2005-03-16 12:23:33

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Andrea Arcangeli <[email protected]> wrote:
>
> - ret = dqstats.allocated_dquots;
> + ret = (dqstats.free_dquots / 100) * sysctl_vfs_cache_pressure;

Oh I see. Yes, using .allocated_dquots was wrong.

2005-03-16 12:30:43

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Andrea Arcangeli <[email protected]> wrote:
>
> This below is an untested attempt at bringing dquot a bit more in line
> with the API, to make the whole thing a bit more consistent,

Like this? (Noah, don't bother testing this one)



Fix some bugs spotted by Andrea Arcangeli <[email protected]>

- When we added /proc/sys/vm/vfs_cache_pressure we forgot to allow it to
tune the dquot and mbcache slabs as well.

- Reduce lock contention in shrink_dqcache_memory().

- Use dqstats.free_dquots in shrink_dqcache_memory(): this is the count of
reclaimable objects.

Signed-off-by: Andrew Morton <[email protected]>
---

25-akpm/fs/dquot.c | 12 +++++-------
25-akpm/fs/mbcache.c | 2 +-
2 files changed, 6 insertions(+), 8 deletions(-)

diff -puN fs/dquot.c~slab-shrinkers-use-vfs_cache_pressure fs/dquot.c
--- 25/fs/dquot.c~slab-shrinkers-use-vfs_cache_pressure 2005-03-16 04:22:01.000000000 -0800
+++ 25-akpm/fs/dquot.c 2005-03-16 04:27:09.000000000 -0800
@@ -505,14 +505,12 @@ static void prune_dqcache(int count)

static int shrink_dqcache_memory(int nr, unsigned int gfp_mask)
{
- int ret;
-
- spin_lock(&dq_list_lock);
- if (nr)
+ if (nr) {
+ spin_lock(&dq_list_lock);
prune_dqcache(nr);
- ret = dqstats.allocated_dquots;
- spin_unlock(&dq_list_lock);
- return ret;
+ spin_unlock(&dq_list_lock);
+ }
+ return (dqstats.free_dquots / 100) * sysctl_vfs_cache_pressure;
}

/*
diff -puN fs/mbcache.c~slab-shrinkers-use-vfs_cache_pressure fs/mbcache.c
--- 25/fs/mbcache.c~slab-shrinkers-use-vfs_cache_pressure 2005-03-16 04:22:01.000000000 -0800
+++ 25-akpm/fs/mbcache.c 2005-03-16 04:24:43.000000000 -0800
@@ -225,7 +225,7 @@ mb_cache_shrink_fn(int nr_to_scan, unsig
e_lru_list), gfp_mask);
}
out:
- return count;
+ return (count / 100) * sysctl_vfs_cache_pressure;
}


_

2005-03-16 18:39:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Wed, Mar 16, 2005 at 04:04:35AM -0800, Andrew Morton wrote:
> > + if (!reclaim_state->reclaimed_slab &&
> > + zone->pages_scanned >= (zone->nr_active +
> > + zone->nr_inactive) * 4)
> > zone->all_unreclaimable = 1;
>
> That might not change anything because we clear ->all_unreclaimable in
> free_page_bulk(). [..]

Really? free_page_bulk is called inside shrink_slab, and so it's overwritten
later by all_unreclaimable. Otherwise how could all_unreclaimable be set
in the first place if a single page freed by shrink_slab would be enough
to clear it?

  shrink_slab
    all_unreclaimable = 0
  zone->pages_scanned >= (zone->nr_active [..]
    all_unreclaimable = 1

  try_to_free_pages
    all_unreclaimable == 1
      oom

I also considered changing shrink_slab to return a progress retval, but
then I noticed I could get away with a one-liner fix ;).

Your fix is better but it should be mostly equivalent in practice. I
liked that the down_read_trylock failure path doesn't risk going oom; the
one-liner couldn't handle that ;).

thanks!

2005-03-16 22:22:09

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Andrea Arcangeli <[email protected]> wrote:
>
> On Wed, Mar 16, 2005 at 04:04:35AM -0800, Andrew Morton wrote:
> > > + if (!reclaim_state->reclaimed_slab &&
> > > + zone->pages_scanned >= (zone->nr_active +
> > > + zone->nr_inactive) * 4)
> > > zone->all_unreclaimable = 1;
> >
> > That might not change anything because we clear ->all_unreclaimable in
> > free_page_bulk(). [..]
>
> Really? free_page_bulk is called inside shrink_slab, and so it's overwritten
> later by all_unreclaimable. Otherwise how could all_unreclaimable be set
> in the first place if a single page freed by shrink_slab would be enough
> to clear it?
>
> shrink_slab
> all_unreclaimable = 0
> zone->pages_scanned >= (zone->nr_active [..]
> all_unreclaimable = 1
>
> try_to_free_pages
> all_unreclaimable == 1
> oom

Spose so.

> I also considering changing shrink_slab to return a progress retval, but
> then I noticed I could get away with a one liner fix ;).
>
> Your fix is better but it should be mostly equivalent in pratcie. I
> liked the dontrylock not risking to go oom, the one liner couldn't
> handle that ;).

It has a problem. If ZONE_DMA is really, really oom, kswapd will sit there
freeing up ZONE_NORMAL slab objects and not setting all_unreclaimable.
We'll end up using tons of CPU and reclaiming lots of slab in response to a
ZONE_DMA oom.

I'm thinking that the most accurate way of fixing this and also avoiding
the "we're fragmenting slab but not actually freeing pages yet" problem is


- change task_struct->reclaim_state so that it has an array of booleans
(one per zone)

- in kmem_cache_free, work out what zone the object corresponds to and
set the boolean in current->reclaim_state which corresponds to that zone.

- in balance_pgdat(), inspect this zone's boolean to see if we're making
any forward progress with slab freeing.

Probably we can do the work in kmem_cache_free() at the place where we
spill the slab magazine, to optimise things a bit. I haven't looked at it.

But that has a problem too. Some other task might be freeing objects into
the relevant zone instead of this one.

So maybe a better approach would be to add a "someone freed something"
counter to the zone structure. That would be incremented whenever anyone
frees a page for a slab object. Then in balance_pgdat we take a look at
that before and after performing the LRU and slab scans. If it
incremented, don't set all_unreclaimable. And still keep the
free_pages_bulk code there as the code which takes us _out_ of the
all_unreclaimable state.

It's tricky.


2005-03-18 16:18:40

by Noah Meyerhans

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

Hi Andrew, Andrea, et al. Sorry for taking a while to get back to you
on this. Thanks a lot for the work you've already put in to this. We
built a 2.6.11.4 kernel with Andrea's first patch for this problem (the
patch is included at the end of this mail, just to make sure you know
which one I'm referring to). We had also switched back to overcommit
mode 0. More comments follow inline...

On Tue, Mar 15, 2005 at 03:46:08PM -0800, Andrew Morton wrote:
> > Active:12382 inactive:280459 dirty:214 writeback:0 unstable:0 free:2299 slab:220221 mapped:12256 pagetables:122
>
> Vast amounts of slab - presumably inode and dentries.
>
> What sort of local filesystems are in use?

Well, that's certainly an interesting question. The filesystem is IBM's
JFS. If you tell me that's part of the problem, I'm not likely to
disagree. 8^)

> Can you take a copy of /proc/slabinfo when the backup has run for a while,
> send it?

We triggered a backup process, and I watched slabtop and /proc/meminfo
while it was running, right up until the time the OOM killer was
triggered. Unfortunately I didn't get a copy of slabinfo. Hopefully
the slabtop and meminfo output help a bit, though. Here are the last
three samples of /proc/meminfo:

Fri Mar 18 10:41:08 EST 2005
MemTotal: 2074660 kB
MemFree: 8492 kB
Buffers: 19552 kB
Cached: 1132916 kB
SwapCached: 3672 kB
Active: 55040 kB
Inactive: 1136024 kB
HighTotal: 1179072 kB
HighFree: 576 kB
LowTotal: 895588 kB
LowFree: 7916 kB
SwapTotal: 3615236 kB
SwapFree: 3609168 kB
Dirty: 68 kB
Writeback: 0 kB
Mapped: 43744 kB
Slab: 861952 kB
CommitLimit: 4652564 kB
Committed_AS: 53272 kB
PageTables: 572 kB
VmallocTotal: 114680 kB
VmallocUsed: 6700 kB
VmallocChunk: 107964 kB
Fri Mar 18 10:41:10 EST 2005
MemTotal: 2074660 kB
MemFree: 8236 kB
Buffers: 19512 kB
Cached: 1132884 kB
SwapCached: 3672 kB
Active: 54708 kB
Inactive: 1136288 kB
HighTotal: 1179072 kB
HighFree: 576 kB
LowTotal: 895588 kB
LowFree: 7660 kB
SwapTotal: 3615236 kB
SwapFree: 3609168 kB
Dirty: 68 kB
Writeback: 0 kB
Mapped: 43744 kB
Slab: 862216 kB
CommitLimit: 4652564 kB
Committed_AS: 53272 kB
PageTables: 572 kB
VmallocTotal: 114680 kB
VmallocUsed: 6700 kB
VmallocChunk: 107964 kB
<date wasn't inserted here because OOM killer killed it>
MemTotal: 2074660 kB
MemFree: 8620 kB
Buffers: 19388 kB
Cached: 1132552 kB
SwapCached: 3780 kB
Active: 56200 kB
Inactive: 1134388 kB
HighTotal: 1179072 kB
HighFree: 960 kB
LowTotal: 895588 kB
LowFree: 7660 kB
SwapTotal: 3615236 kB
SwapFree: 3609204 kB
Dirty: 104 kB
Writeback: 0 kB
Mapped: 43572 kB
Slab: 862484 kB
CommitLimit: 4652564 kB
Committed_AS: 53100 kB
PageTables: 564 kB
VmallocTotal: 114680 kB
VmallocUsed: 6700 kB
VmallocChunk: 107964 kB

Here are the top few entries from the last page of slabtop (columns:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME):
830696 830639 99% 0.80K 207674 4 830696K jfs_ip
129675 4841 3% 0.05K 1729 75 6916K buffer_head
39186 35588 90% 0.27K 2799 14 11196K radix_tree_node
5983 2619 43% 0.12K 193 31 772K size-128
4860 4728 97% 0.05K 60 81 240K journal_head
4403 4403 100% 0.03K 37 119 148K size-32
4164 4161 99% 1.00K 1041 4 4164K size-1024
3857 1552 40% 0.13K 133 29 532K dentry_cache
3355 1781 53% 0.06K 55 61 220K size-64
3103 3026 97% 0.04K 29 107 116K sysfs_dir_cache
2712 2412 88% 0.02K 12 226 48K dm_io
2712 2412 88% 0.02K 12 226 48K dm_tio
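
In other words, the jfs_ip cache alone accounts for roughly 830 MB of the
~862 MB reported in Slab: above, and it is 99% in use - so unlike the
dentry_cache example you gave, this doesn't look like internal
fragmentation; essentially all of lowmem is sitting in the JFS inode
cache.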



> Does increasing /proc/sys/vm/vfs_cache_pressure help? If you're watching
> /proc/meminfo you should be able to observe the effect of that upon the
> Slab: figure.

It doesn't have any noticeable effect on the stability of the machine. I
set it to 10000 but within a few hours the machine had crashed again.

We weren't able to capture all of the console messages prior to the
crash. Here are some of them. Note that, again, the last memory dump
was manually triggered via SysRq:

nactive:132kB present:16384kB pages_scanned:1589 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:3752kB min:3756kB low:4692kB high:5632kB active:9948kB inactive:9648kB present:901120kB pages_scanned:20640 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:960kB min:512kB low:640kB high:768kB active:45132kB inactive:1125920kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB0*4096kB = 3588kB
Normal: 0*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3752kB
HighMem: 92*4kB 10*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 960kB
Swap cache: add 2443, delete 1525, find 1049/1167, race 0+0
Free swap = 3609204kB
Total swap = 3615236kB
Out of Memory: Killed process 21640 (zsh).
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu:

cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

Free pages: 8812kB (1088kB HighMem)
Active:13647 inactive:283917 dirty:20 writeback:0 unstable:0 free:2203 slab:215622 mapped:10903 pagetables:141
DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:132kB present:16384kB pages_scanned:3084 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:4136kB min:3756kB low:4692kB high:5632kB active:9504kB inactive:9656kB present:901120kB pages_scanned:22772 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:1088kB min:512kB low:640kB high:768kB active:45084kB inactive:1125880kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB0*4096kB = 3588kB
Normal: 96*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4136kB
HighMem: 116*4kB 14*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1088kB
Swap cache: add 2470, delete 1525, find 1085/1207, race 0+0
Free swap = 3609204kB
Total swap = 3615236kB
Out of Memory: Killed process 21642 (sleep).
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1

Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16

Free pages: 8940kB (1216kB HighMem)
Active:13638 inactive:283923 dirty:27 writeback:0 unstable:0 free:2235 slab:215622 mapped:10880 pagetables:141
DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:132kB present:16384kB pages_scanned:3149 all_unreclaimable? yes
lowmem_reserve[]: 0 880 2031
Normal free:4136kB min:3756kB low:4692kB high:5632kB active:9520kB inactive:9668kB present:901120kB pages_scanned:22837 all_unreclaimable? yes
lowmem_reserve[]: 0 0 9212
HighMem free:1216kB min:512kB low:640kB high:768kB active:45032kB inactive:1125892kB present:1179136kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB0*4096kB = 3588kB
Normal: 96*4kB 1*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4136kB
HighMem: 146*4kB 15*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1216kB
Swap cache: add 2470, delete 1525, find 1095/1217, race 0+0
Free swap = 3609204kB
Total swap = 3615236kB




Andrea's patch, applied to the kernel we're currently running:
--- x/mm/vmscan.c.~1~ 2005-03-14 05:02:17.000000000 +0100
+++ x/mm/vmscan.c 2005-03-16 01:28:16.000000000 +0100
@@ -1074,8 +1074,9 @@ scan:
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
continue;
- if (zone->pages_scanned >= (zone->nr_active +
- zone->nr_inactive) * 4)
+ if (!reclaim_state->reclaimed_slab &&
+ zone->pages_scanned >= (zone->nr_active +
+ zone->nr_inactive) * 4)
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and

--
Noah Meyerhans System Administrator
MIT Computer Science and Artificial Intelligence Laboratory

2005-03-23 19:55:05

by Mingming Cao

[permalink] [raw]
Subject: OOM problems on 2.6.12-rc1 with many fsx tests

Andrea, Andrew,

I ran into the OOM problem again on 2.6.12-rc1. I ran some (20) fsx tests
on the 2.6.12-rc1 kernel (and 2.6.11-mm4) on an ext3 filesystem; after
about 10 hours the system hit OOM, and the OOM killer kept killing
processes one by one. I could reproduce this problem very consistently on
a 2-way PIII 700MHz machine with 512MB RAM. The problem could also be
reproduced by running the same test on reiserfs.

The fsx command is:

./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I also saw the fsx tests start generating reports about reading bad data
after the tests had run for about 9 hours (one hour before the OOM
happened).

I also logged /proc/meminfo every 30 seconds from a remote machine.
I will attach the OOM messages, plus /proc/meminfo at the beginning of
the test and at the end (when OOM killed the logging process), here.

When I ran the tests I didn't know whether this was a known issue, and I
don't know if 2.6.12-rc1 contains the proposed fix from Andrea/Andrew or
not. If you have something that I could try, let me know.

Thanks!

Mingming


============================================================
OOM messages on console
============================================================

Mar 23 02:16:16 elm3b92 kernel: oom-killer: gfp_mask=0x80d2
Mar 23 02:16:18 elm3b92 kernel: DMA per-cpu:
Mar 23 02:16:18 elm3b92 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 23 02:16:18 elm3b92 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 23 02:16:18 elm3b92 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 23 02:16:18 elm3b92 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 23 02:16:18 elm3b92 kernel: Normal per-cpu:
Mar 23 02:16:18 elm3b92 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 23 02:16:18 elm3b92 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 23 02:16:18 elm3b92 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 23 02:16:18 elm3b92 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 23 02:16:18 elm3b92 kernel: HighMem per-cpu: empty
Mar 23 02:16:19 elm3b92 kernel:
Mar 23 02:16:20 elm3b92 kernel: Free pages: 5208kB (0kB HighMem)
Mar 23 02:16:20 elm3b92 kernel: Active:60624 inactive:60838 dirty:0
writeback:0 unstable:0 free:1302 slab:3208 mapped:196 pagetables:105
Mar 23 02:16:20 elm3b92 kernel: DMA free:2072kB min:88kB low:108kB
high:132kB active:4776kB inactive:3908kB present:16384kB
pages_scanned:15150 all_unreclaimable? yes
Mar 23 02:16:20 elm3b92 kernel: lowmem_reserve[]: 0 496 496
Mar 23 02:16:20 elm3b92 kernel: Normal free:3136kB min:2804kB low:3504kB
high:4204kB active:237968kB inactive:239200kB present:507904kB
pages_scanned:15151 all_unreclaimable? no
Mar 23 02:16:20 elm3b92 kernel: lowmem_reserve[]: 0 0 0
Mar 23 02:16:20 elm3b92 kernel: HighMem free:0kB min:128kB low:160kB
high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
Mar 23 02:16:20 elm3b92 kernel: lowmem_reserve[]: 0 0 0
Mar 23 02:16:20 elm3b92 kernel: DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2072kB
Mar 23 02:16:20 elm3b92 kernel: Normal: 58*4kB 25*8kB 1*16kB 2*32kB
0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3200kB
Mar 23 02:16:20 elm3b92 kernel: HighMem: empty
Mar 23 02:16:20 elm3b92 kernel: Swap cache: add 284169, delete 283991,
find 105376/203154, race 2+25
Mar 23 02:16:20 elm3b92 kernel: Free swap = 1042692kB
Mar 23 02:16:20 elm3b92 kernel: Total swap = 1052216kB
Mar 23 02:16:20 elm3b92 kernel: Out of Memory: Killed process 2115
(cupsd).
Mar 23 02:21:01 elm3b92 kernel: oom-killer: gfp_mask=0x1d2
Mar 23 02:21:03 elm3b92 kernel: DMA per-cpu:
Mar 23 02:21:03 elm3b92 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 23 02:21:03 elm3b92 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 23 02:21:03 elm3b92 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 23 02:21:03 elm3b92 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 23 02:21:03 elm3b92 kernel: Normal per-cpu:
Mar 23 02:21:03 elm3b92 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 23 02:21:03 elm3b92 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 23 02:21:03 elm3b92 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 23 02:21:03 elm3b92 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 23 02:21:03 elm3b92 kernel: HighMem per-cpu: empty
Mar 23 02:21:04 elm3b92 kernel:
Mar 23 02:21:08 elm3b92 kernel: Free pages: 4872kB (0kB HighMem)
Mar 23 02:21:10 elm3b92 kernel: Active:61061 inactive:60495 dirty:0
writeback:30 unstable:0 free:1218 slab:3159 mapped:272 pagetables:86
Mar 23 02:21:11 elm3b92 kernel: DMA free:2072kB min:88kB low:108kB
high:132kB active:4832kB inactive:3852kB present:16384kB
pages_scanned:18465 all_unreclaimable? yes
Mar 23 02:21:11 elm3b92 kernel: lowmem_reserve[]: 0 496 496
Mar 23 02:21:11 elm3b92 kernel: Normal free:2800kB min:2804kB low:3504kB
high:4204kB active:239492kB inactive:237988kB present:507904kB
pages_scanned:107346 all_unreclaimable? no
Mar 23 02:21:11 elm3b92 kernel: lowmem_reserve[]: 0 0 0
Mar 23 02:21:11 elm3b92 kernel: DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2072kB
Mar 23 02:21:11 elm3b92 kernel: Normal: 2*4kB 7*8kB 3*16kB 2*32kB 0*64kB
1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2864kB
Mar 23 02:21:11 elm3b92 kernel: HighMem: empty
Mar 23 02:21:11 elm3b92 kernel: Swap cache: add 292112, delete 291918,
find 107914/208530, race 2+28
Mar 23 02:21:11 elm3b92 kernel: Free swap = 1044992kB
Mar 23 02:21:11 elm3b92 kernel: Total swap = 1052216kB
Mar 23 02:21:11 elm3b92 kernel: Out of Memory: Killed process 5638
(nscd).
Mar 23 02:21:19 elm3b92 kernel: oom-killer: gfp_mask=0x1d2
Mar 23 02:21:21 elm3b92 kernel: DMA per-cpu:
Mar 23 02:21:21 elm3b92 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 23 02:21:21 elm3b92 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 23 02:21:21 elm3b92 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 23 02:21:21 elm3b92 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 23 02:21:21 elm3b92 kernel: Normal per-cpu:
Mar 23 02:21:21 elm3b92 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 23 02:21:21 elm3b92 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 23 02:21:21 elm3b92 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 23 02:21:21 elm3b92 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 23 02:21:21 elm3b92 kernel: HighMem per-cpu: empty
Mar 23 02:21:21 elm3b92 kernel:
Mar 23 02:21:21 elm3b92 kernel: Free pages: 5296kB (0kB HighMem)
Mar 23 02:21:21 elm3b92 kernel: Active:60775 inactive:60649 dirty:28
writeback:0 unstable:0 free:1324 slab:3157 mapped:220 pagetables:81
Mar 23 02:21:21 elm3b92 kernel: DMA free:2072kB min:88kB low:108kB
high:132kB active:4864kB inactive:3820kB present:16384kB
pages_scanned:18855 all_unreclaimable? yes
Mar 23 02:21:21 elm3b92 kernel: lowmem_reserve[]: 0 496 496
Mar 23 02:21:21 elm3b92 kernel: Normal free:3224kB min:2804kB low:3504kB
high:4204kB active:238080kB inactive:238776kB present:507904kB
pages_scanned:52439 all_unreclaimable? no
Mar 23 02:21:21 elm3b92 kernel: lowmem_reserve[]: 0 0 0
Mar 23 02:21:21 elm3b92 kernel: HighMem free:0kB min:128kB low:160kB
high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
Mar 23 02:21:21 elm3b92 kernel: lowmem_reserve[]: 0 0 0
Mar 23 02:21:21 elm3b92 kernel: DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2072kB
Mar 23 02:21:21 elm3b92 kernel: Normal: 96*4kB 19*8kB 0*16kB 2*32kB
0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3288kB
Mar 23 02:21:21 elm3b92 kernel: HighMem: empty
Mar 23 02:21:21 elm3b92 kernel: Swap cache: add 293172, delete 292990,
find 108239/209189, race 2+31
Mar 23 02:21:21 elm3b92 kernel: Free swap = 1045004kB
Mar 23 02:21:21 elm3b92 kernel: Total swap = 1052216kB
Mar 23 02:21:21 elm3b92 kernel: Out of Memory: Killed process 2062
(portmap).



=======================================================
/proc/meminfo at the beginning of the fsx tests
=======================================================
+ cat /proc/meminfo
MemTotal: 510400 kB
MemFree: 397284 kB
Buffers: 23788 kB
Cached: 35712 kB
SwapCached: 0 kB
Active: 68592 kB
Inactive: 21312 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510400 kB
LowFree: 397284 kB
SwapTotal: 1052216 kB
SwapFree: 1052216 kB
Dirty: 2312 kB
Writeback: 0 kB
Mapped: 23500 kB
Slab: 16412 kB
CommitLimit: 1307416 kB
Committed_AS: 74804 kB
PageTables: 572 kB
VmallocTotal: 516024 kB
VmallocUsed: 3700 kB
VmallocChunk: 512320 kB
+ sleep 30

===========================================
/proc/meminfo at the end of the log file
===========================================
+ cat /proc/meminfo
MemTotal: 510400 kB
MemFree: 5512 kB
Buffers: 500 kB
Cached: 584 kB
SwapCached: 344 kB
Active: 243580 kB
Inactive: 242248 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510400 kB
LowFree: 5512 kB
SwapTotal: 1052216 kB
SwapFree: 1045984 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 804 kB
Slab: 12584 kB
CommitLimit: 1307416 kB
Committed_AS: 13560 kB
PageTables: 284 kB
VmallocTotal: 516024 kB
VmallocUsed: 3700 kB
VmallocChunk: 512320 kB
+ sleep 30




2005-03-23 22:50:25

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao <[email protected]> wrote:
>
> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> hours the system hit OOM, and OOM keep killing processes one by one.

I don't have a very good record reading these oom dumps lately, but this
one looks really weird. Basically no mapped memory, tons of pagecache on
the LRU.

It would be interesting if you could run the same test on 2.6.11.

2005-03-23 23:14:34

by Martin J. Bligh

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

>> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
>> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>> hours the system hit OOM, and OOM keep killing processes one by one.
>
> I don't have a very good record reading these oom dumps lately, but this
> one look really weird. Basically no mapped memory, tons of pagecache on
> the LRU.
>
> It would be interesting if you could run the same test on 2.6.11.

One thing I'm finding is that it's hard to backtrace who has each page
in this sort of situation. My plan is to write a debug patch to walk
mem_map and dump out some info on each page. I would appreciate ideas
on what info would be useful here. Some things are fairly obvious, like
we want to know if it's anon / mapped into address space (& which),
whether it's slab / buffers / pagecache etc ... any other suggestions
you have would be much appreciated.

I'm suspecting in many cases we don't keep enough info, and it would be
too slow to keep it in the default case - so I may need to add some
extra debug fields in struct page as a config option, but let's start
with what we have.

M.

2005-03-23 23:21:04

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

"Martin J. Bligh" <[email protected]> wrote:
>
> > It would be interesting if you could run the same test on 2.6.11.
>
> One thing I'm finding is that it's hard to backtrace who has each page
> in this sort of situation. My plan is to write a debug patch to walk
> mem_map and dump out some info on each page. I would appreciate ideas
> on what info would be useful here. Some things are fairly obvious, like
> we want to know if it's anon / mapped into address space (& which),
> whether it's slab / buffers / pagecache etc ... any other suggestions
> you have would be much appreciated.

You could use

page-owner-tracking-leak-detector.patch
make-page_owner-handle-non-contiguous-page-ranges.patch
add-gfp_mask-to-page-owner.patch

which sticks an 8-slot stack backtrace into each page, recording who
allocated it.

But that's probably not very interesting info for pagecache pages.

Nothing beats poking around in a dead machine's guts with kgdb though.

2005-03-23 23:27:07

by Andries Brouwer

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Wed, Mar 23, 2005 at 03:20:55PM -0800, Andrew Morton wrote:

> Nothing beats poking around in a dead machine's guts with kgdb though.

Everyone to his taste.

But I was surprised by

> SwapTotal: 1052216 kB
> SwapFree: 1045984 kB

Strange that processes are killed while lots of swap is available.

Andries

2005-03-23 23:31:10

by Martin J. Bligh

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

>> Nothing beats poking around in a dead machine's guts with kgdb though.
>
> Everyone his taste.
>
> But I was surprised by
>
>> SwapTotal: 1052216 kB
>> SwapFree: 1045984 kB
>
> Strange that processes are killed while lots of swap is available.

I don't think we're that smart about it. If we're really low on mem, it
seems we invoke the OOM killer whether processes are causing the problem
or not.

OTOH, if we can't free the kernel mem, we don't have much choice, but
it's not really helping much ;-)

M.

2005-03-23 23:43:44

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

"Martin J. Bligh" <[email protected]> wrote:
>
> >> Nothing beats poking around in a dead machine's guts with kgdb though.
> >
> > Everyone his taste.
> >
> > But I was surprised by
> >
> >> SwapTotal: 1052216 kB
> >> SwapFree: 1045984 kB
> >
> > Strange that processes are killed while lots of swap is available.
>
> I don't think we're that smart about it. If we're really low on mem, it
> seems we invoke the OOM killer whether processes are causing the problem
> or not.
>
> OTOH, if we can't free the kernel mem, we don't have much choice, but
> it's not really helping much ;-)
>

I'm suspecting here that we simply leaked a refcount on every darn
pagecache page in the machine. Note how mapped memory has shrunk down to
less than a megabyte and everything which can be swapped out has been
swapped out.

If so, then oom-killing everything in the world is pretty inevitable.

2005-03-23 23:51:38

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

Andrew Morton wrote:

> "Martin J. Bligh" <[email protected]> wrote:
>
>>>>Nothing beats poking around in a dead machine's guts with kgdb though.
>>>
>>>Everyone his taste.
>>>
>>>But I was surprised by
>>>
>>>
>>>>SwapTotal: 1052216 kB
>>>>SwapFree: 1045984 kB
>>>
>>>Strange that processes are killed while lots of swap is available.
>>
>>I don't think we're that smart about it. If we're really low on mem, it
>>seems we invoke the OOM killer whether processes are causing the problem
>>or not.
>>
>>OTOH, if we can't free the kernel mem, we don't have much choice, but
>>it's not really helping much ;-)
>>
>
>
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.

That makes sense. We have almost 485MB in active and inactive caches,
but we are not able to reclaim them :(

Active: 243580 kB
Inactive: 242248 kB

>
> If so, then oom-killing everything in the world is pretty inevitable.
>


2005-03-24 01:50:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> I'm suspecting here that we simply leaked a refcount on every darn
> pagecache page in the machine. Note how mapped memory has shrunk down to
> less than a megabyte and everything which can be swapped out has been
> swapped out.
>
> If so, then oom-killing everything in the world is pretty inevitable.

Agreed, it looks like a memleak of a page_count (while mapcount is fine).

I would suggest looking for pages that are part of the pagecache (i.e.
page->mapping not null) that have a mapcount of 0 and a page_count > 1;
almost all of them should be like that during the memleak, and almost
none should be like that before the memleak.
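
Something along these lines (a completely untested sketch using 2.6-era
helpers; num_physpages is only a rough bound for the pfn walk and the
scan is racy, but for a one-shot debugging module that should be ok)
would count the suspect pages:

/* pagecount_scan.c - untested debugging sketch: walk the pfns and count
 * pagecache pages (page->mapping set, not anon) with mapcount 0 but
 * page_count > 1, i.e. an extra reference with nobody mapping them. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>

static int __init pagecount_scan_init(void)
{
	unsigned long pfn, pagecache = 0, suspect = 0;

	for (pfn = 0; pfn < num_physpages; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		if (PageReserved(page) || PageSlab(page))
			continue;
		if (!page->mapping || PageAnon(page))
			continue;	/* not pagecache */
		pagecache++;
		if (page_mapcount(page) == 0 && page_count(page) > 1)
			suspect++;
	}
	printk(KERN_INFO "pagecache pages: %lu, suspect (mapcount 0, "
	       "count > 1): %lu\n", pagecache, suspect);
	return 0;
}

static void __exit pagecount_scan_exit(void)
{
}

module_init(pagecount_scan_init);
module_exit(pagecount_scan_exit);
MODULE_LICENSE("GPL");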

This seems unrelated to the bug that started the thread, which was clearly
a slab-shrinking issue and not a pagecache memleak.

Thanks.

2005-03-24 02:01:19

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Andrea Arcangeli <[email protected]> wrote:
>
> On Wed, Mar 23, 2005 at 03:42:32PM -0800, Andrew Morton wrote:
> > I'm suspecting here that we simply leaked a refcount on every darn
> > pagecache page in the machine. Note how mapped memory has shrunk down to
> > less than a megabyte and everything which can be swapped out has been
> > swapped out.
> >
> > If so, then oom-killing everything in the world is pretty inevitable.
>
> Agreed, it looks like a memleak of a page_count (while mapcount is fine).
>
> I would suggest looking after pages part of pagecache (i.e.
> page->mapcount not null) that have a mapcount of 0 and a page_count > 1,
> almost all of them should be like that during the memleak, and almost
> none should be like that before the memleak.
>
> This seems unrelated to the bug that started the thread that was clearly
> a slab shrinking issue and not a pagecache memleak.
>

The vmscan.c changes in -rc1 look harmless enough. That's assuming that
2.6.11 doesn't have the bug.

btw, that new orphaned-page handling code has a printk in it, and nobody
has reported it coming out yet...

2005-03-25 22:04:02

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao <[email protected]> wrote:
>
> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> hours the system hit OOM, and OOM keep killing processes one by one. I
> could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> 512MB RAM. Also the problem could be reproduced on running the same test
> on reiser fs.
>
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &

I was able to reproduce this on ext3. Seven instances of the above leaked
10-15MB over 10 hours. All of it permanently stuck on the LRU.

I'll continue to poke at it - see what kernel it started with, which
filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
quick process.

Given that you also saw it on reiserfs, it might be a bug in the core
mmap/truncate/unmap handling. We'll see.

> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

I haven't noticed anything like that.

2005-03-25 22:21:07

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I was able to reproduce this on ext3. Seven instances of the above leaked
> 10-15MB over 10 hours. All of it permanently stuck on the LRU.
>
> I'll continue to poke at it - see what kernel it started with, which
> filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> quick process.

I reproduced a *similar* issue with 2.6.11. The reason I say similar is
that there is no OOM kill, but free memory is very low and the machine
doesn't respond at all. (I booted my machine with 256M memory and ran 20
copies of fsx on ext3.)


Thanks,
Badari

2005-03-26 00:17:29

by Dave Jones

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:

> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
> I also see fsx tests start to generating report about read bad data
> about the tests have run for about 9 hours(one hour before of the OOM
> happen).

Is writing to the same testfile from multiple fsx's supposed to work?
It sounds like a surefire way to break the consistency checking that it does.
I'm surprised it lasts 9hrs before it breaks.

In the past I've done tests like..

for i in `seq 1 100`
do
        fsx foo$i &
done

to make each process use a different test file.

Dave

2005-03-26 00:27:04

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Fri, 2005-03-25 at 16:17, Dave Jones wrote:
> On Wed, Mar 23, 2005 at 11:53:04AM -0800, Mingming Cao wrote:
>
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I also see fsx tests start to generating report about read bad data
> > about the tests have run for about 9 hours(one hour before of the OOM
> > happen).
>
> Is writing to the same testfile from multiple fsx's supposed to work?
> It sounds like a surefire way to break the consistency checking that it does.
> I'm surprised it lasts 9hrs before it breaks.
>
> In the past I've done tests like..
>
> for i in `seq 1 100`
> do
> fsx foo$i &
> done
>
> to make each process use a different test file.
>

No, we are running them on different files - Mingming cut and pasted
only a single line from the script.

Thanks,
Badari

2005-03-27 00:24:08

by Mingming Cao

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
> On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> > Mingming Cao <[email protected]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> > I was able to reproduce this on ext3. Seven instances of the above leaked
> > 10-15MB over 10 hours. All of it permanently stuck on the LRU.
> >
> > I'll continue to poke at it - see what kernel it started with, which
> > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> > quick process.
>
> I reproduced *similar* issue with 2.6.11. The reason I say similar, is
> there is no OOM kill, but very low free memory and machine doesn't
> respond at all. (I booted my machine with 256M memory and ran 20 copies
> of fsx on ext3).
>
>

Yes, I re-ran the same test on 2.6.11 for 24 hours. Like Badari saw on
his machine, my machine did not go OOM on 2.6.11 and is still alive, but
free memory is very low (only 5MB free). Killing all the fsx instances
and unmounting the ext3 filesystem did not bring back much memory. I am
going to rerun the tests without the mapped reads/writes to see what
happens.
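(With the classic fsx this is typically done by adding -R and -W, which
restrict the test to read()/write() system calls and skip the mmap
paths; the exact switches can vary between fsx variants, so the line
below is only a sketch of the intended invocation.)

./fsx -c 10 -n -R -W -r 4096 -w 4096 /mnt/test/foo1 &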


2005-03-27 19:26:14

by Mingming Cao

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote:
> On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
> > On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
> > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > > on reiser fs.
> > > >
> > > > The fsx command is:
> > > >
> > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> > >
> > > I was able to reproduce this on ext3. Seven instances of the above leaked
> > > 10-15MB over 10 hours. All of it permanently stuck on the LRU.
> > >
> > > I'll continue to poke at it - see what kernel it started with, which
> > > filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
> > > quick process.
> >
> > I reproduced *similar* issue with 2.6.11. The reason I say similar, is
> > there is no OOM kill, but very low free memory and machine doesn't
> > respond at all. (I booted my machine with 256M memory and ran 20 copies
> > of fsx on ext3).
> >
> >
>
> Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on
> his machine, my machine did not go to OOM on 2.6.11,still alive, but
> memory is very low(only 5M free). Killed all fsx and umount the ext3
> filesystem did not bring back much memory. I will going to rerun the
> tests without the mapped read/write to see what happen.
>
>

Running the fsx tests without mapped IO on 2.6.11 seems fine. Here is
/proc/meminfo after an 18-hour run:

# cat /proc/meminfo
MemTotal: 510464 kB
MemFree: 6004 kB
Buffers: 179420 kB
Cached: 9144 kB
SwapCached: 0 kB
Active: 313236 kB
Inactive: 171380 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510464 kB
LowFree: 6004 kB
SwapTotal: 1052216 kB
SwapFree: 1052216 kB
Dirty: 2100 kB
Writeback: 0 kB
Mapped: 24884 kB
Slab: 14788 kB
CommitLimit: 1307448 kB
Committed_AS: 78032 kB
PageTables: 720 kB
VmallocTotal: 516024 kB
VmallocUsed: 1672 kB
VmallocChunk: 514352 kB

elm3b92:~ # killall -9 fsx
elm3b92:~ # cat /proc/meminfo
MemTotal: 510464 kB
MemFree: 21332 kB
Buffers: 179668 kB
Cached: 8828 kB
SwapCached: 0 kB
Active: 298748 kB
Inactive: 171152 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510464 kB
LowFree: 21332 kB
SwapTotal: 1052216 kB
SwapFree: 1052216 kB
Dirty: 1140 kB
Writeback: 0 kB
Mapped: 11648 kB
Slab: 14632 kB
CommitLimit: 1307448 kB
Committed_AS: 59800 kB
PageTables: 492 kB
VmallocTotal: 516024 kB
VmallocUsed: 1672 kB
VmallocChunk: 514352 kB

elm3b92:~ # umount /mnt/ext3
elm3b92:~ # cat /proc/meminfo
MemTotal: 510464 kB
MemFree: 181636 kB
Buffers: 22092 kB
Cached: 6740 kB
SwapCached: 0 kB
Active: 151284 kB
Inactive: 158948 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510464 kB
LowFree: 181636 kB
SwapTotal: 1052216 kB
SwapFree: 1052216 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 11656 kB
Slab: 14052 kB
CommitLimit: 1307448 kB
Committed_AS: 59800 kB
PageTables: 492 kB
VmallocTotal: 516024 kB
VmallocUsed: 1672 kB
VmallocChunk: 514352 kB


2005-03-27 20:07:07

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao wrote:

> On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote:
>
>>On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
>>
>>>On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
>>>
>>>>Mingming Cao <[email protected]> wrote:
>>>>
>>>>>I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
>>>>>2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>>>>>hours the system hit OOM, and OOM keep killing processes one by one. I
>>>>>could reproduce this problem very constantly on a 2 way PIII 700MHZ with
>>>>>512MB RAM. Also the problem could be reproduced on running the same test
>>>>>on reiser fs.
>>>>>
>>>>>The fsx command is:
>>>>>
>>>>>./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>>>>
>>>>I was able to reproduce this on ext3. Seven instances of the above leaked
>>>>10-15MB over 10 hours. All of it permanently stuck on the LRU.
>>>>
>>>>I'll continue to poke at it - see what kernel it started with, which
>>>>filesystems it affects, whether it happens on UP&&!PREEMPT, etc. Not a
>>>>quick process.
>>>
>>>I reproduced *similar* issue with 2.6.11. The reason I say similar, is
>>>there is no OOM kill, but very low free memory and machine doesn't
>>>respond at all. (I booted my machine with 256M memory and ran 20 copies
>>>of fsx on ext3).
>>>
>>>
>>
>>Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on
>>his machine, my machine did not go to OOM on 2.6.11,still alive, but
>>memory is very low(only 5M free). Killed all fsx and umount the ext3
>>filesystem did not bring back much memory. I will going to rerun the
>>tests without the mapped read/write to see what happen.
>>
>>
>
>
> Run fsx tests without mapped IO on 2.6.11 seems fine. Here is
> the /proc/meminfo after 18 hours run:

Mingming, please reproduce it on 2.6.11 with the mapped IO tests. That
will tell us when the regression started.

Thanks,
Badari

2005-03-27 20:17:19

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

Badari Pulavarty wrote:

> Mingming Cao wrote:
>
>> On Sat, 2005-03-26 at 16:23 -0800, Mingming Cao wrote:
>>
>>> On Fri, 2005-03-25 at 14:11 -0800, Badari Pulavarty wrote:
>>>
>>>> On Fri, 2005-03-25 at 13:56, Andrew Morton wrote:
>>>>
>>>>> Mingming Cao <[email protected]> wrote:
>>>>>
>>>>>> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx
>>>>>> tests on
>>>>>> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>>>>>> hours the system hit OOM, and OOM keep killing processes one by
>>>>>> one. I
>>>>>> could reproduce this problem very constantly on a 2 way PIII
>>>>>> 700MHZ with
>>>>>> 512MB RAM. Also the problem could be reproduced on running the
>>>>>> same test
>>>>>> on reiser fs.
>>>>>>
>>>>>> The fsx command is:
>>>>>>
>>>>>> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>>>>>
>>>>>
>>>>> I was able to reproduce this on ext3. Seven instances of the above
>>>>> leaked
>>>>> 10-15MB over 10 hours. All of it permanently stuck on the LRU.
>>>>>
>>>>> I'll continue to poke at it - see what kernel it started with, which
>>>>> filesystems it affects, whether it happens on UP&&!PREEMPT, etc.
>>>>> Not a
>>>>> quick process.
>>>>
>>>>
>>>> I reproduced *similar* issue with 2.6.11. The reason I say similar, is
>>>> there is no OOM kill, but very low free memory and machine doesn't
>>>> respond at all. (I booted my machine with 256M memory and ran 20 copies
>>>> of fsx on ext3).
>>>>
>>>>
>>>
>>> Yes, I re-run the same test on 2.6.11 for 24 hours, like Badari see on
>>> his machine, my machine did not go to OOM on 2.6.11,still alive, but
>>> memory is very low(only 5M free). Killed all fsx and umount the ext3
>>> filesystem did not bring back much memory. I will going to rerun the
>>> tests without the mapped read/write to see what happen.
>>>
>>>
>>
>>
>> Run fsx tests without mapped IO on 2.6.11 seems fine. Here is
>> the /proc/meminfo after 18 hours run:
>
>
> Mingming, Reproduce it on 2.6.11 with mapped IO tests. That will tell
> us when the regression started.

Sorry - ignore my request; Mingming has already done this work and
posted the result.

Thanks,
Badari

2005-04-04 01:36:39

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao <[email protected]> wrote:
>
> I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> hours the system hit OOM, and OOM keep killing processes one by one. I
> could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> 512MB RAM. Also the problem could be reproduced on running the same test
> on reiser fs.
>
> The fsx command is:
>
> ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &


This ext3 bug goes all the way back to 2.6.6.

I don't know yet why you saw problems with reiser3 and I'm pretty sure I
saw problems with ext2. More testing is needed there.

Without the below patch it's possible to make ext3 leak at around a
megabyte per minute by arranging for the fs to run a commit every 50
milliseconds, btw.

(Stephen, please review...)



This fixes the lots-of-fsx-linux-instances-cause-a-slow-leak bug.

It's been there since 2.6.6, caused by:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm4/broken-out/jbd-move-locked-buffers.patch

That patch moves under-writeout ordered-data buffers onto a separate journal
list during commit. It took out the old code which was based on a single
list.

The old code (necessarily) had logic which would restart I/O against buffers
which had been redirtied while they were on the committing transaction's
t_sync_datalist list. The new code only writes buffers once, ignoring
redirtyings by a later transaction, which is good.

But over on the truncate side of things, in journal_unmap_buffer(), we're
treating buffers on the t_locked_list as inviolable things which belong to the
committing transaction, and we just leave them alone during concurrent
truncate-vs-commit.

The net effect is that when truncate tries to invalidate a page whose buffers
are on t_locked_list and have been redirtied, journal_unmap_buffer() just
leaves those buffers alone. truncate will remove the page from its mapping
and we end up with an anonymous clean page with dirty buffers, which is an
illegal state for a page. The JBD commit will not clean those buffers as they
are removed from t_locked_list. The VM (try_to_free_buffers) cannot reclaim
these pages.

The patch teaches journal_unmap_buffer() about buffers which are on the
committing transaction's t_locked_list. These buffers have been written and
I/O has completed. We can take them off the transaction and undirty them
within the context of journal_invalidatepage()->journal_unmap_buffer().

Signed-off-by: Andrew Morton <[email protected]>
---

25-akpm/fs/jbd/transaction.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-dirty-buffer-leak-fix 2005-04-03 15:12:12.000000000 -0700
+++ 25-akpm/fs/jbd/transaction.c 2005-04-03 15:14:40.000000000 -0700
@@ -1812,7 +1812,17 @@ static int journal_unmap_buffer(journal_
}
}
} else if (transaction == journal->j_committing_transaction) {
- /* If it is committing, we simply cannot touch it. We
+ if (jh->b_jlist == BJ_Locked) {
+ /*
+ * The buffer is on the committing transaction's locked
+ * list. We have the buffer locked, so I/O has
+ * completed. So we can nail the buffer now.
+ */
+ may_free = __dispose_buffer(jh, transaction);
+ goto zap_buffer;
+ }
+ /*
+ * If it is committing, we simply cannot touch it. We
* can remove it's next_transaction pointer from the
* running transaction if that is set, but nothing
* else. */
@@ -1887,7 +1897,6 @@ int journal_invalidatepage(journal_t *jo
unsigned int next_off = curr_off + bh->b_size;
next = bh->b_this_page;

- /* AKPM: doing lock_buffer here may be overly paranoid */
if (offset <= curr_off) {
/* This block is wholly outside the truncation point */
lock_buffer(bh);
_

2005-04-04 16:56:56

by Mingming Cao

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > hours the system hit OOM, and OOM keep killing processes one by one. I
> > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > 512MB RAM. Also the problem could be reproduced on running the same test
> > on reiser fs.
> >
> > The fsx command is:
> >
> > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>
>
> This ext3 bug goes all the way back to 2.6.6.

> I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> saw problems with ext2. More testing is needed there.
>

We (Janet and I) are chasing this bug as well. Janet is able to
reproduce this bug on 2.6.9 but I can't. Glad to know you have nailed
down this issue on ext3. I am pretty sure I saw this on Reiser3 once; I
will double-check it. I will try your patch today.

Thanks,
Mingming

2005-04-04 20:14:10

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao <[email protected]> wrote:
>
> On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> > Mingming Cao <[email protected]> wrote:
> > >
> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > on reiser fs.
> > >
> > > The fsx command is:
> > >
> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >
> >
> > This ext3 bug goes all the way back to 2.6.6.
>
> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> > saw problems with ext2. More testing is needed there.
> >
>
> We (Janet and I) are chasing this bug as well. Janet is able to
> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> double check it. Will try your patch today.

There's a second leak, with similar-looking symptoms. At ~50
commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
than a hundredth the rate of the first one.
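(For scale: the first leak ran at around a megabyte per minute, i.e.
roughly 1.4GB per day, so ~10MB in 24 hours is indeed well under a
hundredth of that rate.)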

2005-04-04 20:35:02

by Andrew Morton

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

"Martin J. Bligh" <[email protected]> wrote:
>
> >> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> >> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> >> > > hours the system hit OOM, and OOM keep killing processes one by one. I
> >> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> >> > > 512MB RAM. Also the problem could be reproduced on running the same test
> >> > > on reiser fs.
> >> > >
> >> > > The fsx command is:
> >> > >
> >> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> >> >
> >> >
> >> > This ext3 bug goes all the way back to 2.6.6.
> >>
> >> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> >> > saw problems with ext2. More testing is needed there.
> >> >
> >>
> >> We (Janet and I) are chasing this bug as well. Janet is able to
> >> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> >> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> >> double check it. Will try your patch today.
> >
> > There's a second leak, with similar-looking symptoms. At ~50
> > commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> > than a hundredth the rate of the first one.
>
> What are you using to see these with, just kgdb, and a large cranial
> capacity? Or is there some more magic?
>

Nothing magical: run the test for a while, kill everything, cause a huge
swapstorm, then look at the meminfo numbers. If active+inactive is
significantly larger than cached+buffers+swapcached+mapped+minus-a-bit
then it's leaked.

Right now I have:

MemTotal: 246264 kB
MemFree: 196148 kB
Buffers: 4200 kB
Cached: 3308 kB
SwapCached: 8064 kB
Active: 21548 kB
Inactive: 12532 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 246264 kB
LowFree: 196148 kB
SwapTotal: 1020116 kB
SwapFree: 1001612 kB
Dirty: 60 kB
Writeback: 0 kB
Mapped: 2284 kB
Slab: 12200 kB
CommitLimit: 1143248 kB
Committed_AS: 34004 kB
PageTables: 1200 kB
VmallocTotal: 774136 kB
VmallocUsed: 82832 kB
VmallocChunk: 691188 kB
HugePages_Total: 0
HugePages_Free: 0

33 megs on the LRU, unaccounted for in other places.

Once the leak is nice and large I can start a new swapstorm, set a
breakpoint in try_to_free_buffers() (for example) and start looking at the
state of the page and its buffers.
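
A rough way to script that check (the field list is an assumption based
on the categories named above, and the "minus-a-bit" slack is ignored -
it simply reports the raw delta, all values in kB):

# Print Active+Inactive minus the accounted-for categories from
# /proc/meminfo; a large and steadily growing delta is the leak
# signature described above.
awk '/^(Active|Inactive|Buffers|Cached|SwapCached|Mapped):/ { v[$1] = $2 }
     END {
        lru = v["Active:"] + v["Inactive:"]
        acct = v["Buffers:"] + v["Cached:"] + v["SwapCached:"] + v["Mapped:"]
        printf "lru=%dkB accounted=%dkB delta=%dkB\n", lru, acct, lru - acct
     }' /proc/meminfo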


2005-04-04 20:35:02

by Martin J. Bligh

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

>> > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
>> > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
>> > > hours the system hit OOM, and OOM keep killing processes one by one. I
>> > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
>> > > 512MB RAM. Also the problem could be reproduced on running the same test
>> > > on reiser fs.
>> > >
>> > > The fsx command is:
>> > >
>> > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
>> >
>> >
>> > This ext3 bug goes all the way back to 2.6.6.
>>
>> > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
>> > saw problems with ext2. More testing is needed there.
>> >
>>
>> We (Janet and I) are chasing this bug as well. Janet is able to
>> reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
>> this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
>> double check it. Will try your patch today.
>
> There's a second leak, with similar-looking symptoms. At ~50
> commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> than a hundredth the rate of the first one.

What are you using to see these with, just kgdb, and a large cranial
capacity? Or is there some more magic?

m.

2005-04-05 16:44:47

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: OOM problems on 2.6.12-rc1 with many fsx tests

Hi,

On Mon, 2005-04-04 at 02:35, Andrew Morton wrote:

> Without the below patch it's possible to make ext3 leak at around a
> megabyte per minute by arranging for the fs to run a commit every 50
> milliseconds, btw.

Ouch!
> (Stephen, please review...)

Doing so now.

> The patch teaches journal_unmap_buffer() about buffers which are on the
> committing transaction's t_locked_list. These buffers have been written and
> I/O has completed.

Agreed. The key here is that the buffer is locked before
journal_unmap_buffer() is called, so we can indeed rely on it being
safely on disk.

> We can take them off the transaction and undirty them
> within the context of journal_invalidatepage()->journal_unmap_buffer().

Right - the committing transaction can't be doing any more writes, and
the current transaction has explicitly told us to throw away its own
writes if we get here. Unfiling the buffer should be safe.

> + if (jh->b_jlist == BJ_Locked) {
> + /*
> + * The buffer is on the committing transaction's locked
> + * list. We have the buffer locked, so I/O has
> + * completed. So we can nail the buffer now.
> + */
> + may_free = __dispose_buffer(jh, transaction);
> + goto zap_buffer;
> + }

ACK.

--Stephen

2005-04-05 17:26:50

by Mingming Cao

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

On Mon, 2005-04-04 at 13:04 -0700, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > On Sun, 2005-04-03 at 18:35 -0700, Andrew Morton wrote:
> > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > I run into OOM problem again on 2.6.12-rc1. I run some(20) fsx tests on
> > > > 2.6.12-rc1 kernel(and 2.6.11-mm4) on ext3 filesystem, after about 10
> > > > hours the system hit OOM, and OOM keep killing processes one by one. I
> > > > could reproduce this problem very constantly on a 2 way PIII 700MHZ with
> > > > 512MB RAM. Also the problem could be reproduced on running the same test
> > > > on reiser fs.
> > > >
> > > > The fsx command is:
> > > >
> > > > ./fsx -c 10 -n -r 4096 -w 4096 /mnt/test/foo1 &
> > >
> > >
> > > This ext3 bug goes all the way back to 2.6.6.
> >
> > > I don't know yet why you saw problems with reiser3 and I'm pretty sure I
> > > saw problems with ext2. More testing is needed there.
> > >
> >
> > We (Janet and I) are chasing this bug as well. Janet is able to
> > reproduce this bug on 2.6.9 but I can't. Glad to know you have nail down
> > this issue on ext3. I am pretty sure I saw this on Reiser3 once, I will
> > double check it. Will try your patch today.
>
> There's a second leak, with similar-looking symptoms. At ~50
> commits/second it has leaked ~10MB in 24 hours, so it's very slow - less
> than a hundredth the rate of the first one.
>

I ran the test (20 instances of fsx) with your patch on 2.6.12-rc1 with
512MB RAM (where I was able to consistently re-create the memory leak
and drive the system to OOM before). The result is that the kernel did
not get into OOM after about 19 hours (before, it took about 9 hours or
so), and the system is still responsive. However, I did notice a ~60MB
delta between Active+Inactive and Buffers+Cached+SwapCached+Mapped+Slab.

Here is the current /proc/meminfo

elm3b92:~ # cat /proc/meminfo
MemTotal: 510400 kB
MemFree: 97004 kB
Buffers: 196772 kB
Cached: 77608 kB
SwapCached: 0 kB
Active: 299064 kB
Inactive: 83140 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 510400 kB
LowFree: 97004 kB
SwapTotal: 1052216 kB
SwapFree: 1052216 kB
Dirty: 1600 kB
Writeback: 0 kB
Mapped: 23256 kB
Slab: 24548 kB
CommitLimit: 1307416 kB
Committed_AS: 73560 kB
PageTables: 532 kB
VmallocTotal: 516024 kB
VmallocUsed: 3700 kB
VmallocChunk: 512320 kB
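
(Working the numbers above: Active+Inactive = 382204 kB, while
Buffers+Cached+SwapCached+Mapped+Slab = 322184 kB, i.e. a delta of
60020 kB - consistent with the ~60MB figure.)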



2005-04-06 01:24:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

Mingming Cao <[email protected]> wrote:
>
> I run the test(20 instances of fsx) with your patch on 2.6.12-rc1 with
> 512MB RAM (where I were able to constantly re-create the mem leak and
> lead to OOM before). The result is the kernel did not get into OOM after
> about 19 hours(before it took about 9 hours or so), system is still
> responsive. However I did notice about ~60MB delta between Active
> +Inactive and Buffers+cached+Swapcached+Mapped+Slab

yes.

Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
be 100x slower. Which is fortunate because this one is going to take a
long time to fix. I'll poke at it some more.

2005-04-12 13:02:28

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] Re: OOM problems on 2.6.12-rc1 with many fsx tests

Hi,

On Wed, 2005-04-06 at 02:23, Andrew Morton wrote:

> Nobody has noticed the now-fixed leak since 2.6.6 and this one appears to
> be 100x slower. Which is fortunate because this one is going to take a
> long time to fix. I'll poke at it some more.

OK, I'm now at the stage where I can kick off that fsx test on a kernel
without your leak fix, kill it, umount and get

Whoops: found 43 unfreeable buffers still on the superblock debug list
for sb 00000100296b2d48. Tracing one...
buffer trace for buffer at 0x000001003edaa9c8 (I am CPU 0)
...

with a trace pointing to journal_unmap_buffer(). I'll try with the fix
in place to see if there are any other cases showing up with the same
problem.

--Stephen

2005-04-13 13:46:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Fri, Mar 18, 2005 at 11:12:18AM -0500, Noah Meyerhans wrote:
> Well, that's certainly an interesting question. The filesystem is IBM's
> JFS. If you tell me that's part of the problem, I'm not likely to
> disagree. 8^)

It would be nice if you could reproduce with ext3 or reiserfs (if with
ext3, after applying the memleak fix from Andrew that was found in this
same thread ;). The below makes it look like a jfs problem.

830696 830639 99% 0.80K 207674 4 830696K jfs_ip
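
That looks like slabtop-style output; to run the same check on an
affected box, something like the following should do it (slabtop and
/proc/slabinfo are standard tools, though the exact column layout varies
by kernel version):

grep jfs_ip /proc/slabinfo    # object counts and size for the JFS inode cache
slabtop -o | head -20         # one-shot view of the largest slab caches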

2005-04-14 16:59:15

by Noah Meyerhans

[permalink] [raw]
Subject: Re: OOM problems with 2.6.11-rc4

On Wed, Apr 13, 2005 at 03:47:40PM +0200, Andrea Arcangeli wrote:
> On Fri, Mar 18, 2005 at 11:12:18AM -0500, Noah Meyerhans wrote:
> > Well, that's certainly an interesting question. The filesystem is IBM's
> > JFS. If you tell me that's part of the problem, I'm not likely to
> > disagree. 8^)
>
> It would be nice if you could reproduce with ext3 or reiserfs (if with
> ext3, after applying the memleak fix from Andrew that was found in this
> same thread ;). The below make it look like a jfs problem.
>
> 830696 830639 99% 0.80K 207674 4 830696K jfs_ip

I'll see what I can do. It may be difficult to move all the data to a
different filesystem. There are multiple terabytes in use.

I'll refer the JFS developers to this thread, too; they may be able to
shed some light on it.

Thanks.
noah

--
Noah Meyerhans System Administrator
MIT Computer Science and Artificial Intelligence Laboratory

