2002-09-17 04:41:21

by William Lee Irwin III

Subject: dbench on tmpfs OOM's

My machine OOM'd during a run of 2 simultaneous dbench 512's on 2
separate 12GB tmpfs fs's on a 32x NUMA-Q with 32GB of RAM, 2.5.35
plus minimalistic booting fixes, no swap, and my recently posted false
OOM fix (though it is perhaps not the most desirable fix).

Note: gfp_mask == __GFP_FS | __GFP_HIGHIO | __GFP_IO | __GFP_WAIT
... the __GFP_FS checks can't save us here.
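
For reference, the gfp_mask=464 in the trace below decodes to exactly that
flag list, i.e. a plain GFP_USER allocation. This assumes the 2.5.35-era
flag values from include/linux/gfp.h; quick userspace check, not kernel
code:

#include <stdio.h>

/* assumed 2.5.35 values from include/linux/gfp.h */
#define __GFP_WAIT	0x10
#define __GFP_IO	0x40
#define __GFP_HIGHIO	0x80
#define __GFP_FS	0x100

int main(void)
{
	unsigned int mask = __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS;

	/* prints "464 (0x1d0)": the gfp_mask seen in the backtrace */
	printf("%u (0x%x)\n", mask, mask);
	return 0;
}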


Cheers,
Bill

#1 0xc013ce2e in oom_kill () at oom_kill.c:182
#2 0xc013cee8 in out_of_memory () at oom_kill.c:250
#3 0xc0138bec in try_to_free_pages (classzone=0xc0389300, gfp_mask=464,
order=0) at vmscan.c:561
#4 0xc013989b in balance_classzone (classzone=0xc0389300, gfp_mask=464,
order=0, freed=0xf6169ed0) at page_alloc.c:278
#5 0xc0139b97 in __alloc_pages (gfp_mask=464, order=0, zonelist=0xc02a3e94)
at page_alloc.c:401
#6 0xc013cb57 in alloc_pages_pgdat (pgdat=0xc0389000, gfp_mask=464, order=0)
at numa.c:77
#7 0xc013cba3 in _alloc_pages (gfp_mask=464, order=0) at numa.c:105
#8 0xc0139c23 in get_zeroed_page (gfp_mask=45) at page_alloc.c:442
#9 0xc013db12 in shmem_getpage_locked (info=0xd950f900, inode=0xd950f978,
idx=16) at shmem.c:205
#10 0xc013e6ea in shmem_file_write (file=0xf45dd6c0,
buf=0x804bda1 '\001' <repeats 200 times>..., count=65474, ppos=0xf45dd6e0)
at shmem.c:957
#11 0xc014553c in vfs_write (file=0xf45dd6c0,
buf=0x804bda0 '\001' <repeats 200 times>..., count=65475, pos=0xf45dd6e0)
at read_write.c:216
#12 0xc014561e in sys_write (fd=6,
buf=0x804bda0 '\001' <repeats 200 times>..., count=65475)
at read_write.c:246
#13 0xc010771f in syscall_call () at process.c:966

MemTotal: 32107256 kB
MemFree: 27564648 kB
MemShared: 0 kB
Cached: 4196528 kB
SwapCached: 0 kB
Active: 1924400 kB
Inactive: 2381404 kB
HighTotal: 31588352 kB
HighFree: 27563224 kB
LowTotal: 518904 kB
LowFree: 1424 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Committed_AS: 4330600 kB
PageTables: 12804 kB
ReverseMaps: 133841
TLB flushes: 1709
non flushes: 1093


2002-09-17 04:54:02

by Andrew Morton

Subject: Re: dbench on tmpfs OOM's

William Lee Irwin III wrote:
>
> ...
> MemTotal: 32107256 kB
> MemFree: 27564648 kB

I'd be suspecting that your node fallback is bust.

Suggest you add a call to show_free_areas() somewhere; consider
exposing the full per-zone status via /proc with a proper patch.

2002-09-17 04:58:32

by Martin J. Bligh

Subject: Re: dbench on tmpfs OOM's

>> ...
>> MemTotal: 32107256 kB
>> MemFree: 27564648 kB
>
> I'd be suspecting that your node fallback is bust.
>
> Suggest you add a call to show_free_areas() somewhere; consider
> exposing the full per-zone status via /proc with a proper patch.

Won't /proc/meminfo.numa show that? Or do you mean something
else by "full per-zone status"?

Looks to me like it's just out of low memory:

> LowFree: 1424 kB

There is no low memory on anything but node 0 ...

M.

2002-09-17 05:09:12

by Andrew Morton

Subject: Re: dbench on tmpfs OOM's

"Martin J. Bligh" wrote:
>
> >> ...
> >> MemTotal: 32107256 kB
> >> MemFree: 27564648 kB
> >
> > I'd be suspecting that your node fallback is bust.
> >
> > Suggest you add a call to show_free_areas() somewhere; consider
> > exposing the full per-zone status via /proc with a proper patch.
>
> Won't /proc/meminfo.numa show that? Or do you mean something
> else by "full per-zone status"?

meminfo.what? Remember when I suggested that you put
a testing mode into the numa code so that mortals could
run numa builds on non-numa boxes?


> Looks to me like it's just out of low memory:
>
> > LowFree: 1424 kB
>
> There is no low memory on anything but node 0 ...
>

It was a GFP_HIGH allocation - just pagecache.

2002-09-17 05:15:09

by Martin J. Bligh

Subject: Re: dbench on tmpfs OOM's

> meminfo.what? Remember when I suggested that you put
> a testing mode into the numa code so that mortals could
> run numa builds on non-numa boxes?

NUMA aware meminfo is one of the patches you have sitting
in your tree. I haven't got around to the NUMA-sim yet ...
maybe after Halloween when management stop asking me to
get other bits of code in before the freeze ;-)

mbligh@larry:~$ cat /proc/meminfo.numa

Node 0 MemTotal: 4194304 kB
Node 0 MemFree: 3420660 kB
Node 0 MemUsed: 773644 kB
Node 0 HighTotal: 3418112 kB
Node 0 HighFree: 2737992 kB
Node 0 LowTotal: 776192 kB
Node 0 LowFree: 682668 kB

Node 1 MemTotal: 4147200 kB
Node 1 MemFree: 4116444 kB
Node 1 MemUsed: 30756 kB
Node 1 HighTotal: 4147200 kB
Node 1 HighFree: 4116444 kB
Node 1 LowTotal: 0 kB
Node 1 LowFree: 0 kB

Node 2 MemTotal: 4147200 kB
Node 2 MemFree: 4131816 kB
Node 2 MemUsed: 15384 kB
Node 2 HighTotal: 4147200 kB
Node 2 HighFree: 4131816 kB
Node 2 LowTotal: 0 kB
Node 2 LowFree: 0 kB

Node 3 MemTotal: 4147200 kB
Node 3 MemFree: 4128432 kB
Node 3 MemUsed: 18768 kB
Node 3 HighTotal: 4147200 kB
Node 3 HighFree: 4128432 kB
Node 3 LowTotal: 0 kB
Node 3 LowFree: 0 kB

>> Looks to me like it's just out of low memory:
>>
>> > LowFree: 1424 kB
>>
>> There is no low memory on anything but node 0 ...
>
> It was a GFP_HIGH allocation - just pagecache.

Ah, but what does a balance_classzone do on a NUMA box?
Once you've finished rototilling the code you're looking
at, I think we might have a better clue what it's supposed
to do, at least ...

M.

2002-09-17 05:15:11

by William Lee Irwin III

Subject: Re: dbench on tmpfs OOM's

William Lee Irwin III wrote:
>> MemTotal: 32107256 kB
>> MemFree: 27564648 kB

On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> I'd be suspecting that your node fallback is bust.
> Suggest you add a call to show_free_areas() somewhere; consider
> exposing the full per-zone status via /proc with a proper patch.

I went through the nodes by hand. It's just a run of the mill
ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
the highmem zones were anywhere near ->pages_low.


Cheers,
Bill

2002-09-17 05:27:08

by Andrew Morton

Subject: Re: dbench on tmpfs OOM's

William Lee Irwin III wrote:
>
> William Lee Irwin III wrote:
> >> MemTotal: 32107256 kB
> >> MemFree: 27564648 kB
>
> On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> > I'd be suspecting that your node fallback is bust.
> > Suggest you add a call to show_free_areas() somewhere; consider
> > exposing the full per-zone status via /proc with a proper patch.
>
> I went through the nodes by hand. It's just a run of the mill
> ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
> the highmem zones were anywhere near ->pages_low.
>

erk. Why is shmem using GFP_USER?

mnm:/usr/src/25> grep page_address mm/shmem.c
mnm:/usr/src/25>

2002-09-17 06:34:09

by Christoph Rohland

Subject: Re: dbench on tmpfs OOM's

Hi Andrew,

Andrew Morton wrote:
> William Lee Irwin III wrote:
>>I went through the nodes by hand. It's just a run of the mill
>>ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
>>the highmem zones were anywhere near ->pages_low.
>>
>>
>
> erk. Why is shmem using GFP_USER?
>
> mnm:/usr/src/25> grep page_address mm/shmem.c

For inode and page vector allocation.

Greetings
Christoph



2002-09-17 06:55:47

by Hugh Dickins

Subject: Re: dbench on tmpfs OOM's

On Mon, 16 Sep 2002, Andrew Morton wrote:
> William Lee Irwin III wrote:
> >
> > William Lee Irwin III wrote:
> > >> MemTotal: 32107256 kB
> > >> MemFree: 27564648 kB
> >
> > On Mon, Sep 16, 2002 at 09:58:43PM -0700, Andrew Morton wrote:
> > > I'd be suspecting that your node fallback is bust.
> > > Suggest you add a call to show_free_areas() somewhere; consider
> > > exposing the full per-zone status via /proc with a proper patch.
> >
> > I went through the nodes by hand. It's just a run of the mill
> > ZONE_NORMAL OOM coming out of the GFP_USER allocation. None of
> > the highmem zones were anywhere near ->pages_low.
> >
>
> erk. Why is shmem using GFP_USER?
>
> mnm:/usr/src/25> grep page_address mm/shmem.c
> mnm:/usr/src/25>

shmem uses GFP_USER for the index pages that point to its GFP_HIGHUSER data pages.

Not to say there aren't other problems in the mix too, but Bill's
main problem here will be one you discovered a while ago, Andrew.
We fixed it then, but in my loopable tmpfs version, and I've been
slow to extract the fixes and push them to mainline (or now -mm),
since there's not much else that suffers apart from dbench.

The problem is that dbench likes to do large random(?) seeks and
then writes at resulting offset; and although shmem-tmpfs imposes
a cap (default: half of memory) on the data pages, it imposes no
cap on its index pages. So it foolishly ends up filling normal
zone with empty index blocks for zero-length files, the index
page allocation being done _before_ the data cap check.
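
To make the ordering concrete, here's a toy userspace model of it (the
names and numbers are purely illustrative, not the real shmem.c code):

#include <stdio.h>

#define DATA_CAP		100	/* stands in for the tmpfs data-page cap */
#define ENTRIES_PER_INDEX	512	/* illustrative fan-out of one index block */

static int data_pages;		/* capped */
static int index_blocks;	/* uncapped: ZONE_NORMAL in the real kernel */

static int sparse_write(unsigned long idx)
{
	/* 1. extend the index far enough to cover idx: no cap applied */
	if ((int)(idx / ENTRIES_PER_INDEX) >= index_blocks)
		index_blocks = idx / ENTRIES_PER_INDEX + 1;

	/* 2. only now is the data cap checked ... */
	if (data_pages >= DATA_CAP)
		return -1;	/* -ENOSPC, but the index blocks stay allocated */

	/* 3. ... before the data page itself is allocated */
	data_pages++;
	return 0;
}

int main(void)
{
	unsigned long i;

	/* dbench-style: seek to large scattered offsets, then write */
	for (i = 0; i < 10000; i++)
		sparse_write((i * 7919) % (1 << 20));

	printf("data pages %d (capped), index blocks %d (uncapped)\n",
	       data_pages, index_blocks);
	return 0;
}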

I'll rebase the relevant fixes against 2.5.35-mm1 later today,
do a little testing and post the patch.

What I never did was try GFP_HIGHUSER and kmap on the index pages:
I think I decided back then that it wasn't likely to be needed
(sparsely filled file indexes are a rarer case than sparsely filled
pagetables, once the stupidity is fixed; and small files don't use
index pages at all). But Bill's testing may well prove me wrong.
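
For the record, the shape of that change would be roughly the following
(a sketch only, under the assumption that every dereference of an index
page gets a kmap around it; alloc_index_page() is a hypothetical helper,
not an existing function):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

static struct page *alloc_index_page(void)
{
	struct page *page;
	void *kaddr;

	page = alloc_page(GFP_HIGHUSER);	/* was get_zeroed_page(GFP_USER) */
	if (!page)
		return NULL;

	kaddr = kmap(page);			/* temporary lowmem mapping */
	memset(kaddr, 0, PAGE_SIZE);
	/* ... swap entries would be filled in under kmap as well ... */
	kunmap(page);

	return page;
}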

Hugh

2002-09-17 07:25:59

by William Lee Irwin III

Subject: Re: dbench on tmpfs OOM's

On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> shmem uses GFP_USER for the index pages that point to its GFP_HIGHUSER data pages.
> Not to say there aren't other problems in the mix too, but Bill's
> main problem here will be one you discovered a while ago, Andrew.
> We fixed it then, but in my loopable tmpfs version, and I've been
> slow to extract the fixes and push them to mainline (or now -mm),
> since there's not much else that suffers apart from dbench.
> The problem is that dbench likes to do large random(?) seeks and
> then writes at resulting offset; and although shmem-tmpfs imposes
> a cap (default: half of memory) on the data pages, it imposes no
> cap on its index pages. So it foolishly ends up filling normal
> zone with empty index blocks for zero-length files, the index
> page allocation being done _before_ the data cap check.

The extreme configurations of my machines put a great deal of stress on
many codepaths. It shouldn't be regarded as any great failing that some
corrections are required to function properly on them, as the visibility
given to highmem-related pressures is far, far greater than seen elsewhere.

One thing to bear in mind is that though the kernel may be functioning
correctly in refusing the requests made, the additional functionality
of being capable of servicing them would be much appreciated and likely
to be of good use for the machines in the field we serve. Although from
your general stance on things, it seems I've little to convince you of.


On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> I'll rebase the relevant fixes against 2.5.35-mm1 later today,
> do a little testing and post the patch.
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

I suspected you may well have plots here, as you have often before,
so I held off on brewing up attempts at such myself until you replied.
I'll defer to you.


Thanks,
Bill

P.S.:

The original intent of this testing was to obtain a profile of "best
case" behavior not limited by throttling within the block layer. The
method was derived from the observation that, though the on-disk test
was seek-bound according to the block layer, as configured it should
have been possible to carry the test out entirely in-core. Running the
test on tmpfs was the method chosen to eliminate the block throttling.

Whatever kind of additional stress-testing and optimization you have an
interest in enabling tmpfs for, I'd be at least moderately interested in
providing it, as I have some interest in using tmpfs to assist with
generalized large page support within the core kernel.

2002-09-17 07:53:51

by Christoph Rohland

Subject: Re: dbench on tmpfs OOM's

Hi Hugh,

On Tue, 17 Sep 2002, Hugh Dickins wrote:
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

I think that this would be a good improvement. Big database and
application servers would definitely benefit from it, and desktops could
more easily use tmpfs as a temporary file system.

I never dared to do it with my limited time since I feared deadlock
situations.

Also, I ended up thinking that I would try to go one step further: make
the index pages swappable, i.e. make the directory nodes normal tmpfs
files. This would even make the accounting right.

Greetings
Christoph


2002-09-17 07:57:51

by Andrew Morton

Subject: Re: dbench on tmpfs OOM's

William Lee Irwin III wrote:
>
> ...
> The extreme configurations of my machines put a great deal of stress on
> many codepaths.

Nevertheless, the issue you raised is valid.

Should direct reclaim reach across and reclaim from other nodes, or not?

If so, when?

I suspect that it shouldn't. Direct reclaim is increasingly
last-resort and only seems to happen (now) with stresstest-type
workloads, and when there is latency in getting kswapd up and
running.

I'd suggest that the balancing of other nodes be left in the
hands of kswapd and that the fallback logic be isolated to the
page allocation path.

The current kswapd logic is all tangled up with direct reclaim.
I've split this up, so that we can balance these two different
functions. Code passes initial stresstesting; there may be corner
cases.

There is some lack of clarity in what kswapd does and what
direct-reclaim tasks do; try_to_free_pages() tries to service both
functions, and they are different.

- kswapd's role is to keep all zones on its node at

zone->free_pages >= zone->pages_high.

and to never stop as long as any zones do not meet that condition.

- A direct reclaimer's role is to try to free some pages from the
zones which are suitable for this particular allocation request, and
to return when that has been achieved, or when all the relevant zones
are at

zone->free_pages >= zone->pages_high.


The patch explicitly separates these two code paths; kswapd does not
run try_to_free_pages() any more. kswapd should not be aware of zone
fallbacks.



include/linux/mmzone.h | 1
mm/page_alloc.c | 3
mm/vmscan.c | 238 +++++++++++++++++++++++--------------------------
3 files changed, 116 insertions(+), 126 deletions(-)

--- 2.5.35/mm/vmscan.c~per-zone-vm Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/mm/vmscan.c Tue Sep 17 00:59:47 2002
@@ -109,7 +109,8 @@ struct vmstats {
int refiled_nonfreeable;
int refiled_no_mapping;
int refiled_nofs;
- int refiled_congested;
+ int refiled_congested_kswapd;
+ int refiled_congested_non_kswapd;
int written_back;
int refiled_writeback;
int refiled_writeback_nofs;
@@ -122,15 +123,19 @@ struct vmstats {
int refill_refiled;
} vmstats;

+
+/*
+ * shrink_list returns the number of reclaimed pages
+ */
static /* inline */ int
-shrink_list(struct list_head *page_list, int nr_pages,
- unsigned int gfp_mask, int *max_scan, int *nr_mapped)
+shrink_list(struct list_head *page_list, unsigned int gfp_mask,
+ int *max_scan, int *nr_mapped)
{
struct address_space *mapping;
LIST_HEAD(ret_pages);
struct pagevec freed_pvec;
- const int nr_pages_in = nr_pages;
int pgactivate = 0;
+ int ret = 0;

pagevec_init(&freed_pvec);
while (!list_empty(page_list)) {
@@ -268,7 +273,10 @@ shrink_list(struct list_head *page_list,
bdi = mapping->backing_dev_info;
if (bdi != current->backing_dev_info &&
bdi_write_congested(bdi)){
- vmstats.refiled_congested++;
+ if (current->flags & PF_KSWAPD)
+ vmstats.refiled_congested_kswapd++;
+ else
+ vmstats.refiled_congested_non_kswapd++;
goto keep_locked;
}

@@ -336,7 +344,7 @@ shrink_list(struct list_head *page_list,
__put_page(page); /* The pagecache ref */
free_it:
unlock_page(page);
- nr_pages--;
+ ret++;
vmstats.reclaimed++;
if (!pagevec_add(&freed_pvec, page))
__pagevec_release_nonlru(&freed_pvec);
@@ -354,11 +362,11 @@ keep:
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
__pagevec_release_nonlru(&freed_pvec);
- mod_page_state(pgsteal, nr_pages_in - nr_pages);
+ mod_page_state(pgsteal, ret);
if (current->flags & PF_KSWAPD)
- mod_page_state(kswapd_steal, nr_pages_in - nr_pages);
+ mod_page_state(kswapd_steal, ret);
mod_page_state(pgactivate, pgactivate);
- return nr_pages;
+ return ret;
}

/*
@@ -367,18 +375,19 @@ keep:
* not freed will be added back to the LRU.
*
* shrink_cache() is passed the number of pages to try to free, and returns
- * the number which are yet-to-free.
+ * the number of pages which were reclaimed.
*
* For pagecache intensive workloads, the first loop here is the hottest spot
* in the kernel (apart from the copy_*_user functions).
*/
static /* inline */ int
-shrink_cache(int nr_pages, struct zone *zone,
+shrink_cache(const int nr_pages, struct zone *zone,
unsigned int gfp_mask, int max_scan, int *nr_mapped)
{
LIST_HEAD(page_list);
struct pagevec pvec;
int nr_to_process;
+ int ret = 0;

/*
* Try to ensure that we free `nr_pages' pages in one pass of the loop.
@@ -391,10 +400,11 @@ shrink_cache(int nr_pages, struct zone *

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- while (max_scan > 0 && nr_pages > 0) {
+ while (max_scan > 0 && ret < nr_pages) {
struct page *page;
int nr_taken = 0;
int nr_scan = 0;
+ int nr_freed;

while (nr_scan++ < nr_to_process &&
!list_empty(&zone->inactive_list)) {
@@ -425,10 +435,10 @@ shrink_cache(int nr_pages, struct zone *

max_scan -= nr_scan;
mod_page_state(pgscan, nr_scan);
- nr_pages = shrink_list(&page_list, nr_pages,
- gfp_mask, &max_scan, nr_mapped);
-
- if (nr_pages <= 0 && list_empty(&page_list))
+ nr_freed = shrink_list(&page_list, gfp_mask,
+ &max_scan, nr_mapped);
+ ret += nr_freed;
+ if (nr_freed <= 0 && list_empty(&page_list))
goto done;

spin_lock_irq(&zone->lru_lock);
@@ -454,7 +464,7 @@ shrink_cache(int nr_pages, struct zone *
spin_unlock_irq(&zone->lru_lock);
done:
pagevec_release(&pvec);
- return nr_pages;
+ return ret;
}

/*
@@ -578,9 +588,14 @@ refill_inactive_zone(struct zone *zone,
mod_page_state(pgdeactivate, pgdeactivate);
}

+/*
+ * Try to reclaim `nr_pages' from this zone. Returns the number of reclaimed
+ * pages. This is a basic per-zone page freer. Used by both kswapd and
+ * direct reclaim.
+ */
static /* inline */ int
-shrink_zone(struct zone *zone, int max_scan,
- unsigned int gfp_mask, int nr_pages, int *nr_mapped)
+shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
+ const int nr_pages, int *nr_mapped)
{
unsigned long ratio;

@@ -601,36 +616,60 @@ shrink_zone(struct zone *zone, int max_s
atomic_sub(SWAP_CLUSTER_MAX, &zone->refill_counter);
refill_inactive_zone(zone, SWAP_CLUSTER_MAX);
}
- nr_pages = shrink_cache(nr_pages, zone, gfp_mask,
- max_scan, nr_mapped);
- return nr_pages;
+ return shrink_cache(nr_pages, zone, gfp_mask, max_scan, nr_mapped);
+}
+
+/*
+ * FIXME: don't do this for ZONE_HIGHMEM
+ */
+/*
+ * Here we assume it costs one seek to replace a lru page and that it also
+ * takes a seek to recreate a cache object. With this in mind we age equal
+ * percentages of the lru and ageable caches. This should balance the seeks
+ * generated by these structures.
+ *
+ * NOTE: for now I do this for all zones. If we find this is too aggressive
+ * on large boxes we may want to exclude ZONE_HIGHMEM.
+ *
+ * If we're encountering mapped pages on the LRU then increase the pressure on
+ * slab to avoid swapping.
+ */
+static void shrink_slab(int total_scanned, int gfp_mask)
+{
+ int shrink_ratio;
+ int pages = nr_used_zone_pages();
+
+ shrink_ratio = (pages / (total_scanned + 1)) + 1;
+ shrink_dcache_memory(shrink_ratio, gfp_mask);
+ shrink_icache_memory(shrink_ratio, gfp_mask);
+ shrink_dqcache_memory(shrink_ratio, gfp_mask);
}

+/*
+ * This is the direct reclaim path, for page-allocating processes. We only
+ * try to reclaim pages from zones which will satisfy the caller's allocation
+ * request.
+ */
static int
shrink_caches(struct zone *classzone, int priority,
- int *total_scanned, int gfp_mask, int nr_pages)
+ int *total_scanned, int gfp_mask, const int nr_pages)
{
struct zone *first_classzone;
struct zone *zone;
- int ratio;
int nr_mapped = 0;
- int pages = nr_used_zone_pages();
+ int ret = 0;

first_classzone = classzone->zone_pgdat->node_zones;
for (zone = classzone; zone >= first_classzone; zone--) {
int max_scan;
int to_reclaim;
- int unreclaimed;

to_reclaim = zone->pages_high - zone->free_pages;
if (to_reclaim < 0)
continue; /* zone has enough memory */

- if (to_reclaim > SWAP_CLUSTER_MAX)
- to_reclaim = SWAP_CLUSTER_MAX;
-
- if (to_reclaim < nr_pages)
- to_reclaim = nr_pages;
+ to_reclaim = min(to_reclaim, SWAP_CLUSTER_MAX);
+ to_reclaim = max(to_reclaim, nr_pages);

/*
* If we cannot reclaim `nr_pages' pages by scanning twice
@@ -639,33 +678,18 @@ shrink_caches(struct zone *classzone, in
max_scan = zone->nr_inactive >> priority;
if (max_scan < to_reclaim * 2)
max_scan = to_reclaim * 2;
- unreclaimed = shrink_zone(zone, max_scan,
- gfp_mask, to_reclaim, &nr_mapped);
- nr_pages -= to_reclaim - unreclaimed;
+ ret += shrink_zone(zone, max_scan, gfp_mask,
+ to_reclaim, &nr_mapped);
*total_scanned += max_scan;
+ *total_scanned += nr_mapped;
+ if (ret >= nr_pages)
+ break;
}
-
- /*
- * Here we assume it costs one seek to replace a lru page and that
- * it also takes a seek to recreate a cache object. With this in
- * mind we age equal percentages of the lru and ageable caches.
- * This should balance the seeks generated by these structures.
- *
- * NOTE: for now I do this for all zones. If we find this is too
- * aggressive on large boxes we may want to exclude ZONE_HIGHMEM
- *
- * If we're encountering mapped pages on the LRU then increase the
- * pressure on slab to avoid swapping.
- */
- ratio = (pages / (*total_scanned + nr_mapped + 1)) + 1;
- shrink_dcache_memory(ratio, gfp_mask);
- shrink_icache_memory(ratio, gfp_mask);
- shrink_dqcache_memory(ratio, gfp_mask);
- return nr_pages;
+ return ret;
}

/*
- * This is the main entry point to page reclaim.
+ * This is the main entry point to direct page reclaim.
*
* If a full scan of the inactive list fails to free enough memory then we
* are "out of memory" and something needs to be killed.
@@ -685,17 +709,18 @@ int
try_to_free_pages(struct zone *classzone,
unsigned int gfp_mask, unsigned int order)
{
- int priority = DEF_PRIORITY;
- int nr_pages = SWAP_CLUSTER_MAX;
+ int priority;
+ const int nr_pages = SWAP_CLUSTER_MAX;
+ int nr_reclaimed = 0;

inc_page_state(pageoutrun);

for (priority = DEF_PRIORITY; priority; priority--) {
int total_scanned = 0;

- nr_pages = shrink_caches(classzone, priority, &total_scanned,
- gfp_mask, nr_pages);
- if (nr_pages <= 0)
+ nr_reclaimed += shrink_caches(classzone, priority,
+ &total_scanned, gfp_mask, nr_pages);
+ if (nr_reclaimed >= nr_pages)
return 1;
if (total_scanned == 0)
return 1; /* All zones had enough free memory */
@@ -710,62 +735,46 @@ try_to_free_pages(struct zone *classzone

/* Take a nap, wait for some writeback to complete */
blk_congestion_wait(WRITE, HZ/4);
+ shrink_slab(total_scanned, gfp_mask);
}
if (gfp_mask & __GFP_FS)
out_of_memory();
return 0;
}

-static int check_classzone_need_balance(struct zone *classzone)
+/*
+ * kswapd will work across all this node's zones until they are all at
+ * pages_high.
+ */
+static void kswapd_balance_pgdat(pg_data_t *pgdat)
{
- struct zone *first_classzone;
+ int priority = DEF_PRIORITY;
+ int i;

- first_classzone = classzone->zone_pgdat->node_zones;
- while (classzone >= first_classzone) {
- if (classzone->free_pages > classzone->pages_high)
- return 0;
- classzone--;
- }
- return 1;
-}
+ for (priority = DEF_PRIORITY; priority; priority--) {
+ int success = 1;

-static int kswapd_balance_pgdat(pg_data_t * pgdat)
-{
- int need_more_balance = 0, i;
- struct zone *zone;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+ int nr_mapped = 0;
+ int max_scan;
+ int to_reclaim;

- for (i = pgdat->nr_zones-1; i >= 0; i--) {
- zone = pgdat->node_zones + i;
- cond_resched();
- if (!zone->need_balance)
- continue;
- if (!try_to_free_pages(zone, GFP_KSWAPD, 0)) {
- zone->need_balance = 0;
- __set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(HZ);
- continue;
+ to_reclaim = zone->pages_high - zone->free_pages;
+ if (to_reclaim <= 0)
+ continue;
+ success = 0;
+ max_scan = zone->nr_inactive >> priority;
+ if (max_scan < to_reclaim * 2)
+ max_scan = to_reclaim * 2;
+ shrink_zone(zone, max_scan, GFP_KSWAPD,
+ to_reclaim, &nr_mapped);
+ shrink_slab(max_scan + nr_mapped, GFP_KSWAPD);
}
- if (check_classzone_need_balance(zone))
- need_more_balance = 1;
- else
- zone->need_balance = 0;
- }
-
- return need_more_balance;
-}
-
-static int kswapd_can_sleep_pgdat(pg_data_t * pgdat)
-{
- struct zone *zone;
- int i;
-
- for (i = pgdat->nr_zones-1; i >= 0; i--) {
- zone = pgdat->node_zones + i;
- if (zone->need_balance)
- return 0;
+ if (success)
+ break; /* All zones are at pages_high */
+ blk_congestion_wait(WRITE, HZ/4);
}
-
- return 1;
}

/*
@@ -785,7 +794,7 @@ int kswapd(void *p)
{
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ DEFINE_WAIT(wait);

daemonize();
set_cpus_allowed(tsk, __node_to_cpu_mask(p->node_id));
@@ -806,27 +815,12 @@ int kswapd(void *p)
*/
tsk->flags |= PF_MEMALLOC|PF_KSWAPD;

- /*
- * Kswapd main loop.
- */
- for (;;) {
+ for ( ; ; ) {
if (current->flags & PF_FREEZE)
refrigerator(PF_IOTHREAD);
- __set_current_state(TASK_INTERRUPTIBLE);
- add_wait_queue(&pgdat->kswapd_wait, &wait);
-
- mb();
- if (kswapd_can_sleep_pgdat(pgdat))
- schedule();
-
- __set_current_state(TASK_RUNNING);
- remove_wait_queue(&pgdat->kswapd_wait, &wait);
-
- /*
- * If we actually get into a low-memory situation,
- * the processes needing more memory will wake us
- * up on a more timely basis.
- */
+ prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ schedule();
+ finish_wait(&pgdat->kswapd_wait, &wait);
kswapd_balance_pgdat(pgdat);
blk_run_queues();
}
--- 2.5.35/mm/page_alloc.c~per-zone-vm Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/mm/page_alloc.c Tue Sep 17 00:23:57 2002
@@ -343,8 +343,6 @@ __alloc_pages(unsigned int gfp_mask, uns
}
}

- classzone->need_balance = 1;
- mb();
/* we're somewhat low on memory, failed to find what we needed */
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];
@@ -869,7 +867,6 @@ void __init free_area_init_core(pg_data_
spin_lock_init(&zone->lru_lock);
zone->zone_pgdat = pgdat;
zone->free_pages = 0;
- zone->need_balance = 0;
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
atomic_set(&zone->refill_counter, 0);
--- 2.5.35/include/linux/mmzone.h~per-zone-vm Tue Sep 17 00:23:57 2002
+++ 2.5.35-akpm/include/linux/mmzone.h Tue Sep 17 00:23:57 2002
@@ -62,7 +62,6 @@ struct zone {
spinlock_t lock;
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
- int need_balance;

ZONE_PADDING(_pad1_)



2002-12-10 05:21:41

by William Lee Irwin III

Subject: Re: dbench on tmpfs OOM's

On Tue, Sep 17, 2002 at 08:01:20AM +0100, Hugh Dickins wrote:
> What I never did was try GFP_HIGHUSER and kmap on the index pages:
> I think I decided back then that it wasn't likely to be needed
> (sparsely filled file indexes are a rarer case than sparsely filled
> pagetables, once the stupidity is fixed; and small files don't use
> index pages at all). But Bill's testing may well prove me wrong.

The included fix works flawlessly under the conditions of the original
reported problem on 2.5.50-bk6-wli-1.

Sorry for not getting back to you sooner.


Thanks,
Bill

Results:
-------

instance 1:
----------
Throughput 86.2057 MB/sec (NB=107.757 MB/sec 862.057 MBit/sec) 512 procs
dbench 512 360.36s user 12645.64s system 1648% cpu 13:08.91 total

instance 2:
----------
Throughput 85.8913 MB/sec (NB=107.364 MB/sec 858.913 MBit/sec) 512 procs
dbench 512 361.96s user 11780.65s system 1539% cpu 13:08.97 total


Peak memory consumption during the run:

/proc/meminfo:
-------------
MemTotal: 32125300 kB
MemFree: 7841472 kB
MemShared: 0 kB
Buffers: 1236 kB
Cached: 23397036 kB
SwapCached: 0 kB
Active: 149512 kB
Inactive: 23386864 kB
HighTotal: 31588352 kB
HighFree: 7681344 kB
LowTotal: 536948 kB
LowFree: 160128 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 142508 kB
Slab: 133020 kB
Committed_AS: 23757472 kB
PageTables: 18820 kB
ReverseMaps: 168934
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

/proc/slabinfo (reported by bloatmost):
---------------------------------------
shmem_inode_cache: 39321KB 39682KB 99.9
radix_tree_node: 33870KB 34335KB 98.64
pae_pmd: 18732KB 18732KB 100.0
dentry_cache: 11612KB 14156KB 82.2
task_struct: 2691KB 2710KB 99.32
signal_act: 2207KB 2216KB 99.58
filp: 1976KB 2032KB 97.23
size-1024: 1824KB 1824KB 100.0
names_cache: 1740KB 1740KB 100.0
vm_area_struct: 1598KB 1650KB 96.88
pte_chain: 1271KB 1305KB 97.39
size-2048: 982KB 1032KB 95.15
biovec-BIO_MAX_PAGES: 768KB 780KB 98.46
files_cache: 704KB 704KB 100.0
mm_struct: 656KB 665KB 98.65
size-512: 421KB 436KB 96.55
blkdev_requests: 400KB 405KB 98.76
biovec-128: 384KB 390KB 98.46
ext2_inode_cache: 309KB 315KB 98.33
inode_cache: 253KB 253KB 100.0
skbuff_head_cache: 221KB 251KB 88.24
size-32: 183KB 211KB 86.68