Andrew Morton wrote:
> shrink_slab-for-all-zones.patch
> vm: scan slab in response to highmem scanning
>
> zone-balancing-fix.patch
> vmscan: zone balancing fix
On 2.6.3 + [1] + nfsd-lofft.patch, running on a 1GB RAM file server, I
have noticed two related issues.
First, under 2.6.3 it averages about 90MB[2] anon memory, and 30MB with
the -mm4 vm (the rest is in swap cache). This could balance out on the
normal non-idle week-day load though...
Second, with the -mm4 vm there is a lot more swapping[3] going on during
the daily updatedb and backup runs that are performed on this machine.
I'd have to call this second issue a regression, but I want to run it a
couple more days to see if it gets any better (unless you agree of course).
Mike
[1]
instrument-highmem-page-reclaim.patch
blk_congestion_wait-return-remaining.patch
vmscan-remove-priority.patch
kswapd-throttling-fixes.patch
vm-dont-rotate-active-list.patch
vm-lru-info.patch
vm-shrink-zone.patch
vm-tune-throttle.patch
shrink_slab-for-all-zones.patch
zone-balancing-fix.patch
zone-balancing-batching.patch
[2]
http://www.matchmail.com/stats/lrrd/matchmail.com/fileserver.matchmail.com-memory.html
[3]
http://www.matchmail.com/stats/lrrd/matchmail.com/fileserver.matchmail.com-swap.html
Mike Fedyk wrote:
> Andrew Morton wrote:
>
>> shrink_slab-for-all-zones.patch
>> vm: scan slab in response to highmem scanning
>>
>> zone-balancing-fix.patch
>> vmscan: zone balancing fix
>
>
> On 2.6.3 + [1] + nfsd-lofft.patch, running on a 1GB RAM file server,
> I have noticed two related issues.
>
> First, under 2.6.3 it averages about 90MB[2] anon memory, and 30MB
> with the -mm4 vm (the rest is in swap cache). This could balance out
> on the normal non-idle week-day load though...
>
> Second, with the -mm4 vm there is a lot more swapping[3] going on during
> the daily updatedb and backup runs that are performed on this machine.
> I'd have to call this second issue a regression, but I want to run it
> a couple more days to see if it gets any better (unless you agree of
> course).
>
There are a few things backed out now in 2.6.4-rc1-mm1, and quite a
few other changes. I hope we can trouble you to test 2.6.4-rc1-mm1?
Tell me, do you have highmem enabled on this system? If so, swapping
might be explained by the batching patch. With it, a small highmem
zone could possibly place quite a lot more pressure on a large
ZONE_NORMAL.
2.6.4-rc1-mm1 should do much better here.
Nick Piggin wrote:
>
>
> Mike Fedyk wrote:
>
>> Andrew Morton wrote:
>>
>>> shrink_slab-for-all-zones.patch
>>> vm: scan slab in response to highmem scanning
>>>
>>> zone-balancing-fix.patch
>>> vmscan: zone balancing fix
>>
>>
>>
>> On 2.6.3 + [1] + nfsd-lofft.patch, running on a 1GB RAM file server,
>> I have noticed two related issues.
>>
>> First, under 2.6.3 it averages about 90MB[2] anon memory, and 30MB
>> with the -mm4 vm (the rest is in swap cache). This could balance out
>> on the normal non-idle week-day load though...
>>
>> Second, with the -mm4 vm there is a lot more swapping[3] going on during
>> the daily updatedb and backup runs that are performed on this machine.
>> I'd have to call this second issue a regression, but I want to run it
>> a couple more days to see if it gets any better (unless you agree of
>> course).
>>
>
> There are a few things backed out now in 2.6.4-rc1-mm1, and quite a
> few other changes. I hope we can trouble you to test 2.6.4-rc1-mm1?
Yes, I saw that, but since I wasn't using the new code, I chose to keep
it in the "-mm4" thread. :-D
I'll backport it to 2.6.3 if it doesn't patch with "-F3"...
> Tell me, do you have highmem enabled on this system? If so, swapping
Yes, to get that extra 128MB ram. :)
> might be explained by the batching patch. With it, a small highmem
> zone could possibly place quite a lot more pressure on a large
> ZONE_NORMAL.
>
> 2.6.4-rc1-mm1 should do much better here.
OK, I'll give that one a shot Monday or Tuesday night.
So, I'll merge up 2.6.3 + "vm of rc1-mm1" and tell you guys what I see.
Are the graphs helpful at all?
Mike
Nick Piggin wrote:
>
> There are a few things backed out now in 2.6.4-rc1-mm1, and quite a
> few other changes. I hope we can trouble you to test 2.6.4-rc1-mm1?
>
> Tell me, do you have highmem enabled on this system? If so, swapping
> might be explained by the batching patch. With it, a small highmem
> zone could possibly place quite a lot more pressure on a large
> ZONE_NORMAL.
>
> 2.6.4-rc1-mm1 should do much better here.
Gah no. It would have the same problem actually, if that is indeed
what is happening.
It will take a bit more work to solve this in rc1-mm1. You would
probably want to explicitly use incremental min limits for kswapd.
(background info in kswapd-avoid-higher-zones.patch)
Mike Fedyk wrote:
> Nick Piggin wrote:
>
>>
>>
>> There are a few things backed out now in 2.6.4-rc1-mm1, and quite a
>> few other changes. I hope we can trouble you to test 2.6.4-rc1-mm1?
>
>
> Yes, I saw that, but since I wasn't using the new code, I chose to
> keep it in the "-mm4" thread. :-D
>
> I'll backport it to 2.6.3 if it doesn't patch with "-F3"...
>
Actually, see my other post. It is possible you'll have the same
problem.
>> Tell me, do you have highmem enabled on this system? If so, swapping
>
>
> Yes, to get that extra 128MB ram. :)
>
Yeah, that's fine. I think this would be the right thing to do,
especially for a file server. It is something that should work.
>> might be explained by the batching patch. With it, a small highmem
>> zone could possibly place quite a lot more pressure on a large
>> ZONE_NORMAL.
>>
>> 2.6.4-rc1-mm1 should do much better here.
>
>
> OK, I'll give that one a shot Monday or Tuesday night.
>
> So, I'll merge up 2.6.3 + "vm of rc1-mm1" and tell you guys what I see.
>
I'm not so hopeful for you anymore :P
> Are the graphs helpful at all?
>
My eyes! The goggles, they do nothing!
They have a lot of good info but I'm a bit hard pressed working
out what kernel is running where, and it's a bit hard working out
all the shades of blue on my crappy little monitor.
But if they were easier to read I reckon they'd be useful ;)
Nick Piggin wrote:
>
>
> Mike Fedyk wrote:
>> So, I'll merge up 2.6.3 + "vm of rc1-mm1" and tell you guys what I see.
>>
>
> I'm not so hopeful for you anymore :P
These patches apply with only a few offsets if you apply them in the
order given in the series file, so there's not much work for either of
us (unless I need to test without a dependent patch or something
obvious like that...)
>
>> Are the graphs helpful at all?
>>
>
>
> My eyes! The goggles, they do nothing!
>
Heh.
> They have a lot of good info but I'm a bit hard pressed working
> out what kernel is running where
Suffice it to say, 2.6.3 is the beginning of week 9, and 2.6.3-lofft-mm4vm
is the end of week 9. The graphs weren't meant to record secondary
information like kernel version...
> and it's a bit hard working out
> all the shades of blue on my crappy little monitor.
Yeah, I see what you mean. The code in the lrrd/munin project controls
what colors come in what order, but I can control what order the info is
output in...
> But if they were easier to read I reckon they'd be useful ;)
I'd like that to be true, especially since I rewrote the memory
plugin for munin to graph as much as is exported to userspace by the
Linux kernel...
Did I miss anything? ;)
linux-2.6-npiggin/mm/vmscan.c | 36 ++++++++++++++++++++++++------------
1 files changed, 24 insertions(+), 12 deletions(-)
diff -puN mm/vmscan.c~vm-kswapd-incremental-min mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-kswapd-incremental-min 2004-03-01 20:29:18.000000000 +1100
+++ linux-2.6-npiggin/mm/vmscan.c 2004-03-01 20:44:26.000000000 +1100
@@ -889,6 +889,8 @@ out:
return ret;
}
+extern int sysctl_lower_zone_protection;
+
/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
@@ -907,12 +909,9 @@ out:
* dead and from now on, only perform a short scan. Basically we're polling
* the zone for when the problem goes away.
*
- * kswapd scans the zones in the highmem->normal->dma direction. It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones. This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * balance_pgdat tries to coexist with the INFAMOUS "incremental min" by
+ * trying to free lower zones a bit harder if higher zones are low too.
+ * See mm/page_alloc.c
*/
static int balance_pgdat(pg_data_t *pgdat, int nr_pages, struct page_state *ps)
{
@@ -930,24 +929,37 @@ static int balance_pgdat(pg_data_t *pgda
}
for (priority = DEF_PRIORITY; priority; priority--) {
+ unsigned long min;
int all_zones_ok = 1;
int pages_scanned = 0;
+ min = 0; /* Shut up gcc */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
int total_scanned = 0;
int max_scan;
int reclaimed;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
- continue;
-
if (nr_pages == 0) { /* Not software suspend */
- if (zone->free_pages <= zone->pages_high)
- all_zones_ok = 0;
+ /* "incremental min" right here */
if (all_zones_ok)
+ min = zone->pages_high;
+ else
+ min += zone->pages_high;
+
+ if (zone->free_pages <= min)
+ all_zones_ok = 0;
+ else
continue;
+
+ min += zone->pages_high *
+ sysctl_lower_zone_protection;
}
+
+ /* Note: this is checked *after* min is incremented */
+ if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ continue;
+
zone->temp_priority = priority;
max_scan = zone->nr_inactive >> priority;
reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
_
Nick Piggin <[email protected]> wrote:
>
> Andrew, I think you had kswapd scanning in the direction opposite the
> one indicated by your comments. Or maybe I've just confused myself?
>
Nope, the node_zones[] array is indexed by
#define ZONE_DMA 0
#define ZONE_NORMAL 1
#define ZONE_HIGHMEM 2
Whereas the zonelist.zones[] array goes the other way:
[[[highmem,]normal,]dma,]NULL.
That stuff is a complete dog's breakfast: hard to understand, not very
logical, and insufficiently commented.
linux-2.6-npiggin/mm/vmscan.c | 34 +++++++++++++++++++++++-----------
1 files changed, 23 insertions(+), 11 deletions(-)
diff -puN mm/vmscan.c~vm-kswapd-incremental-min mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-kswapd-incremental-min 2004-03-01 20:29:18.000000000 +1100
+++ linux-2.6-npiggin/mm/vmscan.c 2004-03-01 21:27:24.000000000 +1100
@@ -889,6 +889,8 @@ out:
return ret;
}
+extern int sysctl_lower_zone_protection;
+
/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
@@ -907,12 +909,9 @@ out:
* dead and from now on, only perform a short scan. Basically we're polling
* the zone for when the problem goes away.
*
- * kswapd scans the zones in the highmem->normal->dma direction. It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones. This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * balance_pgdat tries to coexist with the INFAMOUS "incremental min" by
+ * trying to free lower zones a bit harder if higher zones are low too.
+ * See mm/page_alloc.c
*/
static int balance_pgdat(pg_data_t *pgdat, int nr_pages, struct page_state *ps)
{
@@ -930,8 +929,10 @@ static int balance_pgdat(pg_data_t *pgda
}
for (priority = DEF_PRIORITY; priority; priority--) {
+ unsigned long min;
int all_zones_ok = 1;
int pages_scanned = 0;
+ min = 0; /* Shut up gcc */
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;
@@ -939,15 +940,26 @@ static int balance_pgdat(pg_data_t *pgda
int max_scan;
int reclaimed;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
- continue;
-
if (nr_pages == 0) { /* Not software suspend */
- if (zone->free_pages <= zone->pages_high)
- all_zones_ok = 0;
+ /* "incremental min" right here */
if (all_zones_ok)
+ min = zone->pages_high;
+ else
+ min += zone->pages_high;
+
+ if (zone->free_pages <= min)
+ all_zones_ok = 0;
+ else
continue;
+
+ min += zone->pages_high *
+ sysctl_lower_zone_protection;
}
+
+ /* Note: this is checked *after* min is incremented */
+ if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ continue;
+
zone->temp_priority = priority;
max_scan = zone->nr_inactive >> priority;
reclaimed = shrink_zone(zone, max_scan, GFP_KERNEL,
_