2006-10-17 17:35:08

by Martin Bligh

[permalink] [raw]
Subject: [RFC] Remove temp_priority

diff -aurpN -X /home/mbligh/.diff.exclude linux-2.6.18/include/linux/mmzone.h 2.6.18-prio_fix/include/linux/mmzone.h
--- linux-2.6.18/include/linux/mmzone.h 2006-09-20 12:24:41.000000000 -0700
+++ 2.6.18-prio_fix/include/linux/mmzone.h 2006-10-17 10:27:48.000000000 -0700
@@ -199,13 +199,9 @@ struct zone {
* under - it drives the swappiness decision: whether to unmap mapped
* pages.
*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to free_pages == pages_high.
- *
- * Access to both these fields is quite racy even on uniprocessor. But
- * it is expected to average out OK.
+ * Access to this field is quite racy even on uniprocessor. It needs
+ * to be fixed.
*/
- int temp_priority;
int prev_priority;


diff -aurpN -X /home/mbligh/.diff.exclude linux-2.6.18/mm/page_alloc.c 2.6.18-prio_fix/mm/page_alloc.c
--- linux-2.6.18/mm/page_alloc.c 2006-09-20 12:24:42.000000000 -0700
+++ 2.6.18-prio_fix/mm/page_alloc.c 2006-10-17 10:25:55.000000000 -0700
@@ -2016,7 +2016,7 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;
zone->free_pages = 0;

- zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+ zone->prev_priority = DEF_PRIORITY;

zone_pcp_init(zone);
INIT_LIST_HEAD(&zone->active_list);
diff -aurpN -X /home/mbligh/.diff.exclude linux-2.6.18/mm/vmscan.c 2.6.18-prio_fix/mm/vmscan.c
--- linux-2.6.18/mm/vmscan.c 2006-09-20 12:24:42.000000000 -0700
+++ 2.6.18-prio_fix/mm/vmscan.c 2006-10-17 10:25:24.000000000 -0700
@@ -934,7 +934,6 @@ static unsigned long shrink_zones(int pr
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

- zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;

@@ -984,7 +983,6 @@ unsigned long try_to_free_pages(struct z
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

- zone->temp_priority = DEF_PRIORITY;
lru_pages += zone->nr_active + zone->nr_inactive;
}

@@ -1021,6 +1019,8 @@ unsigned long try_to_free_pages(struct z
if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
blk_congestion_wait(WRITE, HZ/10);
}
+ if (priority < 0)
+ priority = 0;
out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
@@ -1028,7 +1028,7 @@ out:
if (!cpuset_zone_allowed(zone, __GFP_HARDWALL))
continue;

- zone->prev_priority = zone->temp_priority;
+ zone->prev_priority = priority;
}
return ret;
}
@@ -1075,12 +1075,6 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->temp_priority = DEF_PRIORITY;
- }
-
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
@@ -1140,7 +1134,6 @@ scan:
if (!zone_watermark_ok(zone, order, zone->pages_high,
end_zone, 0))
all_zones_ok = 0;
- zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
sc.nr_scanned = 0;
@@ -1182,11 +1175,13 @@ scan:
if (nr_reclaimed >= SWAP_CLUSTER_MAX)
break;
}
+ if (priority < 0)
+ priority = 0;
out:
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;

- zone->prev_priority = zone->temp_priority;
+ zone->prev_priority = priority;
}
if (!all_zones_ok) {
cond_resched();


Attachments:
2.6.18-prio_fix (3.40 kB)

2006-10-17 17:42:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Remove temp_priority

Martin Bligh wrote:
> This is not tested yet. What do you think?
>
> This patch removes temp_priority, as it is racy. We're setting
> prev_priority from it, and yet temp_priority could have been
> set back to DEF_PRIORITY by another reclaimer.

I like it. I wonder if we should get kswapd to stick its priority
into the zone at the point where zone_watermark_ok becomes true,
rather than setting all zones to the lowest priority? That would
require a bit more logic though I guess.

For that matter (going off the topic a bit), I wonder if
try_to_free_pages should have a watermark check there too? This
might help reduce the latency issue you brought up where one process
has reclaimed a lot of pages, but another isn't making any progress
and has to go through the full priority range? Maybe that's
statistically pretty unlikely?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-17 17:53:26

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC] Remove temp_priority

Nick Piggin wrote:
> Martin Bligh wrote:
>
>> This is not tested yet. What do you think?
>>
>> This patch removes temp_priority, as it is racy. We're setting
>> prev_priority from it, and yet temp_priority could have been
>> set back to DEF_PRIORITY by another reclaimer.
>
>
> I like it.

OK, I think that should fix most of it, and I'll admit it's cleaner
than the first one.

> I wonder if we should get kswapd to stick its priority
> into the zone at the point where zone_watermark_ok becomes true,
> rather than setting all zones to the lowest priority? That would
> require a bit more logic though I guess.
>
> For that matter (going off the topic a bit), I wonder if
> try_to_free_pages should have a watermark check there too? This
> might help reduce the latency issue you brought up where one process
> has reclaimed a lot of pages, but another isn't making any progress
> and has to go through the full priority range? Maybe that's
> statistically pretty unlikely?

I've been mulling over how to kill prev_priority (and make everyone
happy, including akpm). My original thought was to keep a different
min_priority for each of GFP_IO, GFP_IO|GFP_FS, and the no IO ones.
But we still have the problem of how to accurately set the min back
up when we are sucessful.

Perhaps we should be a little more radical, and treat everyone apart
from kswapd as independant. Keep a kswapd_priority in the zone
structure, and all the direct reclaimers have their own local priority.
Then we set distress from min(kswap_priority, priority). All that does
is kick the direct reclaimers up a bit faster - kswapd has the easiest
time reclaiming pages, so that should never be too low.

M.




2006-10-18 12:42:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Remove temp_priority

Martin Bligh wrote:
> Nick Piggin wrote:

>> For that matter (going off the topic a bit), I wonder if
>> try_to_free_pages should have a watermark check there too? This
>> might help reduce the latency issue you brought up where one process
>> has reclaimed a lot of pages, but another isn't making any progress
>> and has to go through the full priority range? Maybe that's
>> statistically pretty unlikely?
>
>
> I've been mulling over how to kill prev_priority (and make everyone
> happy, including akpm). My original thought was to keep a different
> min_priority for each of GFP_IO, GFP_IO|GFP_FS, and the no IO ones.
> But we still have the problem of how to accurately set the min back
> up when we are sucessful.
>
> Perhaps we should be a little more radical, and treat everyone apart
> from kswapd as independant. Keep a kswapd_priority in the zone
> structure, and all the direct reclaimers have their own local priority.
> Then we set distress from min(kswap_priority, priority). All that does
> is kick the direct reclaimers up a bit faster - kswapd has the easiest
> time reclaiming pages, so that should never be too low.

I think that could *work*, but I still think it is a heuristics change
rather than a bug fix.

Do we want everyone to make some progress, even if that means having
some do some swapping and others not; or have the zone pressure (and
tendancy to swap) depend on how well progress is going, globally?

The latter is what we have now, and I don't think it is terrible (not
saying your idea can't work better, but it would need careful
consideration).

Coming from another angle, I am thinking about doing away with direct
reclaim completely. That means we don't need any GFP_IO or GFP_FS, and
solves the problem of large numbers of processes stuck in reclaim and
skewing aging and depleting the memory reserve.

But that's tricky because we don't have enough kswapds to get maximum
reclaim throughput on many configurations (only single core opterons
and UP systems, really).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-18 14:49:06

by Martin Bligh

[permalink] [raw]
Subject: Re: [RFC] Remove temp_priority

> Coming from another angle, I am thinking about doing away with direct
> reclaim completely. That means we don't need any GFP_IO or GFP_FS, and
> solves the problem of large numbers of processes stuck in reclaim and
> skewing aging and depleting the memory reserve.

Last time I proposed that, the objection was how to throttle the heavy
dirtiers so they don't fill up RAM with dirty pages?

Also, how do you do atomic allocations? Create a huge memory pool and
pray really hard?

> But that's tricky because we don't have enough kswapds to get maximum
> reclaim throughput on many configurations (only single core opterons
> and UP systems, really).

It's not a question of enough kswapds. It's that we can dirty pages
faster than they can possibly be written to disk.

dd if=/dev/zero of=/tmp/foo

M.


2006-10-18 14:56:22

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Remove temp_priority

Martin J. Bligh wrote:
>> Coming from another angle, I am thinking about doing away with direct
>> reclaim completely. That means we don't need any GFP_IO or GFP_FS, and
>> solves the problem of large numbers of processes stuck in reclaim and
>> skewing aging and depleting the memory reserve.
>
>
> Last time I proposed that, the objection was how to throttle the heavy
> dirtiers so they don't fill up RAM with dirty pages?

Now that we have the dirty mmap accounting, page dirtiers should be
throttled pretty well via page writeback throttling.

> Also, how do you do atomic allocations? Create a huge memory pool and
> pray really hard?

Well, yes. Atomic allocations as of *today* cannot do any reclaim, and
thus they rely on kswapd to free their memory, and we keep a (not huge)
memory pool for them. They also have to be able to handle failures, and
by and large they do OK.

>> But that's tricky because we don't have enough kswapds to get maximum
>> reclaim throughput on many configurations (only single core opterons
>> and UP systems, really).
>
>
> It's not a question of enough kswapds. It's that we can dirty pages
> faster than they can possibly be written to disk.
>
> dd if=/dev/zero of=/tmp/foo

You can't catch that at the allocation side anyway because clean pagecache
may already exist for /tmp/foo.

We've always done pretty well (in 2.6) with correctly throttling and
limiting write(2) writes into pagecache, haven't we?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com