2006-10-17 01:30:56

by Martin Bligh

Subject: [PATCH] Use min of two prio settings in calculating distress for reclaim

diff -aurpN -X /home/mbligh/.diff.exclude linux-2.6.18/mm/vmscan.c 2.6.18-min_prio/mm/vmscan.c
--- linux-2.6.18/mm/vmscan.c 2006-09-20 12:24:42.000000000 -0700
+++ 2.6.18-min_prio/mm/vmscan.c 2006-10-16 18:16:39.000000000 -0700
@@ -713,7 +713,7 @@ done:
* But we had to alter page->flags anyway.
*/
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc, int priority)
{
unsigned long pgmoved;
int pgdeactivate = 0;
@@ -734,7 +734,7 @@ static void shrink_active_list(unsigned
* `distress' is a measure of how much trouble we're having
* reclaiming pages. 0 -> no problems. 100 -> great trouble.
*/
- distress = 100 >> zone->prev_priority;
+ distress = 100 >> min(zone->prev_priority, priority);

/*
* The point of this algorithm is to decide when to start
@@ -885,7 +885,7 @@ static unsigned long shrink_zone(int pri
nr_to_scan = min(nr_active,
(unsigned long)sc->swap_cluster_max);
nr_active -= nr_to_scan;
- shrink_active_list(nr_to_scan, zone, sc);
+ shrink_active_list(nr_to_scan, zone, sc, priority);
}

if (nr_inactive) {
@@ -1315,7 +1315,7 @@ static unsigned long shrink_all_zones(un
if (zone->nr_scan_active >= nr_pages || pass > 3) {
zone->nr_scan_active = 0;
nr_to_scan = min(nr_pages, zone->nr_active);
- shrink_active_list(nr_to_scan, zone, sc);
+ shrink_active_list(nr_to_scan, zone, sc, prio);
}
}

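For context, the distress value computed above feeds the swap_tendency heuristic a few lines further down in shrink_active_list(). A simplified sketch of that logic as it stands with the patch applied (paraphrased from 2.6.18 mm/vmscan.c; the mapped_ratio bookkeeping is elided behind a hypothetical helper):

	if (sc->may_swap) {
		long mapped_ratio, distress, swap_tendency;

		/*
		 * 0 -> no trouble reclaiming, 100 -> great trouble.
		 * With the patch, the caller's own priority is taken
		 * into account, not just the zone-wide prev_priority.
		 */
		distress = 100 >> min(zone->prev_priority, priority);

		/* percentage of memory mapped into pagetables (elided) */
		mapped_ratio = compute_mapped_ratio();

		/*
		 * Only reclaim mapped pages once the combined pressure
		 * signals cross a threshold of 100.
		 */
		swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;
		if (swap_tendency >= 100)
			reclaim_mapped = 1;
	}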


2006-10-17 06:34:43

by Nick Piggin

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

Martin Bligh wrote:

> Another bug is that if try_to_free_pages / balance_pgdat are called
> with a gfp_mask specifying GFP_IO and/or GFP_FS, they may reclaim
> the requisite number of pages, and reset prev_priority to DEF_PRIORITY.
>
> However, another reclaimer without those gfp_mask flags set may still
> be struggling to reclaim pages. The easy fix for this is to key the
> distress calculation not off zone->prev_priority alone, but also off
> the local caller's priority, by using:
> min(zone->prev_priority, sc->priority)


Does it really matter who is doing the actual reclaiming? IMO, if the
non-crippled (GFP_IO|GFP_FS) reclaimer is making progress, the other
guy doesn't need to start swapping, and should soon notice that some
pages are getting freed up.

Workloads where non-GFP_IO or non-GFP_FS reclaimers are having a lot of
trouble indicate that either it is very swappy or page writeback has
broken down and lots of dirty pages are being reclaimed off the LRU.
In either case, they are likely to continue to have problems, even if
they are now able to unmap the odd page.

What are the empirical effects of this patch? What's the numbers? And
what have you done to akpm? ;)

2006-10-17 06:45:24

by Martin Bligh

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

Nick Piggin wrote:
> Martin Bligh wrote:
>
>> Another bug is that if try_to_free_pages / balance_pgdat are called
>> with a gfp_mask specifying GFP_IO and/or GFP_FS, they may reclaim
>> the requisite number of pages, and reset prev_priority to DEF_PRIORITY.
>>
>> However, another reclaimer without those gfp_mask flags set may still
>> be struggling to reclaim pages. The easy fix for this is to key the
>> distress calculation not off zone->prev_priority alone, but also off
>> the local caller's priority, by using:
>> min(zone->prev_priority, sc->priority)
>
> Does it really matter who is doing the actual reclaiming? IMO, if the
> non-crippled (GFP_IO|GFP_FS) reclaimer is making progress, the other
> guy doesn't need to start swapping, and should soon notice that some
> pages are getting freed up.

That's not what happens though. We walk down the priorities, fail to
reclaim anything (in this case, move anything from active to inactive)
and the OOM killer fires. Perhaps the other pages being freed are
being stolen ... we're in direct reclaim here; we're meant to be
getting our own pages.

Why would we ever want distress to be based off a priority that's
higher than our current one? That's just silly.

> Workloads where non-GFP_IO or non-GFP_FS reclaimers are having a lot of
> trouble indicate that either it is very swappy or page writeback has
> broken down and lots of dirty pages are being reclaimed off the LRU.
> In either case, they are likely to continue to have problems, even if
> they are now able to unmap the odd page.

We scanned 122,000-odd pages, of which we skipped over 100,000
because they were mapped, and we didn't think we had to try very
hard, because distress was 0.
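(The skip in question happens in shrink_active_list()'s scan loop; close to the 2.6.18 code, lightly trimmed. With reclaim_mapped still 0, every mapped page is rotated straight back onto the active list:

	while (!list_empty(&l_hold)) {
		cond_resched();
		page = lru_to_page(&l_hold);
		list_del(&page->lru);
		if (page_mapped(page)) {
			if (!reclaim_mapped ||
			    (total_swap_pages == 0 && PageAnon(page)) ||
			    page_referenced(page, 0)) {
				/* not reclaiming mapped: rotate back */
				list_add(&page->lru, &l_active);
				continue;
			}
		}
		list_add(&page->lru, &l_inactive);
	}
)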

> What are the empirical effects of this patch? What's the numbers? And
> what have you done to akpm? ;)

Showed him a real trace of a production system blowing up?
Demonstrated that the current heuristics are broken?
That sort of thing.

M.

2006-10-17 06:50:24

by Andrew Morton

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

On Tue, 17 Oct 2006 16:33:53 +1000
Nick Piggin <[email protected]> wrote:

> And what have you done to akpm? ;)

I'm sulking. Would prefer to bitbucket the whole ->prev_priority thing
and do something better, but I haven't got around to thinking about it yet.

2006-10-17 06:56:18

by Nick Piggin

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

Martin J. Bligh wrote:

> Nick Piggin wrote:
>
>> Martin Bligh wrote:
>>
>>> Another bug is that if try_to_free_pages / balance_pgdat are called
>>> with a gfp_mask specifying GFP_IO and/or GFP_FS, they may reclaim
>>> the requisite number of pages, and reset prev_priority to DEF_PRIORITY.
>>>
>>> However, another reclaimer without those gfp_mask flags set may still
>>> be struggling to reclaim pages. The easy fix for this is to key the
>>> distress calculation not off zone->prev_priority alone, but also off
>>> the local caller's priority, by using:
>>> min(zone->prev_priority, sc->priority)
>>
>> Does it really matter who is doing the actual reclaiming? IMO, if the
>> non-crippled (GFP_IO|GFP_FS) reclaimer is making progress, the other
>> guy doesn't need to start swapping, and should soon notice that some
>> pages are getting freed up.
>
> That's not what happens though. We walk down the priorities, fail to
> reclaim anything (in this case, move anything from active to inactive)
> and the OOM killer fires. Perhaps the other pages being freed are
> being stolen ... we're in direct reclaim here; we're meant to be
> getting our own pages.


That may be what happens in *your* kernel, but does it happen upstream?
Because I expect this patch would fix the problem:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=408d85441cd5a9bd6bc851d677a10c605ed8db5f

(together with the zone_near_oom thing)

>
> Why would we ever want distress to be based off a priority that's
> higher than our current one? That's just silly.


Maybe there is an occasional GFP_NOFS reclaimer, and the workload
involves a huge number of easy to reclaim inodes. If there are some
GFP_KERNEL allocators (or kswapd) that are otherwise making a meal
of this work, then we don't want to start swapping stuff out.

Hypothetical maybe, but you can't just make the assertion that it is
just silly, because that isn't clear. And it isn't clear that your
patch fixes anything.

>> Workloads where non-GFP_IO or non-GFP_FS reclaimers are having a lot of
>> trouble indicate that either it is very swappy or page writeback has
>> broken down and lots of dirty pages are being reclaimed off the LRU.
>> In either case, they are likely to continue to have problems, even if
>> they are now able to unmap the odd page.
>
> We scanned 122,000-odd pages, of which we skipped over 100,000
> because they were mapped, and we didn't think we had to try very
> hard, because distress was 0.


So... presumably next time, we will try harder?

>> What are the empirical effects of this patch? What's the numbers? And
>> what have you done to akpm? ;)
>
> Showed him a real trace of a production system blowing up?
> Demonstrated that the current heuristics are broken?
> That sort of thing.


So, what kernel did you test?


2006-10-17 13:54:09

by Martin Bligh

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

>> That's not what happens though. We walk down the priorities, fail to
>> reclaim anything (in this case, move anything from active to inactive)
>> and the OOM killer fires. Perhaps the other pages being freed are
>> being stolen ... we're in direct reclaim here; we're meant to be
>> getting our own pages.
>
> That may be what happens in *your* kernel, but does it happen upstream?
> Because I expect this patch would fix the problem:
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=408d85441cd5a9bd6bc851d677a10c605ed8db5f
>
> (together with the zone_near_oom thing)

So? I fail to see how slapping another bandaid on top of it prevents
us from fixing an underlying problem.

>> Why would we ever want distress to be based off a priority that's
>> higher than our current one? That's just silly.
>
> Maybe there is an occasional GFP_NOFS reclaimer, and the workload
> involves a huge number of easy to reclaim inodes. If there are some
> GFP_KERNEL allocators (or kswapd) that are otherwise making a meal
> of this work, then we don't want to start swapping stuff out.
>
> Hypothetical maybe, but you can't just make the assertion that it is
> just silly, because that isn't clear. And it isn't clear that your
> patch fixes anything.

It'll stop the GFP_NOFS reclaimer going OOM. Yes, some other
bandaids might do that as well, but what on earth is the point? By
the time anyone gets to the lower prios, they SHOULD be setting
distress up - they ARE in distress.

The point of having zone->prev_priority is to kick everyone up into
reclaim faster in sync with each other. Look at how it's set: it's meant
to function as a min of everyone's prio. Then when we start successfully
reclaiming again, we kick it back to DEF_PRIORITY to indicate everything
is OK. But that fails to take account of the fact that some reclaimers
will struggle more than others.

If we're in direct reclaim in the first place, it's because we're
allocating faster than kswapd can keep up, or we're meant to be
throttling ourselves on dirty writeback. Either way, we need to be
freeing our own pages, not dicking around thrashing the LRU lists
without doing anything.

All of this debate seems pretty pointless to me. Look at the code, and
what I fixed. It's clearly broken by inspection. I'm not saying this is
the only way to fix it, or that it fixes all the world's problems. But
it's clearly an incremental improvement by fixing an obvious bug.

No wonder Linux doesn't degrade gracefully under memory pressure - we
refuse to reclaim when we need to.

M.

PS. I think we should just remove temp_priority altogether, and use a
local variable.
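For reference, the mechanism in question works roughly like this in 2.6.18 (paraphrased; error handling and throttling elided). Each direct reclaimer seeds temp_priority, shrink_zones() records how hard it is trying, and the exit path copies temp_priority back into prev_priority — which is where a successful reclaimer can wipe out a struggling one's lower value:

	/* Paraphrased from 2.6.18 try_to_free_pages(). */
	for (i = 0; zones[i] != NULL; i++)
		zones[i]->temp_priority = DEF_PRIORITY;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		/*
		 * shrink_zones() notes the priority each zone was
		 * scanned at:
		 *	zone->temp_priority = priority;
		 *	if (zone->prev_priority > priority)
		 *		zone->prev_priority = priority;
		 */
		nr_reclaimed += shrink_zones(priority, zones, &sc);
		if (nr_reclaimed >= sc.swap_cluster_max) {
			ret = 1;
			goto out;	/* made enough progress */
		}
	}
out:
	/*
	 * Racy: whoever gets here publishes their final temp_priority,
	 * so a reclaimer that succeeded quickly resets the zone back
	 * toward DEF_PRIORITY while another may still be struggling.
	 */
	for (i = 0; zones[i] != NULL; i++)
		zones[i]->prev_priority = zones[i]->temp_priority;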

2006-10-17 16:43:39

by Nick Piggin

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

Martin J. Bligh wrote:
>>> That's not what happens though. We walk down the priorities, fail to
>>> reclaim anything (in this case, move anything from active to inactive)
>>> and the OOM killer fires. Perhaps the other pages being freed are
>>> being stolen ... we're in direct reclaim here; we're meant to be
>>> getting our own pages.
>>
>> That may be what happens in *your* kernel, but does it happen upstream?
>> Because I expect this patch would fix the problem:
>>
>> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=408d85441cd5a9bd6bc851d677a10c605ed8db5f
>>
>> (together with the zone_near_oom thing)
>
> So? I fail to see how slapping another bandaid on top of it prevents
> us from fixing an underlying problem.

That is the fix for the underlying problem. The underlying problem is that
OOM was being triggered incorrectly without enough scanning. Your patch
is the bandaid which just tries to increase the amount of scanning a bit
in the hope that it holds off OOM for your workload.

>>> Why would we ever want distress to be based off a priority that's
>>> higher than our current one? That's just silly.
>>
>> Maybe there is an occasional GFP_NOFS reclaimer, and the workload
>> involves a huge number of easy to reclaim inodes. If there are some
>> GFP_KERNEL allocators (or kswapd) that are otherwise making a meal
>> of this work, then we don't want to start swapping stuff out.
>>
>> Hypothetical maybe, but you can't just make the assertion that it is
>> just silly, because that isn't clear. And it isn't clear that your
>> patch fixes anything.
>
> It'll stop the GFP_NOFS reclaimer going OOM. Yes, some other
> bandaids might do that as well, but what on earth is the point? By
> the time anyone gets to the lower prios, they SHOULD be setting
> distress up - they ARE in distress.

Distress is a per-zone thing. It is precisely that way because there *are*
different types of reclaim and you don't want a crippled reclaimer (which
might indeed be having trouble reclaiming stuff) saying the system
is in distress.

If they are the *only* reclaimer, then OK, distress will go up.

> The point of having zone->prev_priority is to kick everyone up into
> reclaim faster in sync with each other. Look at how it's set: it's meant
> to function as a min of everyone's prio. Then when we start successfully
> reclaiming again, we kick it back to DEF_PRIORITY to indicate everything
> is OK. But that fails to take account of the fact that some reclaimers
> will struggle more than others.

I don't agree that the thing to aim for is ensuring everyone is able
to reclaim something.

And why do you ignore the other side of the coin, where now reclaimers
that are easily able to make progress are being made to swap stuff out?

> If we're in direct reclaim in the first place, it's because we're
> allocating faster than kswapd can keep up, or we're meant to be
> throttling ourselves on dirty writeback. Either way, we need to be
> freeing our own pages, not dicking around thrashing the LRU lists
> without doing anything.

If the GFP_NOFS reclaimer is having a lot of trouble reclaiming, and so
you decide to turn on reclaim_mapped, then it is not suddenly going to
be able to free those pages.

> All of this debate seems pretty pointless to me. Look at the code, and
> what I fixed. It's clearly broken by inspection. I'm not saying this is
> the only way to fix it, or that it fixes all the world's problems. But
> it's clearly an incremental improvement by fixing an obvious bug.

It is not clearly broken. You have some picture in your mind of how you
think reclaim should work, and it doesn't match that. Fine, but that
doesn't make it a bug.

>
> No wonder Linux doesn't degrade gracefully under memory pressure - we
> refuse to reclaim when we need to.

If the zone is unable to be reclaimed from, then the priority will be
lowered.

> PS. I think we should just remove temp_priority altogether, and use a
> local variable.

Yes. I don't remember how temp_priority came about. I think it was Nikita?

--
SUSE Labs, Novell Inc.

2006-10-17 16:59:20

by Martin Bligh

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

>> It'll stop the GFP_NOFS reclaimer going OOM. Yes, some other
>> bandaids might do that as well, but what on earth is the point? By
>> the time anyone gets to the lower prios, they SHOULD be setting
>> distress up - they ARE in distress.
>
> Distress is a per-zone thing. It is precisely that way because there *are*
> different types of reclaim and you don't want a crippled reclaimer (which
> might indeed be having trouble reclaiming stuff) saying the system
> is in distress.
>
> If they are the *only* reclaimer, then OK, distress will go up.

So you'd rather the "crippled" reclaimer went and fired the OOM killer
and shot someone instead? I don't see why we should penalise them,
especially as the dirty page throttling is global, and will just kick
pretty much anyone trying to do an allocation. There's nothing magic
about the "crippled" reclaimer, as you put it. They're doing absolutely
nothing wrong, nothing they should be punished for. They need a page.

>> The point of having zone->prev_priority is to kick everyone up into
>> reclaim faster in sync with each other. Look at how it's set: it's meant
>> to function as a min of everyone's prio. Then when we start successfully
>> reclaiming again, we kick it back to DEF_PRIORITY to indicate everything
>> is OK. But that fails to take account of the fact that some reclaimers
>> will struggle more than others.
>
> I don't agree that the thing to aim for is ensuring everyone is able
> to reclaim something.
>
> And why do you ignore the other side of the coin, where now reclaimers
> that are easily able to make progress are being made to swap stuff out?

Because I'd rather err on the side of moving a few mapped pages from the
active to the inactive list than cause massive latencies for a page
allocation that's dropping into direct reclaim and/or going OOM.

>> If we're in direct reclaim in the first place, it's because we're
>> allocating faster than kswapd can keep up, or we're meant to be
>> throttling ourselves on dirty writeback. Either way, we need to be
>> freeing our own pages, not dicking around thrashing the LRU lists
>> without doing anything.
>
> If the GFP_NOFS reclaimer is having a lot of trouble reclaiming, and so
> you decide to turn on reclaim_mapped, then it is not suddenly going to
> be able to free those pages.

Well it's certainly not going to work if we don't even try. There were
ZERO pages in the inactive list at this point. The system is totally
frigging hosed and we're not even trying to reclaim pages because
we're in deluded-happy-la-la land and we think everything is fine.

This is what happens as we kick down prio levels in one thread:

priority = 12 active_distress = 0 swap_tendency = 0 gfp_mask = d0
priority = 12 active_distress = 0 swap_tendency = 0 gfp_mask = d0
priority = 11 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 10 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 9 active_distress = 0 swap_tendency = 81 gfp_mask = d0
priority = 8 active_distress = 0 swap_tendency = 81 gfp_mask = d0
priority = 7 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 6 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 5 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 4 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 3 active_distress = 25 swap_tendency = 106 gfp_mask = d0
priority = 2 active_distress = 50 swap_tendency = 131 gfp_mask = d0
priority = 1 active_distress = 0 swap_tendency = 81 gfp_mask = d0
priority = 0 active_distress = 0 swap_tendency = 81 gfp_mask = d0

Notice that distress is not kicking up as priority kicks down (see
1 and 0 at the end), because some other idiot reset prev_priority
back to 12.
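To tie the trace to the heuristic, assuming the default swappiness of 60 (the trace implies the mapped_ratio/2 term was about 21, and reclaim_mapped only switches on at swap_tendency >= 100):

	swap_tendency = mapped_ratio/2 + distress + swappiness
	              = 21 + 25 + 60 = 106   (>= 100: mapped pages eligible)
	              = 21 +  0 + 60 =  81   (< 100: mapped pages skipped)

So every time prev_priority got reset, distress collapsed to 0 and the scan fell back below the threshold, even at priority 0.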

>> All of this debate seems pretty pointless to me. Look at the code, and
>> what I fixed. It's clearly broken by inspection. I'm not saying this is
>> the only way to fix it, or that it fixes all the world's problems. But
>> it's clearly an incremental improvement by fixing an obvious bug.
>
> It is not clearly broken. You have some picture in your mind of how you
> think reclaim should work, and it doesn't match that. Fine, but that
> doesn't make it a bug.

It's not just a picture in my mind; it clearly does not work in
practice. The bugfix is pretty trivial.

>> No wonder Linux doesn't degrade gracefully under memory pressure - we
>> refuse to reclaim when we need to.
>
> If the zone is unable to be reclaimed from, then the priority will be
> lowered.

WHICH priority?

We have multiple different reclaimers firing. If ANY of them are really
struggling to reclaim, then we should start reclaiming mapped pages.
We want a min of all the reclaimers' prios here. It's not like we're
going to go nuts and wipe the whole damned list; once we've got
back SWAP_CLUSTER_MAX of them, we should return.
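(Concretely, the min-of-all-reclaimers behaviour is what shrink_zones() already attempts on the way down in 2.6.18:

	zone->temp_priority = priority;
	if (zone->prev_priority > priority)
		zone->prev_priority = priority;

the complaint is that the temp_priority copy-back on the exit path then throws that minimum away.)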

>> PS. I think we should just remove temp_priority altogether, and use a
>> local variable.
>
> Yes. I don't remember how temp_priority came about. I think it was Nikita?

OK, at least we agree on something.

M.

2006-10-17 17:15:00

by Nick Piggin

Subject: Re: [PATCH] Use min of two prio settings in calculating distress for reclaim

Martin Bligh wrote:
>> Distress is a per-zone thing. It is precisely that way because there
>> *are* different types of reclaim and you don't want a crippled
>> reclaimer (which might indeed be having trouble reclaiming stuff)
>> saying the system is in distress.
>>
>> If they are the *only* reclaimer, then OK, distress will go up.
>
> So you'd rather the "crippled" reclaimer went and fired the OOM killer
> and shot someone instead?

No, so I fixed that.
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=408d85441cd5a9bd6bc851d677a10c605ed8db5f

> I don't see why we should penalise them,
> especially as the dirty page throttling is global, and will just kick
> pretty much anyone trying to do an allocation. There's nothing magic

How does dirty page throttling kick anyone trying to do an allocation?
It kicks at page dirtying time.

> about the "crippled" reclaimer, as you put it. They're doing absolutely
> nothing wrong, nothing they should be punished for. They need a page.

When did I say anything about magic or being punished? They need a page
and they will get it when enough memory gets freed. Pages being reclaimed
by process A may be allocated by process B just fine.

>> I don't agree that the thing to aim for is ensuring everyone is able
>> to reclaim something.
>>
>> And why do you ignore the other side of the coin, where now reclaimers
>> that are easily able to make progress are being made to swap stuff out?
>
> Because I'd rather err on the side of moving a few mapped pages from the
> active to the inactive list than cause massive latencies for a page
> allocation that's dropping into direct reclaim and/or going OOM.

We shouldn't go OOM. And there are latencies everywhere and this won't
fix them. A GFP_NOIO allocator can't swap out pages at all, for example.

>> If the GFP_NOFS reclaimer is having a lot of trouble reclaiming, and so
>> you decide to turn on reclaim_mapped, then it is not suddenly going to
>> be able to free those pages.
>
> Well it's certainly not going to work if we don't even try. There were
> ZERO pages in the inactive list at this point. The system is totally
> frigging hosed and we're not even trying to reclaim pages because
> we're in deluded-happy-la-la land and we think everything is fine.

So that could be the temp_priority race. If no progress is being made
anywhere, the current logic (minus races) says that prev_prio should
reach 0, regardless of whether it is GFP_NOFS or whatever.

> This is what happens as we kick down prio levels in one thread:
>
> priority = 12 active_distress = 0 swap_tendency = 0 gfp_mask = d0
> priority = 12 active_distress = 0 swap_tendency = 0 gfp_mask = d0
> priority = 11 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 10 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 9 active_distress = 0 swap_tendency = 81 gfp_mask = d0
> priority = 8 active_distress = 0 swap_tendency = 81 gfp_mask = d0
> priority = 7 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 6 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 5 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 4 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 3 active_distress = 25 swap_tendency = 106 gfp_mask = d0
> priority = 2 active_distress = 50 swap_tendency = 131 gfp_mask = d0
> priority = 1 active_distress = 0 swap_tendency = 81 gfp_mask = d0
> priority = 0 active_distress = 0 swap_tendency = 81 gfp_mask = d0
>
> Notice that distress is not kicking up as priority kicks down (see
> 1 and 0 at the end), because some other idiot reset prev_priority
> back to 12.

Fine, so fix that race rather than papering over it by using the min
of prev_priority and current priority.

--
SUSE Labs, Novell Inc.