2010-04-19 12:22:50

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

Sorry for the late reply, but after digging through another pile of tasks I'm happy to come back to this issue, and I'll try to answer all the open questions.
Fortunately I'm also able to add a few new insights that might resurrect this discussion^^

Regarding the requested CFQ scheduler tuning: it's the deadline scheduler that is in use here :-)
So I can't apply all of that. But in the past I was already able to show that all of the "slowdown" occurs above the block device layer (read back through our threads if interested in the details), which effectively takes all lower-layer tuning out of the critical path.

Corrado also asked for iostat data; for the reason explained above (the issue sits above the block device layer) it doesn't contain much useful information, as expected.
So I'll just add one line each for the good/bad case to show that things like the request size are the same - everything is just slower.
This "being slower" is caused by requests arriving at the block device layer at a lower rate - which in turn is caused by our beloved full timeouts in congestion_wait.

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
bad sdb 0.00 0.00 154.50 0.00 70144.00 0.00 908.01 0.62 4.05 2.72 42.00
good sdb 0.00 0.00 270.50 0.00 122624.00 0.00 906.65 1.32 4.94 2.92 79.00
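
(Reading these two samples: ~120 MB/s good vs ~70 MB/s bad, i.e. roughly a 43% drop in this snapshot, while avgrq-sz stays at ~907 sectors (~450 KB) in both cases - the requests look the same, they just arrive less often.)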


Now to what is probably the most critical part - the evict-once discussion in this thread.
I'll try to explain what I found in the meantime - let me know what's unclear and I'll add data etc.

In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" improves the accuracy of the test case by lowering the noise from 5-8% to <1%.
Therefore I ran all tests and verifications with those cache drops.
In the meantime I unfortunately discovered that Mel's fix only helps in the cases where the caches are dropped.
Without dropping caches it seems to be bad all the time. So don't cast the patch away because of that discovery :-)

On the good side, that insight also allowed me to analyze a few more things - and it might give us new data to debug the root cause.
Like Mel, I had already identified "56e49d21 vmscan: evict use-once pages first" as related in the past. But without the watermark wait fix, reverting 56e49d21 didn't change much for my case, so I dropped that analysis path back then.

But now, after finding that dropping caches is the key to "getting back good performance" and that subsequent writes bring back the bad performance even with the watermark wait fix applied, I checked what else changes:
- first write/read load after reboot or dropping caches -> read throughput good
- second write/read load after reboot or dropping caches -> read throughput bad
=> so what changed?

I went through all kinds of logs and found something in the system activity report which is very probably related to 56e49d21.
When issuing subsequent writes after dropping caches to get a clean start, I see this for Buffers/Cached in /proc/meminfo:

pre write 1
Buffers: 484 kB
Cached: 5664 kB
pre write 2
Buffers: 33500 kB
Cached: 149856 kB
pre write 3
Buffers: 65564 kB
Cached: 115888 kB
pre write 4
Buffers: 85556 kB
Cached: 97184 kB

It stays at ~85M with more writes, which is approximately 50% of my free 160M of memory.
Once Buffers has reached the ~65M level, every following read load - no matter how much read load I throw at the system - shows the bad throughput.
Dropping caches - and thereby removing these buffers - gives back the good performance.

So far I have found no alternative to a manual drop_caches, but recommending a 30-second cron job that drops caches to get good read performance is not something I want to tell customers anyway.
I checked whether the buffers get cleaned up at some point, but neither a lot of subsequent read loads pushing the pressure towards the read page cache (I hoped the buffers would age out and eventually get thrown out) nor waiting a long time helped.
The system seems totally unable to get rid of these buffers without my manual help via drop_caches.

To draw one realistic scenario: imagine a huge customer DB running writes & reads fine during the day, with a large nightly backup that loses 50% read throughput because the kernel keeps 50% of memory as buffers all night - and therefore no longer fits into its backup window.
Is there any way to avoid this "never free these buffers" behavior while still getting all or some of the intended benefits of 56e49d21?

Ideas welcome

P.S. This analysis is still based on a .32 stable kernel plus Mel's watermark wait patch - I plan to check current kernels as well once I find the time, but let me know if there are known fixes related to this issue that I should test asap.

Mel Gorman wrote:
> On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
>> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>>
>>> Test scenario
>>> =============
>>> X86-64 machine 1 socket 4 cores
>>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>>> on-board and a piece of crap, and a decent RAID card could blow
>>> the budget.
>>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>>> Christian was doing
>> With that many disks, you can easily have dozens of megabytes
>> of data in flight to the disk at once. That is a major
>> fraction of memory.
>>
>
> That is easily possible. Note, I'm not maintaining this workload configuration
> is a good idea.
>
> The background to this problem is Christian running a disk-intensive iozone
> workload over many CPUs and disks with limited memory. It's already known
> that if he added a small amount of extra memory, the problem went away.
> The problem was a massive throughput regression and a bisect pinpointed
> two patches (both mine) but neither make sense. One altered the order pages
> come back from lists but not availability and his hardware does no automatic
> merging. A second does alter the availability of pages via the per-cpu lists
> but reverting the behaviour didn't help.
>
> The first fix to this was to replace congestion_wait with a waitqueue
> that woke up processes if the watermarks were met. This fixed
> Christian's problem but Andrew wants to pin the underlying cause.
>
> I strongly suspect that evict-once behaves sensibly when memory is ample
> but in this particular case, it's not helping.
>
>> In fact, you might have all of the inactive file pages under
>> IO...
>>
>
> Possibly. The tests have a write and a read phase but I wasn't
> collecting the data with sufficient granularity to see which of the
> tests are actually stalling.
>
>>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>>> fix title: revertevict
>>> fixed in mainline? no
>>> affects: 2.6.31 to now
>>>
>>> For reasons that are not immediately obvious, the evict-once patches
>>> *really* hurt the time spent on congestion and the number of pages
>>> reclaimed. Rik, I'm afraid I'm punting this to you for explanation
>>> because clearly you tested this for AIM7 and might have some
>>> theories. For the purposes of testing, I just reverted the changes.
>> The patch helped IO tests with reasonable amounts of memory
>> available, because the VM can cache frequently used data
>> much more effectively.
>>
>> This comes at the cost of caching less recently accessed
>> use-once data, which should not be an issue since the data
>> is only used once...
>>
>
> Indeed. With or without evict-once, I'd have an expectation of all the
> pages being recycled anyway because of the amount of data involved.
>
>>> Rik, any theory on evict-once?
>> No real theories yet, just the observation that your revert
>> appears to be buggy (see below) and the possibility that your
>> test may have all of the inactive file pages under IO...
>>
>
> Bah. I had the initial revert right and screwed up reverting from
> 2.6.32.10 on. I'm rerunning the tests. Is this right?
>
> - if (is_active_lru(lru)) {
> - if (inactive_list_is_low(zone, sc, file))
> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
> + if (is_active_lru(lru)) {
> + shrink_active_list(nr_to_scan, zone, sc, priority, file);
> return 0;
>
>
>> Can you reproduce the stall if you lower the dirty limits?
>>
>
> I'm rerunning the revertevict patches at the moment. When they complete,
> I'll experiment with dirty limits. Any suggested values or will I just
> increase it by some arbitrary amount and see what falls out? e.g.
> increase dirty_ratio to 80.
>
>>> static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>> struct zone *zone, struct scan_control *sc, int priority)
>>> {
>>> int file = is_file_lru(lru);
>>>
>>> - if (is_active_lru(lru)) {
>>> - if (inactive_list_is_low(zone, sc, file))
>>> - shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>> + if (lru == LRU_ACTIVE_FILE) {
>>> + shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>> return 0;
>>> }
>> Your revert is buggy. With this change, anonymous pages will
>> never get deactivated via shrink_list.
>>
>
> /me slaps self
>

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


2010-04-19 21:44:59

by Johannes Weiner

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> So now coming to the probably most critical part - the evict once discussion in this thread.
> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
>
> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
> Therefore I ran all tests and verifications with that drops.
> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
>
> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
>
> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
> - first write/read load after reboot or dropping caches -> read TP good
> - second write/read load after reboot or dropping caches -> read TP bad
> => so what changed.
>
> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
>
> pre write 1
> Buffers: 484 kB
> Cached: 5664 kB
> pre write 2
> Buffers: 33500 kB
> Cached: 149856 kB
> pre write 3
> Buffers: 65564 kB
> Cached: 115888 kB
> pre write 4
> Buffers: 85556 kB
> Cached: 97184 kB
>
> It stays at ~85M with more writes which is approx 50% of my free 160M memory.

Ok, so I am the idiot that got quoted on 'the active set is not too big, so
buffer heads are not a problem when avoiding to scan it' in eternal history.

But the threshold inactive/active ratio for skipping active file pages is
actually 1:1.

The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
to be a bit more natural anyway? Below is a patch that changes it to 2:1.
Christian, can you check if it fixes your regression?

Additionally, we can always scan active file pages but only deactivate them
when the ratio is off and otherwise strip buffers of clean pages.

What do people think?

Hannes

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ede99..a4aea76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);

- return (active > inactive);
+ return (active > inactive / 2);
}

unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..8f1a846 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
active = zone_page_state(zone, NR_ACTIVE_FILE);
inactive = zone_page_state(zone, NR_INACTIVE_FILE);

- return (active > inactive);
+ return (active > inactive / 2);
}

/**

2010-04-20 07:21:18

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure



Johannes Weiner wrote:
> On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
>> So now coming to the probably most critical part - the evict once discussion in this thread.
>> I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
>>
>> In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
>> Therefore I ran all tests and verifications with that drops.
>> In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
>> Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
>>
>> On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
>> Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
>>
>> But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
>> - first write/read load after reboot or dropping caches -> read TP good
>> - second write/read load after reboot or dropping caches -> read TP bad
>> => so what changed.
>>
>> I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
>> When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
>>
>> pre write 1
>> Buffers: 484 kB
>> Cached: 5664 kB
>> pre write 2
>> Buffers: 33500 kB
>> Cached: 149856 kB
>> pre write 3
>> Buffers: 65564 kB
>> Cached: 115888 kB
>> pre write 4
>> Buffers: 85556 kB
>> Cached: 97184 kB
>>
>> It stays at ~85M with more writes which is approx 50% of my free 160M memory.
>
> Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> buffer heads are not a problem when avoiding to scan it' in eternal history.
>
> But the threshold inactive/active ratio for skipping active file pages is
> actually 1:1.
>
> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
> to be a bit more natural anyway? Below is a patch that changes it to 2:1.
> Christian, can you check if it fixes your regression?

I'll check it out.
From the numbers I have so far I know that the good->bad transition for
my case is somewhere between 30M and 60M, i.e. between the first and
second write.
A 2:1 ratio would still eat up to ~53M of the ~160M that gets split up.

That means setting the ratio to 2:1 (or to anything else) may or may not
help; eventually there is just another combination of workload and memory
constraints that would still be affected. Still, I guess 3:1 (and I'll
try that as well) should be enough to be a bit more on the safe side.
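
As a rough sketch of where these numbers come from (just my own
back-of-the-envelope arithmetic, assuming ~160M ends up on the file LRU
lists): with the check "active > inactive / N", deactivation stops once
the active file list has shrunk to roughly 1/(N+1) of the file cache.

#include <stdio.h>

/* back-of-the-envelope only, not kernel code */
int main(void)
{
	unsigned long file_cache_mb = 160;	/* assumed usable file cache */
	int n;

	for (n = 1; n <= 4; n++)	/* 1:1, 2:1, 3:1 and 4:1 */
		printf("%d:1 -> active file list capped at ~%lu MB\n",
		       n, file_cache_mb / (n + 1));
	return 0;
}

That gives ~80M at 1:1, ~53M at 2:1, ~40M at 3:1 and ~32M at 4:1 - so
2:1 still protects more than the lower end of my 30M-60M transition range.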

> Additionally, we can always scan active file pages but only deactivate them
> when the ratio is off and otherwise strip buffers of clean pages.

I think we need something that allows the system to forget its history
at some point - be it at 1:1 or x:1 - if the workload changes for "long
enough"(tm), it should eventually throw all the old things out.
Like I described before, many systems have different usage patterns when
e.g. comparing day and night workloads. So it is far from optimal if,
say, the daytime write load eats that much cache and never gives it back
to a huge nightly read task, or something similar.

Would your suggestion already achieve that?
If not, what kind of change could?

> What do people think?
>
> Hannes
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ede99..a4aea76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
> active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
>
> - return (active > inactive);
> + return (active > inactive / 2);
> }
>
> unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..8f1a846 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
> active = zone_page_state(zone, NR_ACTIVE_FILE);
> inactive = zone_page_state(zone, NR_INACTIVE_FILE);
>
> - return (active > inactive);
> + return (active > inactive / 2);
> }
>
> /**
>

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2010-04-20 08:54:45

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure



Christian Ehrhardt wrote:
>
>
> Johannes Weiner wrote:
[...]

>>>
>>> It stays at ~85M with more writes which is approx 50% of my free 160M
>>> memory.
>>
>> Ok, so I am the idiot that got quoted on 'the active set is not too
>> big, so
>> buffer heads are not a problem when avoiding to scan it' in eternal
>> history.
>>
>> But the threshold inactive/active ratio for skipping active file pages is
>> actually 1:1.
>>
>> The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?)
>> appears
>> to be a bit more natural anyway? Below is a patch that changes it to
>> 2:1.
>> Christian, can you check if it fixes your regression?
>
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
>
> That means setting the ratio to 2:1 or whatever else might help or not,
> but eventually there is just another setting of workload vs. memory
> constraints that would still be affected. Still I guess 3:1 (and I'll
> try that as well) should be enough to be a bit more towards the save side.

For "my case", 2:1 is not enough, 3:1 almost fixes it and 4:1 fixes the
issue. Still, as I mentioned before, I think any value carved in stone
can and will be bad for some use case - as 1:1 is for mine.

If we end up being unable to fix it internally - by allowing the system
to "forget" and eventually free old, unused buffers at some point - then
we should implement it neither as 2:1 nor 3:1 nor anything else fixed,
but make it userspace configurable, e.g. /proc/sys/vm/active_inactive_ratio.

I hope your suggestion below, or an extension of it, will allow the
kernel to free the buffers at some point. Depending on how well and how
fast that solution works, we can still modify the ratio if needed.

>> Additionally, we can always scan active file pages but only deactivate
>> them
>> when the ratio is off and otherwise strip buffers of clean pages.
>
> I think we need something that allows the system to forget its history
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm)
> it should eventually throw all old things out.
> Like I described before many systems have different usage patterns when
> e.g. comparing day/night workload. So it is far from optimal if e.g. day
> write loads eat so much cache and never give it back for nightly huge
> reads tasks or something similar.
>
> Would your suggestion achieve that already?
> If not what kind change could?
>
[...]
--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2010-04-20 14:41:05

by Rik van Riel

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On 04/19/2010 05:44 PM, Johannes Weiner wrote:

> What do people think?

It has potential advantages and disadvantages.

On smaller desktop systems, it is entirely possible that
the working set is close to half of the page cache. Your
patch reduces the amount of memory that is protected on
the active file list, so it may cause part of the working
set to get evicted.

On the other hand, having a smaller active list frees up
more memory for sequential (streaming, use-once) disk IO.
This can be useful on systems with large IO subsystems
and small memory (like Christian's s390 virtual machine,
with 256MB RAM and 4 disks!).

I wonder if we could not find some automatic way to
balance between these two situations, for example by
excluding currently-in-flight pages from the calculations.

In Christian's case, he could have 160MB of cache (buffer
+ page cache), of which 70MB is in flight to disk at a
time. It may be worthwhile to exclude that 70MB from the
total and aim for 45MB active file and 45MB inactive file
pages on his system. That way IO does not get starved.

On a desktop system, which needs the working set protected
and does less IO, we will automatically protect more of
the working set - since there is no IO to starve.
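
A minimal sketch of that idea, just to illustrate (not a tested or
posted patch; it approximates "in flight" by pages under writeback via
the NR_WRITEBACK zone counter and assumes those pages sit on the
inactive file list - in-flight read pages would need another counter):

static int inactive_file_is_low_global(struct zone *zone)
{
	unsigned long active, inactive, in_flight;

	active    = zone_page_state(zone, NR_ACTIVE_FILE);
	inactive  = zone_page_state(zone, NR_INACTIVE_FILE);
	in_flight = zone_page_state(zone, NR_WRITEBACK);

	/* balance the active list only against inactive pages not under IO */
	if (inactive > in_flight)
		inactive -= in_flight;
	else
		inactive = 0;

	return active > inactive;
}

With the example numbers above (~160MB of cache, ~70MB in flight) this
would settle at roughly 45MB of active file pages, i.e. the split
described above.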

2010-04-20 15:32:37

by Johannes Weiner

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On Tue, Apr 20, 2010 at 09:20:58AM +0200, Christian Ehrhardt wrote:
>
>
> Johannes Weiner wrote:
> >On Mon, Apr 19, 2010 at 02:22:36PM +0200, Christian Ehrhardt wrote:
> >>So now coming to the probably most critical part - the evict once
> >>discussion in this thread.
> >>I'll try to explain what I found in the meanwhile - let me know whats
> >>unclear and I'll add data etc.
> >>
> >>In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps
> >>to improve the accuracy of the used testcase by lowering the noise from
> >>5-8% to <1%.
> >>Therefore I ran all tests and verifications with that drops.
> >>In the meanwhile I unfortunately discovered that Mel's fix only helps for
> >>the cases when the caches are dropped.
> >>Without it seems to be bad all the time. So don't cast the patch away due
> >>to that discovery :-)
> >>
> >>On the good side I was also able to analyze a few more things due to that
> >>insight - and it might give us new data to debug the root cause.
> >>Like Mel I also had identified "56e49d21 vmscan: evict use-once pages
> >>first" to be related in the past. But without the watermark wait fix,
> >>unapplying it 56e49d21 didn't change much for my case so I left this
> >>analysis path.
> >>
> >>But now after I found dropping caches is the key to "get back good
> >>performance" and "subsequent writes for bad performance" even with
> >>watermark wait applied I checked what else changes:
> >>- first write/read load after reboot or dropping caches -> read TP good
> >>- second write/read load after reboot or dropping caches -> read TP bad
> >>=> so what changed.
> >>
> >>I went through all kind of logs and found something in the system
> >>activity report which very probably is related to 56e49d21.
> >>When issuing subsequent writes after I dropped caches to get a clean
> >>start I get this in Buffers/Caches from Meminfo:
> >>
> >>pre write 1
> >>Buffers: 484 kB
> >>Cached: 5664 kB
> >>pre write 2
> >>Buffers: 33500 kB
> >>Cached: 149856 kB
> >>pre write 3
> >>Buffers: 65564 kB
> >>Cached: 115888 kB
> >>pre write 4
> >>Buffers: 85556 kB
> >>Cached: 97184 kB
> >>
> >>It stays at ~85M with more writes which is approx 50% of my free 160M
> >>memory.
> >
> >Ok, so I am the idiot that got quoted on 'the active set is not too big, so
> >buffer heads are not a problem when avoiding to scan it' in eternal
> >history.
> >
> >But the threshold inactive/active ratio for skipping active file pages is
> >actually 1:1.
> >
> >The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?)
> >appears
> >to be a bit more natural anyway? Below is a patch that changes it to 2:1.
> >Christian, can you check if it fixes your regression?
>
> I'll check it out.
> from the numbers I have up to now I know that the good->bad transition
> for my case is somewhere between 30M/60M e.g. first and second write.
> The ratio 2:1 will eat max 53M of my ~160M that gets split up.
>
> That means setting the ratio to 2:1 or whatever else might help or not,
> but eventually there is just another setting of workload vs. memory
> constraints that would still be affected. Still I guess 3:1 (and I'll
> try that as well) should be enough to be a bit more towards the save side.
>
> >Additionally, we can always scan active file pages but only deactivate them
> >when the ratio is off and otherwise strip buffers of clean pages.
>
> I think we need something that allows the system to forget its history
> somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm)
> it should eventually throw all old things out.

The idea is that it pans out on its own. If the workload changes, new
pages get activated and when that set grows too large, we start shrinking
it again.

Of course, right now this unscanned set is way too large and we can end
up wasting up to 50% of usable page cache on false active pages.

A fixed ratio does not scale with varying workloads, obviously, but having
it at a safe level still seems like a good trade-off.

We can still do the optimization, and in the worst case the amount of
memory wasted on false active pages is small enough that it should leave
the system performant.

You have a rather extreme page cache load. If 4:1 works for you, I think
this is a safe bet for now because we only frob the knobs into the
direction of earlier kernel behaviour.

We still have a nice amount of pages we do not need to scan regularly
(up to 50k file pages for a streaming IO load on a 1G machine).
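
(Roughly: 1024 MB / (4 + 1) ≈ 205 MB of protected active file pages at
4:1, and 205 MB / 4 KB per page ≈ 51k pages - assuming 4KB pages and
that nearly all of the 1G ends up as file cache.)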

Hannes

2010-04-20 17:23:45

by Rik van Riel

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On 04/20/2010 11:32 AM, Johannes Weiner wrote:

> The idea is that it pans out on its own. If the workload changes, new
> pages get activated and when that set grows too large, we start shrinking
> it again.
>
> Of course, right now this unscanned set is way too large and we can end
> up wasting up to 50% of usable page cache on false active pages.

Thing is, changing workloads often change back.

Specifically, think of a desktop system that is doing
work for the user during the day and gets backed up
at night.

You do not want the backup to kick the working set
out of memory, because when the user returns in the
morning the desktop should come back quickly after
the screensaver is unlocked.

The big question is, what workload suffers from
having the inactive list at 50% of the page cache?

So far the only big problem we have seen is on a
very unbalanced virtual machine, with 256MB RAM
and 4 fast disks. The disks simply have more IO
in flight at once than what fits in the inactive
list.

This is a very untypical situation, and we can
probably solve it by excluding the in-flight pages
from the active/inactive file calculation.

2010-04-21 04:24:08

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure



Rik van Riel wrote:
> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>
>> The idea is that it pans out on its own. If the workload changes, new
>> pages get activated and when that set grows too large, we start shrinking
>> it again.
>>
>> Of course, right now this unscanned set is way too large and we can end
>> up wasting up to 50% of usable page cache on false active pages.
>
> Thing is, changing workloads often change back.
>
> Specifically, think of a desktop system that is doing
> work for the user during the day and gets backed up
> at night.
>
> You do not want the backup to kick the working set
> out of memory, because when the user returns in the
> morning the desktop should come back quickly after
> the screensaver is unlocked.

IMHO it is fine to prevent the case where that nightly backup job isn't
finished when the user arrives in the morning just because we didn't
give it some more cache - and e.g. a 30 second transition from/to either
optimized state would be fine.
But I guess the point is that both behaviors are reasonable goals -
depending on the user's needs.

What we could do is combine all the thoughts we have had so far:
a) Rik could create an experimental patch that excludes the in-flight pages
b) Johannes could create one for his suggestion to "always scan active
file pages but only deactivate them when the ratio is off and otherwise
strip buffers of clean pages"
c) I would extend Johannes' patch to make the ratio of active/inactive
pages a userspace tunable

a, b and a+b would then need to be tested to see whether they achieve
better behavior.

c, on the other hand, would be a fine tunable to let administrators (who
know their workloads) or distributions (e.g. different defaults for
desktop/server) adapt their installations.

In theory a, b and c should work fine together in case we need all of them.

> The big question is, what workload suffers from
> having the inactive list at 50% of the page cache?
>
> So far the only big problem we have seen is on a
> very unbalanced virtual machine, with 256MB RAM
> and 4 fast disks. The disks simply have more IO
> in flight at once than what fits in the inactive
> list.

Did I get you right that this refers to the write case - explaining why
it builds up buffers to the 50% maximum?

Note: It even uses up to 64 disks, with 1 disk per thread so e.g. 16
threads => 16 disks.

Regarding "unbalanced": I'd like to mention that over the years I have
learned that virtualized systems sometimes end up looking like that
without it being intended - it happens when more and more guests are
added and guest memory ballooning is left to take care of it.

> This is a very untypical situation, and we can
> probably solve it by excluding the in-flight pages
> from the active/inactive file calculation.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2010-04-21 07:35:51

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure



Christian Ehrhardt wrote:
>
>
> Rik van Riel wrote:
>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>
>>> The idea is that it pans out on its own. If the workload changes, new
>>> pages get activated and when that set grows too large, we start
>>> shrinking
>>> it again.
>>>
>>> Of course, right now this unscanned set is way too large and we can end
>>> up wasting up to 50% of usable page cache on false active pages.
>>
>> Thing is, changing workloads often change back.
>>
>> Specifically, think of a desktop system that is doing
>> work for the user during the day and gets backed up
>> at night.
>>
>> You do not want the backup to kick the working set
>> out of memory, because when the user returns in the
>> morning the desktop should come back quickly after
>> the screensaver is unlocked.
>
> IMHO it is fine to prevent that nightly backup job from not being
> finished when the user arrives at morning because we didn't give him
> some more cache - and e.g. a 30 sec transition from/to both optimized
> states is fine.
> But eventually I guess the point is that both behaviors are reasonable
> to achieve - depending on the users needs.
>
> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active
> file pages but only deactivate them when the ratio is off and otherwise
> strip buffers of clean pages"
> c) I would extend the patch from Johannes setting the ratio of
> active/inactive pages to be a userspace tunable

A first revision of patch c is attached.
I tested assigning different percentages; so far e.g. 50 really behaves
like before and 25 protects ~42M of Buffers in my example, which matches
the intended behavior - see the patch for more details.

Checkpatch and some basic functional tests went fine.
While it may not be perfect yet, I think it is ready for feedback now.

> a,b,a+b would then need to be tested if they achieve a better behavior.
>
> c on the other hand would be a fine tunable to let administrators
> (knowing their workloads) or distributions (e.g. different values for
> Desktop/Server defaults) adapt their installations.
>
> In theory a,b and c should work fine together in case we need all of them.
>
>> The big question is, what workload suffers from
>> having the inactive list at 50% of the page cache?
>>
>> So far the only big problem we have seen is on a
>> very unbalanced virtual machine, with 256MB RAM
>> and 4 fast disks. The disks simply have more IO
>> in flight at once than what fits in the inactive
>> list.
>
> Did I get you right that this means the write case - explaining why it
> is building up buffers to the 50% max?
>

Thinking about it, I wondered what these buffers are actually protected
for. If the intention is to keep them for reuse by similar loads, I
wonder why I "need" three iozone runs to build up the 85M in my case.

Buffers start at ~0; after iozone run 1 they are at ~35M, after run #2
at ~65M and after run #3 at ~85M.
Shouldn't the first run either allocate the 85M directly, in case that
much is needed for a single run - or, if not, shouldn't the second and
third runs just reuse the 35M of buffers still held from the first run?

Note - "1 iozone run" means "iozone ... -i 0", which sequentially writes
and then rewrites a 2Gb file on 16 disks in my current case.

I'm especially looking forward to patch b, as I'd really like to see a
kernel that can win back these buffers once they are no longer used for
a longer period, while still being able to grow & protect them while
they are needed.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


Attachments:
active-inacte-ratio-tunable.diff (4.56 kB)

2010-04-21 09:03:58

by Johannes Weiner

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On Wed, Apr 21, 2010 at 06:23:45AM +0200, Christian Ehrhardt wrote:
> Rik van Riel wrote:
> >You do not want the backup to kick the working set
> >out of memory, because when the user returns in the
> >morning the desktop should come back quickly after
> >the screensaver is unlocked.
>
> IMHO it is fine to prevent that nightly backup job from not being
> finished when the user arrives at morning because we didn't give him
> some more cache - and e.g. a 30 sec transition from/to both optimized
> states is fine.

For batched work maybe :-)

> What we could do is combine all our thoughts we had so far:
> a) Rik could create an experimental patch that excludes the in flight pages
> b) Johannes could create one for his suggestion to "always scan active
> file pages but only deactivate them when the ratio is off and otherwise
> strip buffers of clean pages"

Please drop that idea, that 'Buffers:' is a red herring. It's just pages
that do not back files but block devices. Stripping buffer_heads won't
achieve anything, we need to get rid of the pages. Sorry, I should have
slept and thought before writing that suggestion.

2010-04-21 13:20:28

by Rik van Riel

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>
>
> Christian Ehrhardt wrote:
>>
>>
>> Rik van Riel wrote:
>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>
>>>> The idea is that it pans out on its own. If the workload changes, new
>>>> pages get activated and when that set grows too large, we start
>>>> shrinking
>>>> it again.
>>>>
>>>> Of course, right now this unscanned set is way too large and we can end
>>>> up wasting up to 50% of usable page cache on false active pages.
>>>
>>> Thing is, changing workloads often change back.
>>>
>>> Specifically, think of a desktop system that is doing
>>> work for the user during the day and gets backed up
>>> at night.
>>>
>>> You do not want the backup to kick the working set
>>> out of memory, because when the user returns in the
>>> morning the desktop should come back quickly after
>>> the screensaver is unlocked.
>>
>> IMHO it is fine to prevent that nightly backup job from not being
>> finished when the user arrives at morning because we didn't give him
>> some more cache - and e.g. a 30 sec transition from/to both optimized
>> states is fine.
>> But eventually I guess the point is that both behaviors are reasonable
>> to achieve - depending on the users needs.
>>
>> What we could do is combine all our thoughts we had so far:
>> a) Rik could create an experimental patch that excludes the in flight
>> pages
>> b) Johannes could create one for his suggestion to "always scan active
>> file pages but only deactivate them when the ratio is off and
>> otherwise strip buffers of clean pages"

I think you are confusing "buffer heads" with "buffers".

You can strip buffer heads off pages, but that is not
your problem.

"buffers" in /proc/meminfo stands for cached metadata,
eg. the filesystem journal, inodes, directories, etc...
Caching such metadata is legitimate, because it reduces
the number of disk seeks down the line.

2010-04-21 13:21:51

by Rik van Riel

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

On 04/21/2010 12:23 AM, Christian Ehrhardt wrote:

> IMHO it is fine to prevent that nightly backup job from not being
> finished when the user arrives at morning because we didn't give him
> some more cache

How on earth would a backup job benefit from cache?

It only accesses each bit of data once, so caching the
to-be-backed-up data is a waste of memory.

2010-04-22 06:21:20

by Christian Ehrhardt

Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

Trying to answer and consolidate all open parts of this thread down below.

Rik van Riel wrote:
> On 04/21/2010 03:35 AM, Christian Ehrhardt wrote:
>>
>>
>> Christian Ehrhardt wrote:
>>>
>>>
>>> Rik van Riel wrote:
>>>> On 04/20/2010 11:32 AM, Johannes Weiner wrote:
>>>>
>>>>> The idea is that it pans out on its own. If the workload changes, new
>>>>> pages get activated and when that set grows too large, we start
>>>>> shrinking
>>>>> it again.
>>>>>
>>>>> Of course, right now this unscanned set is way too large and we can
>>>>> end
>>>>> up wasting up to 50% of usable page cache on false active pages.
>>>>
>>>> Thing is, changing workloads often change back.
>>>>
>>>> Specifically, think of a desktop system that is doing
>>>> work for the user during the day and gets backed up
>>>> at night.
>>>>
>>>> You do not want the backup to kick the working set
>>>> out of memory, because when the user returns in the
>>>> morning the desktop should come back quickly after
>>>> the screensaver is unlocked.
>>>
>>> IMHO it is fine to prevent that nightly backup job from not being
>>> finished when the user arrives at morning because we didn't give him
>>> some more cache - and e.g. a 30 sec transition from/to both optimized
>>> states is fine.
>>> But eventually I guess the point is that both behaviors are reasonable
>>> to achieve - depending on the users needs.
>>>
>>> What we could do is combine all our thoughts we had so far:
>>> a) Rik could create an experimental patch that excludes the in flight
>>> pages
>>> b) Johannes could create one for his suggestion to "always scan active
>>> file pages but only deactivate them when the ratio is off and
>>> otherwise strip buffers of clean pages"
>
> I think you are confusing "buffer heads" with "buffers".
>
> You can strip buffer heads off pages, but that is not
> your problem.
>
> "buffers" in /proc/meminfo stands for cached metadata,
> eg. the filesystem journal, inodes, directories, etc...
> Caching such metadata is legitimate, because it reduces
> the number of disk seeks down the line.

Yeah, I mixed that up as well, thanks for the clarification (Johannes
wrote a similar response, effectively removing b) from the list of
things we could do).

Regarding your question from thread reply#3
> How on earth would a backup job benefit from cache?
>
> It only accesses each bit of data once, so caching the
> to-be-backed-up data is a waste of memory.

On a low-memory system with a lot of disks (like in my case), giving it
more cache allows e.g. larger readaheads or less cache thrashing - but
it might be OK, as it is probably a rare case to hit all of those
constraints at once.
But as we discussed before, on virtual servers it can happen from time
to time due to ballooning, many more disk attachments, etc.



So this is definitely not the majority of cases, but there are corner
cases here and there that would at least benefit from making the
preserved ratio configurable, if we don't find a good way to let the
system take the memory back without hurting the intended preservation
functionality.

For that reason - how about the patch I posted yesterday? (To
consolidate this spread-out thread I attach it here again.)



And finally, I would still like to understand why writing the same files
three times increases the active file pages each time, instead of reusing
those already brought into memory by the first run.
To collect that last open thread as well, I'll cite my own question here:

> Thinking about it, I wondered what these buffers are actually protected
> for. If the intention is to keep them for reuse by similar loads, I
> wonder why I "need" three iozone runs to build up the 85M in my case.

> Buffers start at ~0; after iozone run 1 they are at ~35M, after run #2
> at ~65M and after run #3 at ~85M.
> Shouldn't the first run either allocate the 85M directly, in case that
> much is needed for a single run - or, if not, shouldn't the second and
> third runs just reuse the 35M of buffers still held from the first run?

> Note - "1 iozone run" means "iozone ... -i 0", which sequentially writes
> and then rewrites a 2Gb file on 16 disks in my current case.

Trying to answer this question myself using your explanation of buffers
above doesn't completely add up without further clarification, as the
same files should have the same directory, inode, ... (all ext2 in my
case, so no journal data either).


--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


Attachments:
active-inacte-ratio-tunable.diff (4.56 kB)

2010-04-26 10:59:50

by Christian Ehrhardt

Subject: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <[email protected]>

*updates in v2*
- use do_div

This patch creates a knob to help users whose workloads suffer from the
fixed 1:1 active/inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides a tuning mechanism for other users that want an even bigger
working set to be protected.

To be honest, the best solution would be to allow a system that is not using
the working set to regain that memory *at some point*, and therefore without
drawbacks for the scenarios 56e49d21 was implemented for, e.g. UI interactivity
while copying a lot of data. But up to now there has been no idea how to
implement that behaviour.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
"In the worse scenario, it could waste half the memory that could
otherwise be used for readahead buffer and to prevent thrashing, in a
server serving large datasets that are hardly reused, but still slowly
builds up its active list during the long uptime (think about a slowly
performance downgrade that can be fixed by a crude dropcache action).

That said, the actual performance degradation could be much smaller -
say 15% - all memories are not equal."

We have now identified a case with up to -60% throughput, so this patch tries
to provide a gentler interface than drop_caches to help a system that is stuck
in this situation.

In discussion with Rik van Riel and Johannes Weiner we found that there are
cases that want the current "save 50%" for the working set all the time, and
others that would benefit from protecting only a smaller amount.

Eventually no "carved in stone" in-kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0% - like a kernel pre 56e49d21
- x% - allow customizing the system to someones needs

Based on our experiments the suggested default in this patch is 25%, but if
preferred I'm fine with keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <[email protected]>
---

[diffstat]
Documentation/sysctl/vm.txt | 10 ++++++++++
include/linux/mm.h | 2 ++
kernel/sysctl.c | 9 +++++++++
mm/memcontrol.c | 9 ++++++---
mm/vmscan.c | 17 ++++++++++++++---
5 files changed, 41 insertions(+), 6 deletions(-)

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt 2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt 2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@

Currently, these files are in /proc/sys/vm:

+- active_inactive_ratio
- block_dump
- dirty_background_bytes
- dirty_background_ratio
@@ -57,6 +58,15 @@

==============================================================

+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when evicting pages until this
+ratio is reached.
+This tunable represents that ratio in percent and specifies the protected part
+
+==============================================================
+
block_dump

block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c 2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
.extra2 = &one,
},
#endif
+ {
+ .procname = "active_inactive_ratio",
+ .data = &sysctl_active_inactive_ratio,
+ .maxlen = sizeof(sysctl_active_inactive_ratio),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },

/*
* NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c 2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c 2010-04-26 12:45:46.000000000 +0200
@@ -893,12 +893,15 @@
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
{
unsigned long active;
- unsigned long inactive;
+ unsigned long activetoprotect;

- inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+ activetoprotect = active
+ + mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE)
+ * sysctl_active_inactive_ratio;
+ activetoprotect = do_div(activetoprotect, 100);

- return (active > inactive);
+ return (active > activetoprotect);
}

unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c 2010-04-26 12:50:47.000000000 +0200
@@ -1459,14 +1459,25 @@
return low;
}

+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages within the active working set is going to
+ * be protected. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
static int inactive_file_is_low_global(struct zone *zone)
{
- unsigned long active, inactive;
+ unsigned long active, activetoprotect;

active = zone_page_state(zone, NR_ACTIVE_FILE);
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+ activetoprotect = zone_page_state(zone, NR_FILE)
+ * sysctl_active_inactive_ratio;
+ activetoprotect = do_div(activetoprotect, 100);
+
+ return (active > activetoprotect);

- return (active > inactive);
}

/**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h 2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@

extern void dump_page(struct page *page);

+extern int sysctl_active_inactive_ratio;
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
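
For illustration, a small userspace usage sketch (it only makes sense
with the RFC patch above applied; the values follow the example
configurations from the changelog):

#include <stdio.h>

int main(void)
{
	/* 0 = pre-56e49d21 behaviour, 50 = current kernel, per the changelog */
	FILE *f = fopen("/proc/sys/vm/active_inactive_ratio", "w");

	if (!f) {
		perror("active_inactive_ratio");
		return 1;
	}
	fprintf(f, "%d\n", 0);
	fclose(f);
	return 0;
}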

2010-04-26 11:59:35

by KOSAKI Motohiro

Subject: Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

Hi

I've quickly reviewed your patch, but unfortunately I can't give my
Reviewed-by.

> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
> From: Christian Ehrhardt <[email protected]>
>
> *updates in v2*
> - use do_div
>
> This patch creates a knob to help users that have workloads suffering from the
> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
> evict use-once pages first".
> It also provides the tuning mechanisms for other users that want an even bigger
> working set to be protected.

We certainly don't need a knob, because typical desktop users run various
applications and various workloads; the knob doesn't help them.

I've probably missed the previous discussion. I'm going to look for your
previous mail.

2010-04-26 12:43:33

by Christian Ehrhardt

Subject: Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2



KOSAKI Motohiro wrote:
> Hi
>
> I've quick reviewed your patch. but unfortunately I can't write my
> reviewed-by sign.

Not a problem, atm I'm happy about any review and comment :-)

>> Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2
>> From: Christian Ehrhardt <[email protected]>
>>
>> *updates in v2*
>> - use do_div
>>
>> This patch creates a knob to help users that have workloads suffering from the
>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
>> evict use-once pages first".
>> It also provides the tuning mechanisms for other users that want an even bigger
>> working set to be protected.
>
> We certainly need no knob. because typical desktop users use various
> application,
> various workload. then, the knob doesn't help them.

Briefly: we had discussed non-desktop scenarios, for example a daytime
load that builds up the working set to 50% and a nightly backup job which
is then unable to use that protected 50% when sequentially reading a lot
of disks, and because of that doesn't finish before morning.

The knob should help those people who know their system suffers from
this or similar cases, e.g. by letting them set the protected ratio
smaller or even to zero if wanted.

As mentioned before, being able to gain back those protected 50% would
be even better - if it can be done in a way not hurting the original
intention of protecting them.

I personally just don't feel good knowing that 50% of my memory might
hang around unused for many hours while it could be of some use.
I absolutely agree with the original intention and see how the patch
helped with the latency issue Elladan brought up in the past - but
protecting it "forever" just looks way too aggressive for some server
use cases.

> Probably, I've missed previous discussion. I'm going to find your previous mail.

The discussion ends at http://lkml.org/lkml/2010/4/22/38 - feel free to
click through it.

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

2010-04-26 14:21:38

by Rik van Riel

Subject: Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:

>>> This patch creates a knob to help users that have workloads suffering
>>> from the
>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>> vmscan:
>>> evict use-once pages first".
>>> It also provides the tuning mechanisms for other users that want an
>>> even bigger
>>> working set to be protected.
>>
>> We certainly need no knob. because typical desktop users use various
>> application,
>> various workload. then, the knob doesn't help them.
>
> Briefly - We had discussed non desktop scenarios where like a day load
> that builds up the working set to 50% and a nightly backup job which
> then is unable to use that protected 50% when sequentially reading a lot
> of disks and due to that doesn't finish before morning.

This is a red herring. A backup touches all of the
data once, so it does not need a lot of page cache
and will not "not finish before morning" due to the
working set being protected.

You're going to have to come up with a more realistic
scenario than that.

> I personally just don't feel too good knowing that 50% of my memory
> might hang around unused for many hours while they could be of some use.
> I absolutely agree with the old intention and see how the patch helped
> with the latency issue Elladan brought up in the past - but it just
> looks way too aggressive to protect it "forever" for some server use cases.

So far we have seen exactly one workload where it helps
to reduce the size of the active file list, and that is
not due to any need for caching more inactive pages.

On the contrary, it is because ALL OF THE INACTIVE PAGES
are in flight to disk, all under IO at the same time.

Caching has absolutely nothing to do with the regression
you ran into.

2010-04-27 14:01:14

by Christian Ehrhardt

Subject: Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2



Rik van Riel wrote:
> On 04/26/2010 08:43 AM, Christian Ehrhardt wrote:
>
>>>> This patch creates a knob to help users that have workloads suffering
>>>> from the
>>>> fix 1:1 active inactive ratio brought into the kernel by "56e49d21
>>>> vmscan:
>>>> evict use-once pages first".
>>>> It also provides the tuning mechanisms for other users that want an
>>>> even bigger
>>>> working set to be protected.
>>>
>>> We certainly need no knob. because typical desktop users use various
>>> application,
>>> various workload. then, the knob doesn't help them.
>>
>> Briefly - We had discussed non desktop scenarios where like a day load
>> that builds up the working set to 50% and a nightly backup job which
>> then is unable to use that protected 50% when sequentially reading a lot
>> of disks and due to that doesn't finish before morning.
>
> This is a red herring. A backup touches all of the
> data once, so it does not need a lot of page cache
> and will not "not finish before morning" due to the
> working set being protected.
>
> You're going to have to come up with a more realistic
> scenario than that.

I completely agree that a backup is a read-once case and therefore doesn't
benefit from caching itself, but you know my scenario from the thread
this patch emerged from:
"Parallel iozone sequential read - resembling the classic backup case
(read once + sequential)."

While caching doesn't help in the classic way - having data in the cache
ready for the next access - the page cache is still used transparently,
as the system reads ahead into it to assist the sequentially reading
process.
Yes, that doesn't happen with direct I/O, but unfortunately not all
backup tools use DIO. Additionally, not all backup jobs have a whole
night, and it can really be a decision maker whether you can pump out
your 100 TB main database in 10 or 20 minutes.

So here comes the problem: because of the 50% preserved, I assume the
system gets into trouble allocating that page cache memory in time - so
much so that it even slows down the load, i.e. long enough for the
application to completely consume the data already read and then still
have to wait.
More about that below.

Now IMHO this feels comparable to a classic backup job, and losing 60%
throughput (more than a Gb/s) seems neither red nor smells like fish to me.

>> I personally just don't feel too good knowing that 50% of my memory
>> might hang around unused for many hours while they could be of some use.
>> I absolutely agree with the old intention and see how the patch helped
>> with the latency issue Elladan brought up in the past - but it just
>> looks way too aggressive to protect it "forever" for some server use
>> cases.
>
> So far we have seen exactly one workload where it helps
> to reduce the size of the active file list, and that is
> not due to any need for caching more inactive pages.
>
> On the contrary, it is because ALL OF THE INACTIVE PAGES
> are in flight to disk, all under IO at the same time.

OK, this time I think I got your point much better - sorry for the
confusion. Discard my patch, but I'd really like to clarify and verify
your assumption in conjunction with my findings, and I'd be happy if you
could help me with that.

As mentioned, the case that suffers from the 50% of memory being protected
is the iozone read - so the pages would be "in flight FROM disk", but I
guess it doesn't matter whether it is from or to, right?

Effectively I have two read cases: one with caches dropped beforehand,
which then has almost all of memory available as page cache for the read,
and one with a few writes beforehand that fill up the protected 50%,
leaving the read case with only half of the memory for page cache.
Now, if I really got you right this time, the issue is caused by the
parallel readahead on all 16 disks creating so much I/O in flight that the
128M (= the 50% that are left) are not enough.
From the past we know that the time lost for the -60% throughput was spent
in a loop around direct_reclaim & congestion_wait trying to get memory for
the page cache reads - would you consider it possible that we now run into
a scenario splitting the memory like this?
- 50% active file protected
- a lot of the other half tied up by I/O that is currently in flight
  from the disk -> not free-able either?
- almost nothing left to free when allocating for the next read into page
  cache (allocations can only take pages above the low watermark) -> waiting

I updated my old counter patch, which I used to verify the old issue where
we spent so much time in full congestion_wait timeouts. Thanks to Mel this
was fixed (I have his watermark wait patch applied), but I assume that with
50% protected I now either run into the (shortened) wait more often or wait
longer for the watermark to be reached - still enough to be an issue (due
to 50% not being free-able).
See the patch inlined at the end of this mail for the details of what
exactly is counted and how.
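
With the patch applied the counters show up as files under /proc/sys/perf/
(that path follows from the sysctl table in the patch below). Just for
illustration, a small userspace helper that could be used to sample them
around a run - it is not part of the patch:

/* Illustrative only: read the debug counters exported by the patch below
 * under /proc/sys/perf/ and print them plus the average wait per event. */
#include <stdio.h>

static unsigned long read_counter(const char *name)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/perf/%s", name);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	static const char *names[] = {
		"perf_count_watermark_wait_duration",
		"perf_count_watermark_wait",
		"perf_count_pages_direct_reclaim",
		"perf_count_failed_pages_direct_reclaim",
		"perf_count_failed_pages_direct_reclaim_but_progress",
	};
	unsigned long waits, duration;
	int i;

	for (i = 0; i < 5; i++)
		printf("%-55s %lu\n", names[i], read_counter(names[i]));

	/* same calculation as the "AVG wait time" row in the table below */
	waits = read_counter("perf_count_watermark_wait");
	duration = read_counter("perf_count_watermark_wait_duration");
	if (waits)
		printf("average wait: %lu ns\n", duration / waits);
	return 0;
}

Sampling before and after each load and taking the difference would give
per-run numbers like those shown in the table below.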

As before the scenario is iozone on 16 disks in parallel with 1 iozone
child per disk.
I ran:
- write, write, write, read -> bad case
- drop cache, read -> good case
Read throughput still drops by ~60% comparing good to bad case.
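
For completeness, the "drop cache" step is just the usual sync plus writing
"3" to /proc/sys/vm/drop_caches; a trivial sketch of what runs between the
loads (equivalent to echoing 3 into that file manually):

/* Trivial sketch of the "drop cache" step between loads: flush dirty data
 * first, then ask the kernel to drop clean page cache, dentries and
 * inodes. */
#include <stdio.h>
#include <unistd.h>

static void drop_caches(void)
{
	FILE *f;

	sync(); /* drop_caches only drops clean pages, so flush first */
	f = fopen("/proc/sys/vm/drop_caches", "w");
	if (f) {
		fputs("3\n", f);
		fclose(f);
	}
}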
Here are the numbers I got for those two cases by my counters and
meminfo:

Value Initial state Write 1 Write 2 Write 3 Read after writes (bad) Read after DC (good)
watermark_wait_duration (ns) 0 9,902,333,643 12,288,444,574 24,197,098,221 317,175,021,553 35,002,926,894
watermark_wait 0 24102 26708 35285 29720 15515
pages_direct_reclaim 0 59195 65010 86777 90883 66672
failed_pages_direct_reclaim 0 24144 26768 35343 29733 15525
failed_pages_direct_reclaim_but_progress 0 24144 26768 35343 29733 15525

MemTotal: 248912 248912 248912 248912 248912 248912
MemFree: 185732 4868 5028 3780 3064 7136
Buffers: 536 33588 65660 84296 81868 32072
Cached: 9480 145252 111672 93736 98424 149724
Active: 11052 43920 76032 89084 87780 38024
Inactive: 6860 142628 108980 96528 100280 151572
Active(anon): 5092 4452 4428 4364 4516 4492
Inactive(anon): 6480 6608 6604 6604 6604 6604
Active(file): 5960 39468 71604 84720 83264 33532
Inactive(file): 380 136020 102376 89924 93676 144968
Unevictable: 3952 3952 3952 3952 3952 3952

Real time passed in seconds 48.83 49.38 50.35 40.62 22.61
AVG wait time (duration/count, ns) 410,851 460,104 685,762 10,672,107 2,256,070

=> In the bad read case the average wait is roughly 5 times longer than in
the good case (10,672,107 ns vs 2,256,070 ns), and the bad case runs into
the wait about twice as often (29,720 vs 15,515 waits).

These numbers seem to support my assumption that the 50% preserved leave
the system unable to find memory fast enough: the bad case runs about twice
as often into the wait after a direct_reclaim that made progress but still
found no free page, and then on average waits about 5 times longer until
enough is freed to reach the watermark and be woken up.


####

Eventually I'd also really like to fully understand why the active file
pages grow when I execute the same iozone write load three times. The runs
effectively write the same files in the same directories, and this is not a
journaling file system (the effect can be seen in the table above as well).

If a single one of these write runs used more than ~30M of active file
pages, those pages would be allocated and afterwards protected - but they
aren't. Yet after the second run I see ~60M of active file pages.
As mentioned before, I would assume that the second run either just reuses
what is in memory from the first run, or, if it really touches new data,
that this is the moment to throw the old data away.

Therefore I would expect the amount never to grow much beyond what the
first run established, as long as the runs essentially do the same thing.
Does someone already know, or have a good guess, what might be growing in
these buffers?
Is there a good interface to check what is buffered and protected at the
moment?
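
For now I am simply watching the split LRU sizes in /proc/meminfo between
the runs; a minimal sketch of that (it prints the same fields as in the
table above):

/* Minimal sketch: print the page cache / LRU split from /proc/meminfo,
 * i.e. the same fields shown in the table above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *keys[] = {
		"MemFree:", "Buffers:", "Cached:",
		"Active(file):", "Inactive(file):",
	};
	char line[128];
	FILE *f = fopen("/proc/meminfo", "r");
	size_t i;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
			if (!strncmp(line, keys[i], strlen(keys[i])))
				fputs(line, stdout);
	fclose(f);
	return 0;
}

The per-zone versions of the same counters (nr_active_file /
nr_inactive_file) should also be visible in /proc/zoneinfo, but that still
doesn't tell me *which* files those protected pages belong to.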

> Caching has absolutely nothing to do with the regression
> you ran into.

As mentioned above: not in the sense of "having it in the cache for another
fast access", agreed.
But maybe in the sense of "not getting memory for reads into the page cache
fast enough".

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


#### patch for the counters shown in table above ######
Subject: [PATCH][DEBUGONLY] mm: track allocation waits

From: Christian Ehrhardt <[email protected]>

This patch adds some debug counters to track how often a system runs into
waits after direct reclaim (happens in case of did_some_progress & !page)
and how much time it spends there waiting.

#for debugging only#

Signed-off-by: Christian Ehrhardt <[email protected]>
---

[diffstat]
include/linux/sysctl.h | 1
kernel/sysctl.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 17 ++++++++++++++
3 files changed, 75 insertions(+)

[diff]
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h
--- linux-2.6.32.11-0.3.99.6.626e022.orig/include/linux/sysctl.h 2010-04-27 12:01:54.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/include/linux/sysctl.h 2010-04-27 12:03:56.000000000 +0200
@@ -68,6 +68,7 @@
CTL_BUS=8, /* Busses */
CTL_ABI=9, /* Binary emulation */
CTL_CPU=10, /* CPU stuff (speed scaling, etc) */
+ CTL_PERF=11, /* Performance counters and timer sums for debugging */
CTL_XEN=123, /* Xen info and control */
CTL_ARLAN=254, /* arlan wireless driver */
CTL_S390DBF=5677, /* s390 debug */
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/kernel/sysctl.c 2010-04-27 14:26:04.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/kernel/sysctl.c 2010-04-27 15:44:54.000000000 +0200
@@ -183,6 +183,7 @@
.default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
};

+static struct ctl_table perf_table[];
static struct ctl_table kern_table[];
static struct ctl_table vm_table[];
static struct ctl_table fs_table[];
@@ -236,6 +237,13 @@
.mode = 0555,
.child = dev_table,
},
+ {
+ .ctl_name = CTL_PERF,
+ .procname = "perf",
+ .mode = 0555,
+ .child = perf_table,
+ },
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
@@ -254,6 +262,55 @@
static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
#endif

+extern unsigned long perf_count_watermark_wait;
+extern unsigned long perf_count_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim;
+extern unsigned long perf_count_failed_pages_direct_reclaim_but_progress;
+extern unsigned long perf_count_watermark_wait_duration;
+static struct ctl_table perf_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_watermark_wait_duration",
+ .data = &perf_count_watermark_wait_duration,
+ .mode = 0666,
+ .maxlen = sizeof(unsigned long),
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_watermark_wait",
+ .data = &perf_count_watermark_wait,
+ .mode = 0666,
+ .maxlen = sizeof(unsigned long),
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_pages_direct_reclaim",
+ .data = &perf_count_pages_direct_reclaim,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_failed_pages_direct_reclaim",
+ .data = &perf_count_failed_pages_direct_reclaim,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "perf_count_failed_pages_direct_reclaim_but_progress",
+ .data = &perf_count_failed_pages_direct_reclaim_but_progress,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0666,
+ .proc_handler = &proc_doulongvec_minmax,
+ },
+ { .ctl_name = 0 }
+};
+
static struct ctl_table kern_table[] = {
{
.ctl_name = CTL_UNNUMBERED,
diff -Naur linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c
--- linux-2.6.32.11-0.3.99.6.626e022.orig/mm/page_alloc.c 2010-04-27 12:01:55.000000000 +0200
+++ linux-2.6.32.11-0.3.99.6.626e022/mm/page_alloc.c 2010-04-27 14:06:40.000000000 +0200
@@ -191,6 +191,7 @@
wake_up_interruptible(&watermark_wq);
}

+unsigned long perf_count_watermark_wait = 0;
/**
* watermark_wait - Wait for watermark to go above low
* @timeout: Wait until watermark is reached or this timeout is reached
@@ -202,6 +203,7 @@
long ret;
DEFINE_WAIT(wait);

+ perf_count_watermark_wait++;
prepare_to_wait(&watermark_wq, &wait, TASK_INTERRUPTIBLE);

/*
@@ -1725,6 +1727,10 @@
return page;
}

+unsigned long perf_count_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim = 0;
+unsigned long perf_count_failed_pages_direct_reclaim_but_progress = 0;
+
/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1761,6 +1767,13 @@
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
migratetype);
+
+ perf_count_pages_direct_reclaim++;
+ if (!page)
+ perf_count_failed_pages_direct_reclaim++;
+ if (!page && *did_some_progress)
+ perf_count_failed_pages_direct_reclaim_but_progress++;
+
return page;
}

@@ -1841,6 +1854,7 @@
return alloc_flags;
}

+unsigned long perf_count_watermark_wait_duration = 0;
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1961,8 +1975,11 @@
/* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+ unsigned long t1;
/* Too much pressure, back off a bit at let reclaimers do work */
+ t1 = get_clock();
watermark_wait(HZ/50);
+ perf_count_watermark_wait_duration += ((get_clock() - t1) * 125) >> 9;
goto rebalance;
}