2010-08-26 15:14:40

by Mel Gorman

Subject: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

congestion_wait() is a bit stupid in that it goes to sleep even when there
is no congestion. This causes stalls in a number of situations and may be
partially responsible for bug reports about desktop interactivity.

This patch series aims to account for these unnecessary congestion_waits()
and to avoid going to sleep when there is no congestion. Patches 1 and 2
add instrumentation related to congestion which should be reusable by
alternative solutions to congestion_wait. Patch 3 calls cond_resched()
instead of going to sleep if there is no congestion.
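
For anyone who wants the shape of the change without reading the diffs, here
is a minimal sketch of what congestion_wait() looks like with the series
applied (condensed from patches 2 and 3, tracepoint omitted, so a sketch
rather than the exact code):

long congestion_wait(int sync, long timeout)
{
        long ret;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[sync];

        /* nr_bdi_congested is the counter introduced by patch 2 */
        if (atomic_read(&nr_bdi_congested[sync]) == 0) {
                /* no congested BDI: do not sleep, just yield if needed */
                cond_resched();
                return 0;
        }

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        ret = io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);
        return ret;
}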

Once again, I shoved this through a performance test. Unlike previous tests,
I ran this on a ported version of my usual test-suite that should be suitable
for release soon. It's not quite as good as my old set but it's sufficient
for this and related series. The tests I ran were kernbench, vmr-stream,
iozone, hackbench-sockets, hackbench-pipes, netperf-udp, netperf-tcp, sysbench
and stress-highalloc. Sysbench was a read/write test and stress-highalloc is
the usual test of how many high-order allocations can be made while the system
is under severe stress. The suite contains the necessary analysis scripts as
well and I'd release it now except the documentation blows.

x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)

The disks on all of them were single disks and not particularly fast.

Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
instrumentation and a second test with patch 3 applied.

In all cases, kernbench, hackbench, STREAM and iozone did not show any
performance difference because none of them were pressuring the system
enough to be calling congestion_wait(), so I won't post the results.
About all that is worth noting for them is that nothing horrible appeared to break.

In the analysis scripts, I record an unnecessary sleep as a sleep that
happened when there was no congestion. The post-processing scripts for
cond_resched() will only count an uncongested call to congestion_wait() as
unnecessary if the process actually gets scheduled. Ordinarily, we'd expect
it to continue uninterrupted.
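
For reference, the tracepoint added in patch 2 emits one line per
congestion_wait() call in roughly this form (values made up purely for
illustration), which is what the scripts aggregate:

        writeback_congest_waited: usec_delayed=100004 unnecessary=1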

One vague concern I have is that when too many pages are isolated, we call
congestion_wait(). That loop could now actively spin for its quantum before
cond_resched() yields. If it's called with no congestion, it's hard to know
what the proper thing to do there is.
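
For reference, the loop I'm thinking of is the one guarding reclaim in
mm/vmscan.c (paraphrased below, not the exact code; mm/compaction.c has an
equivalent); with patch 3 the congestion_wait() no longer sleeps when
nothing is congested:

        /* paraphrased from shrink_inactive_list() */
        while (unlikely(too_many_isolated(zone, file, sc))) {
                /* with no congestion this now returns almost immediately */
                congestion_wait(BLK_RW_ASYNC, HZ/10);

                /* bails out if a fatal signal is pending */
        }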

X86
Sysbench on this machine did not stress the system enough to call
congestion_wait() so I'll just discuss the stress-highalloc test. This is
the full report from the test suite.

STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 70.00 ( 0.00%) 72.00 ( 2.00%)
Pass 2 72.00 ( 0.00%) 72.00 ( 0.00%)
At Rest 74.00 ( 0.00%) 73.00 (-1.00%)

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 409 755
Direct reclaim pages scanned 185585 212524
Direct reclaim write file async I/O 442 554
Direct reclaim write anon async I/O 31789 27074
Direct reclaim write file sync I/O 17 23
Direct reclaim write anon sync I/O 17825 15013
Wake kswapd requests 895 1274
Kswapd wakeups 387 432
Kswapd pages scanned 16373859 12892992
Kswapd reclaim write file async I/O 29267 18188
Kswapd reclaim write anon async I/O 1243386 1080234
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 4479.04 3446.81
Time kswapd awake (seconds) 2229.99 1218.52

Total pages scanned 16559444 13105516
%age total pages scanned/written 7.99% 8.71%
%age file pages scanned/written 0.18% 0.14%
Percentage Time Spent Direct Reclaim 74.99% 69.54%
Percentage Time kswapd Awake 41.78% 28.57%

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 474 38
Direct number schedule waited 0 9478
Direct time congest waited 21564ms 3732ms
Direct time schedule waited 0ms 4ms
Direct unnecessary wait 434 1
KSwapd number congest waited 68 0
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 5424ms 0ms
Kswapd unnecessary wait 44 0

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1493.97 1509.88
Total Elapsed Time (seconds) 5337.71 4265.07

Allocations under stress were slightly better but by and large there is no
significant difference in success rates. The test completed 1072 seconds
faster which is a pretty decent speedup.

Scanning rates in reclaim were higher but that is somewhat expected because
we weren't going to sleep as much. Time stalled in reclaim for both direct
and kswapd was reduced which is pretty significant.

In terms of congestion_wait, the time spent asleep was massively reduced:
by 17 seconds for direct reclaim and 5 seconds for kswapd. cond_resched()
is called a number of times instead, of course, but the time spent being
scheduled was a mere 4ms. Overall, this looked positive.

X86-64
Sysbench again wasn't generating enough pressure so here is the high-order
allocation stress test.

STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 69.00 ( 0.00%) 73.00 ( 4.00%)
Pass 2 71.00 ( 0.00%) 74.00 ( 3.00%)
At Rest 72.00 ( 0.00%) 75.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 646 1091
Direct reclaim pages scanned 94779 102392
Direct reclaim write file async I/O 164 216
Direct reclaim write anon async I/O 12162 15413
Direct reclaim write file sync I/O 64 45
Direct reclaim write anon sync I/O 5366 6987
Wake kswapd requests 3950 3912
Kswapd wakeups 613 579
Kswapd pages scanned 7544412 7267203
Kswapd reclaim write file async I/O 14660 16256
Kswapd reclaim write anon async I/O 964824 1065445
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 3279.00 3564.59
Time kswapd awake (seconds) 1445.70 1870.70

Total pages scanned 7639191 7369595
%age total pages scanned/written 13.05% 14.99%
%age file pages scanned/written 0.19% 0.22%
Percentage Time Spent Direct Reclaim 70.48% 72.04%
Percentage Time kswapd Awake 35.62% 42.94%

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 801 97
Direct number schedule waited 0 16079
Direct time congest waited 37448ms 9004ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 696 0
KSwapd number congest waited 10 1
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 900ms 100ms
Kswapd unnecessary wait 6 0

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1373.11 1383.7
Total Elapsed Time (seconds) 4058.33 4356.47

Success rates were slightly higher again, not by a massive amount but some.
Time to complete the test was unfortunately increased slightly though and
I'm not sure where that is coming from. The increased number of successful
allocations would account for some of that because the system is under
greater memory pressure as a result of the allocations.

Scanning rates are comparable. Writing back files from reclaim was slightly
increased, which I believe is due to less time being spent asleep so there
was a smaller window for the flusher threads to do their work. Reducing
that is the responsibility of another series.

Again, the time spent asleep in congestion_wait() is reduced by a large
amount - 28 seconds for direct reclaim - and none of the cond_resched()
calls resulted in sleep time.

Overall, this seems reasonable.

PPC64
Unlike the other two machines, sysbench called congestion_wait() a few times
so here are the full results for sysbench.

SYSBENCH
sysbench-traceonly-v1r1-sysbench sysbench-nocongest-v1r1-sysbench
traceonly-v1r1 nocongest-v1r1
1 5307.36 ( 0.00%) 5349.58 ( 0.79%)
2 9886.45 ( 0.00%) 10274.78 ( 3.78%)
3 14165.01 ( 0.00%) 14210.64 ( 0.32%)
4 16239.12 ( 0.00%) 16201.46 (-0.23%)
5 15337.09 ( 0.00%) 15541.56 ( 1.32%)
6 14763.64 ( 0.00%) 15805.80 ( 6.59%)
7 14216.69 ( 0.00%) 15023.57 ( 5.37%)
8 13749.62 ( 0.00%) 14492.34 ( 5.12%)
9 13647.75 ( 0.00%) 13969.77 ( 2.31%)
10 13275.70 ( 0.00%) 13495.08 ( 1.63%)
11 13324.91 ( 0.00%) 12879.81 (-3.46%)
12 13169.23 ( 0.00%) 12967.36 (-1.56%)
13 12896.20 ( 0.00%) 12981.43 ( 0.66%)
14 12793.44 ( 0.00%) 12768.26 (-0.20%)
15 12627.98 ( 0.00%) 12522.86 (-0.84%)
16 12228.54 ( 0.00%) 12352.07 ( 1.00%)
FTrace Reclaim Statistics: vmscan
sysbench-traceonly-v1r1-sysbench sysbench-nocongest-v1r1-sysbench
traceonly-v1r1 nocongest-v1r1
Direct reclaims 0 0
Direct reclaim pages scanned 0 0
Direct reclaim write file async I/O 0 0
Direct reclaim write anon async I/O 0 0
Direct reclaim write file sync I/O 0 0
Direct reclaim write anon sync I/O 0 0
Wake kswapd requests 0 0
Kswapd wakeups 202 194
Kswapd pages scanned 5990987 5618709
Kswapd reclaim write file async I/O 24 16
Kswapd reclaim write anon async I/O 1509 1564
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 0.00 0.00
Time kswapd awake (seconds) 174.23 152.17

Total pages scanned 5990987 5618709
%age total pages scanned/written 0.03% 0.03%
%age file pages scanned/written 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.00% 0.00%
Percentage Time kswapd Awake 2.80% 2.60%

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0
Direct number schedule waited 0 0
Direct time congest waited 0ms 0ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 0 0
KSwapd number congest waited 10 3
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 800ms 300ms
Kswapd unnecessary wait 6 0

Performance is improved by a decent margin although I didn't check whether
it was statistically significant or not. The time kswapd spent asleep was
slightly reduced.

STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 40.00 ( 0.00%) 35.00 (-5.00%)
Pass 2 50.00 ( 0.00%) 45.00 (-5.00%)
At Rest 61.00 ( 0.00%) 64.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 166 926
Direct reclaim pages scanned 167920 183644
Direct reclaim write file async I/O 391 412
Direct reclaim write anon async I/O 31563 31986
Direct reclaim write file sync I/O 54 52
Direct reclaim write anon sync I/O 21696 17087
Wake kswapd requests 123 128
Kswapd wakeups 143 143
Kswapd pages scanned 3899414 4229450
Kswapd reclaim write file async I/O 12392 13098
Kswapd reclaim write anon async I/O 673260 709817
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 1595.13 1692.18
Time kswapd awake (seconds) 1114.00 1210.48

Total pages scanned 4067334 4413094
%age total pages scanned/written 18.18% 17.50%
%age file pages scanned/written 0.32% 0.31%
Percentage Time Spent Direct Reclaim 45.89% 47.50%
Percentage Time kswapd Awake 46.09% 48.04%

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 233 16
Direct number schedule waited 0 1323
Direct time congest waited 10164ms 1600ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 218 0
KSwapd number congest waited 11 13
KSwapd number schedule waited 0 3
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 1100ms 1244ms
Kswapd unnecessary wait 0 0

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1880.56 1870.36
Total Elapsed Time (seconds) 2417.17 2519.51

Allocation success rates are slightly down but high-order allocations are
always very difficult on PPC64 and I have other ideas on how the success
rates could be improved.

What is more important is that again the time spent asleep due to
congestion_wait() was reduced for direct reclaimers.

The results here aren't as positive as the other two machines but they
still seem acceptable.

Broadly speaking, I think sleeping in congestion_wait() has been responsible
for some bugs related to stalls under large IO, particularly read IO, so we
need to do something about it. These tests seem overall positive but it'd be
interesting to hear whether someone with a workload that stalls unnecessarily
in congestion_wait() is helped by this patch. Desktop interactivity would be
harder to test because I think it has multiple root causes of which
congestion_wait is just one. I've included Christian Ehrhardt in the cc
because he had a bug back in April that was rooted in congestion_wait() which
I think this might help, and hopefully he can provide hard data for a
workload with lots of IO but constrained memory. I cc'd Johannes because we
were discussing congestion_wait() at LSF/MM and he might have some thoughts,
and I think I was talking to Jan briefly about congestion_wait() as well. As
this affects writeback, Wu and fsdevel might have some opinions.

include/trace/events/writeback.h | 22 ++++++++++++++++++++++
mm/backing-dev.c | 31 ++++++++++++++++++++++++++-----
2 files changed, 48 insertions(+), 5 deletions(-)


2010-08-26 15:14:23

by Mel Gorman

Subject: [PATCH 2/3] writeback: Record if the congestion was unnecessary

If congestion_wait() is called when there is no congestion, the caller
will wait for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers but this patch measures how serious the
problem is. It keeps count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint will
record that the wait was unnecessary.

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/writeback.h | 11 ++++++++---
mm/backing-dev.c | 15 ++++++++++++---
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);

TRACE_EVENT(writeback_congest_waited,

- TP_PROTO(unsigned int usec_delayed),
+ TP_PROTO(unsigned int usec_delayed, bool unnecessary),

- TP_ARGS(usec_delayed),
+ TP_ARGS(usec_delayed, unnecessary),

TP_STRUCT__entry(
__field( unsigned int, usec_delayed )
+ __field( unsigned int, unnecessary )
),

TP_fast_assign(
__entry->usec_delayed = usec_delayed;
+ __entry->unnecessary = unnecessary;
),

- TP_printk("usec_delayed=%u", __entry->usec_delayed)
+ TP_printk("usec_delayed=%u unnecessary=%d",
+ __entry->usec_delayed,
+ __entry->unnecessary
+ )
);

#endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+static atomic_t nr_bdi_congested[2];

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
wait_queue_head_t *wqh = &congestion_wqh[sync];

bit = sync ? BDI_sync_congested : BDI_async_congested;
- clear_bit(bit, &bdi->state);
+ if (test_and_clear_bit(bit, &bdi->state))
+ atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
enum bdi_state bit;

bit = sync ? BDI_sync_congested : BDI_async_congested;
- set_bit(bit, &bdi->state);
+ if (!test_and_set_bit(bit, &bdi->state))
+ atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);

@@ -760,14 +763,20 @@ long congestion_wait(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
+ bool unnecessary = false;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];

+ /* Check if this call to congestion_wait was necessary */
+ if (atomic_read(&nr_bdi_congested[sync]) == 0)
+ unnecessary = true;
+
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);

- trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+ trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
+ unnecessary);

return ret;
}
--
1.7.1

2010-08-26 15:14:20

by Mel Gorman

Subject: [PATCH 1/3] writeback: Account for time spent congestion_waited

There is strong evidence to indicate a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a
tracepoint for congestion_wait to record when congestion_wait() occurred
and how long was spent.

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/writeback.h | 17 +++++++++++++++++
mm/backing-dev.c | 4 ++++
2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..e3bee61 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,23 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);

+TRACE_EVENT(writeback_congest_waited,
+
+ TP_PROTO(unsigned int usec_delayed),
+
+ TP_ARGS(usec_delayed),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, usec_delayed )
+ ),
+
+ TP_fast_assign(
+ __entry->usec_delayed = usec_delayed;
+ ),
+
+ TP_printk("usec_delayed=%u", __entry->usec_delayed)
+);
+
#endif /* _TRACE_WRITEBACK_H */

/* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..7ae33e2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,16 @@ EXPORT_SYMBOL(set_bdi_congested);
long congestion_wait(int sync, long timeout)
{
long ret;
+ unsigned long start = jiffies;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];

prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
+
+ trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+
return ret;
}
EXPORT_SYMBOL(congestion_wait);
--
1.7.1

2010-08-26 15:14:43

by Mel Gorman

Subject: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

If congestion_wait() is called with no BDIs congested, the caller will
sleep for the full timeout and this is an unnecessary sleep. This patch
checks if there are BDIs congested. If so, it goes to sleep as normal.
If not, it calls cond_resched() to ensure the caller is not hogging the
CPU longer than its quota but otherwise will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/backing-dev.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a49167f..6abe860 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];

- /* Check if this call to congestion_wait was necessary */
- if (atomic_read(&nr_bdi_congested[sync]) == 0)
+ /*
+ * If there is no congestion, there is no point sleeping on the queue.
+ * This call was unnecessary but in case we are spinning due to a bad
+ * caller, at least call cond_resched() and sleep if our CPU quota
+ * has expired
+ */
+ if (atomic_read(&nr_bdi_congested[sync]) == 0) {
unnecessary = true;
-
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
+ cond_resched();
+ ret = 0;
+ } else {
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = io_schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ }

trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
unnecessary);
--
1.7.1

2010-08-26 17:20:49

by Minchan Kim

Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> congestion_wait() is a bit stupid in that it goes to sleep even when there
> is no congestion. This causes stalls in a number of situations and may be
> partially responsible for bug reports about desktop interactivity.
>
> This patch series aims to account for these unnecessary congestion_waits()
> and to avoid going to sleep when there is no congestion available. Patches
> 1 and 2 add instrumentation related to congestion which should be reuable
> by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> instead of going to sleep if there is no congestion.
>
> Once again, I shoved this through performance test. Unlike previous tests,
> I ran this on a ported version of my usual test-suite that should be suitable
> for release soon. It's not quite as good as my old set but it's sufficient
> for this and related series. The tests I ran were kernbench vmr-stream
> iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> the usual stress the number of high order allocations that can be made while
> the system is under severe stress. The suite contains the necessary analysis
> scripts as well and I'd release it now except the documentation blows.
>
> x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
>
> The disks on all of them were single disks and not particularly fast.
>
> Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> instrumentation and a second test with patch 3 applied.
>
> In all cases, kernbench, hackbench, STREAM and iozone did not show any
> performance difference because none of them were pressuring the system
> enough to be calling congestion_wait() so I won't post the results.
> About all worth noting for them is that nothing horrible appeared to break.
>
> In the analysis scripts, I record unnecessary sleeps to be a sleep that
> had no congestion. The post-processing scripts for cond_resched() will only
> count an uncongested call to congestion_wait() as unnecessary if the process
> actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
>
> One vague concern I have is when too many pages are isolated, we call
> congestion_wait(). This could now actively spin in the loop for its quanta
> before calling cond_resched(). If it's calling with no congestion, it's
> hard to know what the proper thing to do there is.

Suddenly, many processes could enter the direct reclaim path for another
reason (e.g. a fork bomb) regardless of congestion. Backing dev congestion is
just one of them.

I think that if congestion_wait returns without calling io_schedule_timeout
because of your patch, too_many_isolated can schedule_timeout to wait for the
system to calm down, preventing OOM killing.

How about this?

If you don't mind, I will send the patch based on this patch series after
your patches settle down, or could you add this to your patch series?
But I admit this hardly affects your experiment.

From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Fri, 27 Aug 2010 02:06:45 +0900
Subject: [PATCH] Wait regardless of congestion if too many pages are isolated

Suddenly, many processes could enter the direct reclaim path
regardless of congestion. Backing dev congestion is just one of them.
But the current implementation calls congestion_wait if too many pages are isolated.

If congestion_wait returns without calling io_schedule_timeout,
too_many_isolated can schedule_timeout to wait for the system to calm down,
preventing OOM killing.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/backing-dev.c | 5 ++---
mm/compaction.c | 6 +++++-
mm/vmscan.c | 6 +++++-
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6abe860..9431bca 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
- * write congestion. If no backing_devs are congested then just wait for the
- * next write to be completed.
+ * write congestion. If no backing_devs are congested then just returns.
*/
long congestion_wait(int sync, long timeout)
{
@@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
if (atomic_read(&nr_bdi_congested[sync]) == 0) {
unnecessary = true;
cond_resched();
- ret = 0;
+ ret = timeout;
} else {
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/compaction.c b/mm/compaction.c
index 94cce51..7370683 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(zone))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }

if (fatal_signal_pending(current))
return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3109ff7..f5e3e28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_dirty;
while (unlikely(too_many_isolated(zone, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
1.7.0.5


--
Kind regards,
Minchan Kim

2010-08-26 17:23:48

by Minchan Kim

Subject: Re: [PATCH 1/3] writeback: Account for time spent congestion_waited

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

I think adding the tracepoint is enough until this issue is solved, at least.

--
Kind regards,
Minchan Kim

2010-08-26 17:32:05

by Mel Gorman

Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> > congestion_wait() is a bit stupid in that it goes to sleep even when there
> > is no congestion. This causes stalls in a number of situations and may be
> > partially responsible for bug reports about desktop interactivity.
> >
> > This patch series aims to account for these unnecessary congestion_waits()
> > and to avoid going to sleep when there is no congestion available. Patches
> > 1 and 2 add instrumentation related to congestion which should be reuable
> > by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> > instead of going to sleep if there is no congestion.
> >
> > Once again, I shoved this through performance test. Unlike previous tests,
> > I ran this on a ported version of my usual test-suite that should be suitable
> > for release soon. It's not quite as good as my old set but it's sufficient
> > for this and related series. The tests I ran were kernbench vmr-stream
> > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> > the usual stress the number of high order allocations that can be made while
> > the system is under severe stress. The suite contains the necessary analysis
> > scripts as well and I'd release it now except the documentation blows.
> >
> > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
> >
> > The disks on all of them were single disks and not particularly fast.
> >
> > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> > instrumentation and a second test with patch 3 applied.
> >
> > In all cases, kernbench, hackbench, STREAM and iozone did not show any
> > performance difference because none of them were pressuring the system
> > enough to be calling congestion_wait() so I won't post the results.
> > About all worth noting for them is that nothing horrible appeared to break.
> >
> > In the analysis scripts, I record unnecessary sleeps to be a sleep that
> > had no congestion. The post-processing scripts for cond_resched() will only
> > count an uncongested call to congestion_wait() as unnecessary if the process
> > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
> >
> > One vague concern I have is when too many pages are isolated, we call
> > congestion_wait(). This could now actively spin in the loop for its quanta
> > before calling cond_resched(). If it's calling with no congestion, it's
> > hard to know what the proper thing to do there is.
>
> Suddenly, many processes could enter into the direct reclaim path by another
> reason(ex, fork bomb) regradless of congestion. backing dev congestion is
> just one of them.
>

This situation applies with or without this series, right?

> I think if congestion_wait returns without calling io_schedule_timeout
> by your patch, too_many_isolated can schedule_timeout to wait for the system's
> calm to preventing OOM killing.
>

More likely, to stop a loop in too_many_isolated() consuming CPU time it
can do nothing with.

> How about this?
>
> If you don't mind, I will send the patch based on this patch series
> after your patch settle down or Could you add this to your patch series?
> But I admit this doesn't almost affect your experiment.
>

I think it's a related topic so could belong with the series.

> From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Fri, 27 Aug 2010 02:06:45 +0900
> Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
>
> Suddenly, many processes could enter into the direct reclaim path
> regradless of congestion. backing dev congestion is just one of them.
> But current implementation calls congestion_wait if too many pages are isolated.
>
> if congestion_wait returns without calling io_schedule_timeout,
> too_many_isolated can schedule_timeout to wait for the system's calm
> to preventing OOM killing.
>

I think the reasoning here might be a little off. How about;

If many processes enter direct reclaim or memory compaction, too many pages
can get isolated. In this situation, too_many_isolated() can call
congestion_wait() but if there is no congestion, it fails to go to sleep
and instead spins until its quota expires.

This patch checks if congestion_wait() returned without sleeping. If it
did because there was no congestion, it unconditionally goes to sleep
instead of hogging the CPU.

> Signed-off-by: Minchan Kim <[email protected]>
> ---
> mm/backing-dev.c | 5 ++---
> mm/compaction.c | 6 +++++-
> mm/vmscan.c | 6 +++++-
> 3 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6abe860..9431bca 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
> * @timeout: timeout in jiffies
> *
> * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> - * write congestion. If no backing_devs are congested then just wait for the
> - * next write to be completed.
> + * write congestion. If no backing_devs are congested then just returns.
> */
> long congestion_wait(int sync, long timeout)
> {
> @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
> if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> unnecessary = true;
> cond_resched();
> - ret = 0;
> + ret = timeout;
> } else {
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 94cce51..7370683 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>

We don't really need the timeout variable here but I see what you are
getting at. It's unfortunate to just go to sleep for HZ/10 but if there is
no congestion, we do not have any other event to wake up on at the moment.
We'd have to introduce a too_many_isolated waitqueue that is kicked if
pages are put back on the LRU.

This is better than spinning though.
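
For illustration only, the sort of thing I mean would be along these lines
(completely untested sketch, names invented here, not part of either series):

        /* hypothetical waitqueue kicked when isolated pages are put back */
        static DECLARE_WAIT_QUEUE_HEAD(too_many_isolated_wqh);

        /* on the putback side, e.g. in putback_lru_page() */
        if (waitqueue_active(&too_many_isolated_wqh))
                wake_up(&too_many_isolated_wqh);

        /* and in shrink_inactive_list() instead of congestion_wait() */
        wait_event_timeout(too_many_isolated_wqh,
                        !too_many_isolated(zone, file, sc), HZ/10);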

> if (fatal_signal_pending(current))
> return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3109ff7..f5e3e28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> unsigned long nr_dirty;
> while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))

This seems very reasonable. I'll review it more carefully tomorrow and if I
spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting
the too_many_isolated() case but I cannot think of a better alternative
without adding more waitqueues.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-26 17:35:44

by Minchan Kim

Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/trace/events/writeback.h | 11 ++++++++---
> mm/backing-dev.c | 15 ++++++++++++---
> 2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index e3bee61..03bb04b 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
>
> TRACE_EVENT(writeback_congest_waited,
>
> - TP_PROTO(unsigned int usec_delayed),
> + TP_PROTO(unsigned int usec_delayed, bool unnecessary),
>
> - TP_ARGS(usec_delayed),
> + TP_ARGS(usec_delayed, unnecessary),
>
> TP_STRUCT__entry(
> __field( unsigned int, usec_delayed )
> + __field( unsigned int, unnecessary )
> ),
>
> TP_fast_assign(
> __entry->usec_delayed = usec_delayed;
> + __entry->unnecessary = unnecessary;
> ),
>
> - TP_printk("usec_delayed=%u", __entry->usec_delayed)
> + TP_printk("usec_delayed=%u unnecessary=%d",
> + __entry->usec_delayed,
> + __entry->unnecessary
> + )
> );
>
> #endif /* _TRACE_WRITEBACK_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7ae33e2..a49167f 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);

Hmm.. congestion_wait's semantics are "wait for _a_ backing_dev to become uncongested"
but this seems to consider the whole set of backing devs. Is that your intention, or am I missing something?

--
Kind regards,
Minchan Kim

2010-08-26 17:38:52

by Minchan Kim

Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep. This patch
> checks if there are BDIs congested. If so, it goes to sleep as normal.
> If not, it calls cond_resched() to ensure the caller is not hogging the
> CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/backing-dev.c | 20 ++++++++++++++------
> 1 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index a49167f..6abe860 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c

The function's description should be changed since we don't wait for the next write any more.

> @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> DEFINE_WAIT(wait);
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> - /* Check if this call to congestion_wait was necessary */
> - if (atomic_read(&nr_bdi_congested[sync]) == 0)
> + /*
> + * If there is no congestion, there is no point sleeping on the queue.
> + * This call was unecessary but in case we are spinning due to a bad
> + * caller, at least call cond_reched() and sleep if our CPU quota
> + * has expired
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> unnecessary = true;
> -
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> - ret = io_schedule_timeout(timeout);
> - finish_wait(wqh, &wait);
> + cond_resched();
> + ret = 0;

"ret = timeout" is more proper as considering io_schedule_timeout's return value.

> + } else {
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + ret = io_schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + }
>
> trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> unnecessary);
> --
> 1.7.1
>

--
Kind regards,
Minchan Kim

2010-08-26 17:41:23

by Mel Gorman

Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Fri, Aug 27, 2010 at 02:35:34AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/trace/events/writeback.h | 11 ++++++++---
> > mm/backing-dev.c | 15 ++++++++++++---
> > 2 files changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index e3bee61..03bb04b 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
> >
> > TRACE_EVENT(writeback_congest_waited,
> >
> > - TP_PROTO(unsigned int usec_delayed),
> > + TP_PROTO(unsigned int usec_delayed, bool unnecessary),
> >
> > - TP_ARGS(usec_delayed),
> > + TP_ARGS(usec_delayed, unnecessary),
> >
> > TP_STRUCT__entry(
> > __field( unsigned int, usec_delayed )
> > + __field( unsigned int, unnecessary )
> > ),
> >
> > TP_fast_assign(
> > __entry->usec_delayed = usec_delayed;
> > + __entry->unnecessary = unnecessary;
> > ),
> >
> > - TP_printk("usec_delayed=%u", __entry->usec_delayed)
> > + TP_printk("usec_delayed=%u unnecessary=%d",
> > + __entry->usec_delayed,
> > + __entry->unnecessary
> > + )
> > );
> >
> > #endif /* _TRACE_WRITEBACK_H */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 7ae33e2..a49167f 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> > };
> > +static atomic_t nr_bdi_congested[2];
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - clear_bit(bit, &bdi->state);
> > + if (test_and_clear_bit(bit, &bdi->state))
> > + atomic_dec(&nr_bdi_congested[sync]);
>
> Hmm.. Now congestion_wait's semantics "wait for _a_ backing_dev to become uncongested"
> But this seems to consider whole backing dev. Is your intention? or Am I missing now?
>

Not whole backing devs, all backing devs. This is intentional.

If congestion_wait() is called with 0 BDIs congested, we sleep the full timeout
because a wakeup event will not occur - this is a bad scenario. To know if
0 BDIs were congested, one could either walk all the BDIs checking their
status or maintain a counter like nr_bdi_congested which is what I decided on.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-26 17:43:01

by Mel Gorman

Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep. This patch
> > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > If not, it calls cond_resched() to ensure the caller is not hogging the
> > CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/backing-dev.c | 20 ++++++++++++++------
> > 1 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index a49167f..6abe860 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
>
> Function's decripton should be changed since we don't wait next write any more.
>

My bad. I need to check that "next write" thing. It doesn't appear to be
happening but maybe that side of things just broke somewhere in the
distant past. I lack context of how this is meant to work so maybe
someone will educate me.

> > @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> > DEFINE_WAIT(wait);
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > - /* Check if this call to congestion_wait was necessary */
> > - if (atomic_read(&nr_bdi_congested[sync]) == 0)
> > + /*
> > + * If there is no congestion, there is no point sleeping on the queue.
> > + * This call was unecessary but in case we are spinning due to a bad
> > + * caller, at least call cond_reched() and sleep if our CPU quota
> > + * has expired
> > + */
> > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > unnecessary = true;
> > -
> > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > - ret = io_schedule_timeout(timeout);
> > - finish_wait(wqh, &wait);
> > + cond_resched();
> > + ret = 0;
>
> "ret = timeout" is more proper as considering io_schedule_timeout's return value.
>

Good point, will fix.

> > + } else {
> > + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > + ret = io_schedule_timeout(timeout);
> > + finish_wait(wqh, &wait);
> > + }
> >
> > trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> > unnecessary);

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-26 17:50:42

by Minchan Kim

Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Thu, Aug 26, 2010 at 06:31:47PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> > > congestion_wait() is a bit stupid in that it goes to sleep even when there
> > > is no congestion. This causes stalls in a number of situations and may be
> > > partially responsible for bug reports about desktop interactivity.
> > >
> > > This patch series aims to account for these unnecessary congestion_waits()
> > > and to avoid going to sleep when there is no congestion available. Patches
> > > 1 and 2 add instrumentation related to congestion which should be reuable
> > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> > > instead of going to sleep if there is no congestion.
> > >
> > > Once again, I shoved this through performance test. Unlike previous tests,
> > > I ran this on a ported version of my usual test-suite that should be suitable
> > > for release soon. It's not quite as good as my old set but it's sufficient
> > > for this and related series. The tests I ran were kernbench vmr-stream
> > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> > > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> > > the usual stress the number of high order allocations that can be made while
> > > the system is under severe stress. The suite contains the necessary analysis
> > > scripts as well and I'd release it now except the documentation blows.
> > >
> > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
> > >
> > > The disks on all of them were single disks and not particularly fast.
> > >
> > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> > > instrumentation and a second test with patch 3 applied.
> > >
> > > In all cases, kernbench, hackbench, STREAM and iozone did not show any
> > > performance difference because none of them were pressuring the system
> > > enough to be calling congestion_wait() so I won't post the results.
> > > About all worth noting for them is that nothing horrible appeared to break.
> > >
> > > In the analysis scripts, I record unnecessary sleeps to be a sleep that
> > > had no congestion. The post-processing scripts for cond_resched() will only
> > > count an uncongested call to congestion_wait() as unnecessary if the process
> > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
> > >
> > > One vague concern I have is when too many pages are isolated, we call
> > > congestion_wait(). This could now actively spin in the loop for its quanta
> > > before calling cond_resched(). If it's calling with no congestion, it's
> > > hard to know what the proper thing to do there is.
> >
> > Suddenly, many processes could enter into the direct reclaim path by another
> > reason(ex, fork bomb) regradless of congestion. backing dev congestion is
> > just one of them.
> >
>
> This situation applys with or without this series, right?

I think the situation applies with this series. That's because the old
behaviour called schedule regardless of I/O congestion, as seen with
io_schedule_timeout(). But you are changing it now to call it conditionally.

>
> > I think if congestion_wait returns without calling io_schedule_timeout
> > by your patch, too_many_isolated can schedule_timeout to wait for the system's
> > calm to preventing OOM killing.
> >
>
> More likely, to stop a loop in too_many_isolated() consuming CPU time it
> can do nothing with.
>
> > How about this?
> >
> > If you don't mind, I will send the patch based on this patch series
> > after your patch settle down or Could you add this to your patch series?
> > But I admit this doesn't almost affect your experiment.
> >
>
> I think it's a related topic so could belong with the series.
>
> > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Fri, 27 Aug 2010 02:06:45 +0900
> > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
> >
> > Suddenly, many processes could enter into the direct reclaim path
> > regradless of congestion. backing dev congestion is just one of them.
> > But current implementation calls congestion_wait if too many pages are isolated.
> >
> > if congestion_wait returns without calling io_schedule_timeout,
> > too_many_isolated can schedule_timeout to wait for the system's calm
> > to preventing OOM killing.
> >
>
> I think the reasoning here might be a little off. How about;
>
> If many processes enter direct reclaim or memory compaction, too many pages
> can get isolated. In this situation, too_many_isolated() can call
> congestion_wait() but if there is no congestion, it fails to go to sleep
> and instead spins until it's quota expires.
>
> This patch checks if congestion_wait() returned without sleeping. If it
> did because there was no congestion, it unconditionally goes to sleep
> instead of hogging the CPU.

That's good to me. :)

>
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > mm/backing-dev.c | 5 ++---
> > mm/compaction.c | 6 +++++-
> > mm/vmscan.c | 6 +++++-
> > 3 files changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 6abe860..9431bca 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
> > * @timeout: timeout in jiffies
> > *
> > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > - * write congestion. If no backing_devs are congested then just wait for the
> > - * next write to be completed.
> > + * write congestion. If no backing_devs are congested then just returns.
> > */
> > long congestion_wait(int sync, long timeout)
> > {
> > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
> > if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > unnecessary = true;
> > cond_resched();
> > - ret = 0;
> > + ret = timeout;
> > } else {
> > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > ret = io_schedule_timeout(timeout);
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 94cce51..7370683 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> > * delay for some time until fewer pages are isolated
> > */
> > while (unlikely(too_many_isolated(zone))) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + long timeout = HZ/10;
> > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + schedule_timeout(timeout);
> > + }
> >
>
> We don't really need the timeout variable here but I see what you are
> at. It's unfortunate to just go to sleep for HZ/10 but if it's not
> congestion, we do not have any other event to wake up on at the moment.
> We'd have to introduce a too_many_isolated waitqueue that is kicked if
> pages are put back on the LRU.

I thought of that at first but, first of all, let's make sure how often this
situation happens and whether it's really a serious problem. I mean, it might
be rather overkill.
>
> This is better than spinning though.
>
> > if (fatal_signal_pending(current))
> > return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3109ff7..f5e3e28 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > unsigned long nr_dirty;
> > while (unlikely(too_many_isolated(zone, file, sc))) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + long timeout = HZ/10;
> > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + schedule_timeout(timeout);
> > + }
> >
> > /* We are about to die and free our memory. Return now. */
> > if (fatal_signal_pending(current))
>
> This seems very reasonable. I'll review it more carefully tomorrow and if I
> spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting
> the too_many_isolated() case but I cannot think of a better alternative
> without adding more waitqueues.

Thanks. Mel.

>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

--
Kind regards,
Minchan Kim

2010-08-26 18:11:01

by Johannes Weiner

Subject: Re: [PATCH 1/3] writeback: Account for time spent congestion_waited

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Johannes Weiner <[email protected]>

2010-08-26 18:17:42

by Johannes Weiner

Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called with no BDIs congested, the caller will
> > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > CPU longer than its quota but otherwise will not sleep.
> > >
> > > This is aimed at reducing some of the major desktop stalls reported during
> > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > but it could just have been reclaiming clean page cache pages with no
> > > congestion. Without this patch, it would sleep for a full timeout but after
> > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > Similar logic applies to direct reclaimers that are not making enough
> > > progress.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/backing-dev.c | 20 ++++++++++++++------
> > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index a49167f..6abe860 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> >
> > Function's description should be changed since we don't wait for the next write any more.
> >
>
> My bad. I need to check that "next write" thing. It doesn't appear to be
> happening but maybe that side of things just broke somewhere in the
> distant past. I lack context of how this is meant to work so maybe
> someone will educate me.

On every retired io request the congestion state on the bdi is checked
and the congestion waitqueue woken up.

So without congestion, we still only wait until the next write
retires, but without any IO, we sleep the full timeout.

Check __freed_requests() in block/blk-core.c.
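(A simplified sketch of the wake-up side being described; paraphrased from memory rather than quoted from the kernel source, so treat the details as assumptions. The idea is that when a retired request takes the queue below the congestion-off threshold, the bdi congestion bit is cleared and anyone sleeping in congestion_wait() is woken.)

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	/* clear the per-bdi congestion bit for this direction... */
	clear_bit(sync ? BDI_sync_congested : BDI_async_congested,
		  &bdi->state);
	smp_mb__after_clear_bit();

	/* ...and wake anyone currently sleeping in congestion_wait() */
	if (waitqueue_active(wqh))
		wake_up(wqh);
}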

2010-08-26 18:29:15

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.

I am not convinced that unnecessary is the right word. On a workload
without any IO (i.e. no congestion_wait() necessary, ever), I noticed
the VM regressing both in time and in reclaiming the right pages when
simply removing congestion_wait() from the direct reclaim paths (the
one in __alloc_pages_slowpath and the other one in
do_try_to_free_pages).

So just being stupid and waiting for the timeout in direct reclaim
while kswapd can make progress seemed to do a better job for that
load.

I can not exactly pinpoint the reason for that behaviour, it would be
nice if somebody had an idea.

So personally I think it's a good idea to get an insight on the use of
congestion_wait() [patch 1] but I don't agree with changing its
behaviour just yet, or judging its usefulness solely on whether it
correctly waits for bdi congestion.

2010-08-26 20:23:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > ---
> > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > Function's description should be changed since we don't wait for the next write any more.
> > >
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired io request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_requests() in block/blk-core.c.
>

Seems reasonable. Still, if there is no write IO going on and no
congestion there seems to be no point going to sleep for the full
timeout. It still feels wrong.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-26 20:31:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).
>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.
>

There is a possibility that the behaviour in that case was due to flusher
threads doing the writes rather than direct reclaim queueing pages for IO
in an inefficient manner. So the stall is stupid but happens to work out
well because flusher threads get the chance to do work.

> So personally I think it's a good idea to get an insight on the use of
> congestion_wait() [patch 1] but I don't agree with changing its
> behaviour just yet, or judging its usefulness solely on whether it
> correctly waits for bdi congestion.
>

Unfortunately, I strongly suspect that some of the desktop stalls seen during
IO (one of which involved no writes) were due to calling congestion_wait
and waiting the full timeout where no writes are going on.

It gets potentially worse too. Let's say we have a system with many BDIs of
different speeds - e.g. an SSD at one end of the spectrum and a USB flash drive
at the other. The congestion for writes could be on the USB flash drive but,
due to low memory, the allocator, direct reclaimers and kswapd periodically go
to sleep in congestion_wait for the USB stick even though the bulk of the pages
needing reclaim are backed by the SSD.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 01:11:31

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > > ---
> > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > Function's description should be changed since we don't wait for the next write any more.
> > > >
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired io request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_requests() in block/blk-core.c.
> >
>
> Seems reasonable. Still, if there is no write IO going on and no
> congestion there seems to be no point going to sleep for the full
> timeout. It still feels wrong.

Yeah, the stupid sleeping feels wrong. However, there are ~20
congestion_wait() callers spread randomly across the VM, FS and block drivers.
Many of them may have been added by rule of thumb, but what if some of
them happen to depend on the old stupid sleeping behavior? Obviously
you've done extensive tests on the page reclaim paths, however that's
far from enough to cover the wider changes made by this patch.

We may have to do the conversions case by case. Converting to
congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
other waiting schemes.

Thanks,
Fengguang

2010-08-27 01:22:13

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

Minchan,

It's much cleaner to keep the unchanged congestion_wait() and add a
congestion_wait_check() for converting problematic wait sites. The
too_many_isolated() wait is merely a protective mechanism, I won't
bother to improve it at the cost of more code.

Thanks,
Fengguang

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 94cce51..7370683 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
>  	 * delay for some time until fewer pages are isolated
>  	 */
>  	while (unlikely(too_many_isolated(zone))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		long timeout = HZ/10;
> +		if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			schedule_timeout(timeout);
> +		}
>
>  		if (fatal_signal_pending(current))
>  			return 0;

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3109ff7..f5e3e28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	unsigned long nr_dirty;
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		long timeout = HZ/10;
> +		if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			schedule_timeout(timeout);
> +		}
>
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
> --
> 1.7.0.5
>
>
> --
> Kind regards,
> Minchan Kim

2010-08-27 01:41:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

Hi, Wu.

On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <[email protected]> wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.

Do you mean something like the following?

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait_check(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}
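(For reference, a minimal sketch of what such a congestion_wait_check()-style helper could look like; the real proposal is behind the lkml.org link above, and the nr_bdi_congested counter name is assumed from the "count of congested BDIs" idea in patch 2. It only sleeps when some BDI is actually congested, otherwise it just yields the CPU.)

long congestion_wait_check(int sync, long timeout)
{
	/* nobody is congested: don't sleep, just avoid hogging the CPU */
	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return 0;
	}

	/* otherwise behave exactly like congestion_wait() */
	return congestion_wait(sync, timeout);
}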


>
> Thanks,
> Fengguang
>

--
Kind regards,
Minchan Kim

2010-08-27 01:43:19

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > ---
> > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > > Function's description should be changed since we don't wait for the next write any more.
> > >
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired io request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_requests() in block/blk-core.c.

congestion_wait() is tightly related with pageout() and writeback,
however it may have some intention for the no-IO case as well.

- if write congested, maybe we are doing too much pageout(), so wait.
it might also reduce some get_request_wait() stalls (the normal way
is to explicitly check for congestion before doing write out).

- if any write completes, it may free some PG_reclaim pages, so proceed.
(when not congested)

- if no IO at all, the 100ms sleep might still prevent a page reclaimer
from stealing lots of slices from a busy computing program that
involves no page allocation at all.
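(As an illustration of the first point above, a minimal sketch of the usual "check before writing out" pattern; the helper name is hypothetical and this is not quoted from vmscan.c.)

/* test for write congestion before issuing pageout IO,
 * rather than issuing it and then sleeping in congestion_wait() */
static bool may_pageout(struct address_space *mapping)
{
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	return !bdi_write_congested(bdi);
}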

Thanks,
Fengguang

2010-08-27 01:51:18

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <[email protected]> wrote:
> > Minchan,
> >
> > It's much cleaner to keep the unchanged congestion_wait() and add a
> > congestion_wait_check() for converting problematic wait sites. The
> > too_many_isolated() wait is merely a protective mechanism, I won't
> > bother to improve it at the cost of more code.
>
> Do you mean something like the following?

No, I mean do not change the too_many_isolated() related code at all :)
And to use congestion_wait_check() in other places that we can prove
there is a problem that can be rightly fixed by changing to
congestion_wait_check().

> 	while (unlikely(too_many_isolated(zone, file, sc))) {
> 		congestion_wait_check(BLK_RW_ASYNC, HZ/10);
>
> 		/* We are about to die and free our memory. Return now. */
> 		if (fatal_signal_pending(current))
> 			return SWAP_CLUSTER_MAX;
> 	}

Thanks,
Fengguang

2010-08-27 02:02:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <[email protected]> wrote:
> On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <[email protected]> wrote:
>> > Minchan,
>> >
>> > It's much cleaner to keep the unchanged congestion_wait() and add a
>> > congestion_wait_check() for converting problematic wait sites. The
>> > too_many_isolated() wait is merely a protective mechanism, I won't
>> > bother to improve it at the cost of more code.
>>
>> Do you mean something like the following?
>
> No, I mean do not change the too_many_isolated() related code at all :)
> And to use congestion_wait_check() in other places that we can prove
> there is a problem that can be rightly fixed by changing to
> congestion_wait_check().

I always struggle to understand your comments.
Apparently, my eyes have a problem. ;(

This patch depends on Mel's series.
If congestion_wait() is changed to just return when there is no congestion,
we would get CPU hogging in too_many_isolated(). I think the same would
apply to Li's congestion_wait_check(), too.
If the current congestion_wait() is left unchanged, we don't need this patch.

Still, maybe I am just not understanding your comment. Sorry.
--
Kind regards,
Minchan Kim

2010-08-27 02:12:18

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour, it would be
> > nice if somebody had an idea.
> >
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.
If this is the case, we already have the queue congested. Removing
congestion_wait() might cause a regression, but neither your change nor
congestion_wait_check() should regress, as we do check whether the bdi is
congested.

Thanks,
Shaohua

2010-08-27 04:34:33

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Fri, Aug 27, 2010 at 10:02:52AM +0800, Minchan Kim wrote:
> On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <[email protected]> wrote:
> > On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> >> Hi, Wu.
> >>
> >> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <[email protected]> wrote:
> >> > Minchan,
> >> >
> >> > It's much cleaner to keep the unchanged congestion_wait() and add a
> >> > congestion_wait_check() for converting problematic wait sites. The
> >> > too_many_isolated() wait is merely a protective mechanism, I won't
> >> > bother to improve it at the cost of more code.
> >>
> >> Do you mean something like the following?
> >
> > No, I mean do not change the too_many_isolated() related code at all :)
> > And to use congestion_wait_check() in other places that we can prove
> > there is a problem that can be rightly fixed by changing to
> > congestion_wait_check().
>
> I always struggle to understand your comments.
> Apparently, my eyes have a problem. ;(

> This patch depends on Mel's series.
> If congestion_wait() is changed to just return when there is no congestion,
> we would get CPU hogging in too_many_isolated(). I think the same would
> apply to Li's congestion_wait_check(), too.
> If the current congestion_wait() is left unchanged, we don't need this patch.
>
> Still, maybe I am just not understanding your comment. Sorry.

Sorry! The confusion must come from the modified congestion_wait() by Mel.
My proposal is to _not_ modify congestion_wait(), but to add another
congestion_wait_check() which won't sleep 100ms when there is no IO. In this
way, the following chunks become unnecessary.

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
 	 * delay for some time until fewer pages are isolated
 	 */
 	while (unlikely(too_many_isolated(zone))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		long timeout = HZ/10;
+		if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(timeout);
+		}

 		if (fatal_signal_pending(current))
 			return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3109ff7..f5e3e28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_dirty;
 	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		long timeout = HZ/10;
+		if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(timeout);
+		}

 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))

Thanks,
Fengguang

2010-08-27 05:13:23

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep.

That, I think, is an invalid assumption. congestion_wait is used in
some places as a backoff mechanism that waits for some IO work to be
done, with congestion disappearing being an indication that progress
has been made and so we can retry sooner than the entire timeout.

For example, if _xfs_buf_lookup_pages() fails to allocate page cache
pages for a buffer, it will kick the xfsbufd to writeback dirty
buffers (so they can be freed) and immediately enter
congestion_wait(). If there isn't congestion when we enter
congestion_wait(), we still want to give the xfsbufds a chance to
clean some pages before we retry the allocation for the new buffer.
Removing the congestion_wait() sleep behaviour will effectively
_increase_ memory pressure with XFS on fast disk subsystems because
it now won't backoff between failed allocation attempts...

Perhaps a congestion_wait_iff_congested() variant is needed for the
VM? I can certainly see how it benefits the VM from a latency
perspective, but it is the opposite behaviour that is expected in
other places...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-08-27 08:16:57

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour, it would be
> > nice if somebody had an idea.
> >
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.

The workload was accessing a large sparse-file through mmap, so there
wasn't much IO in the first place.

And I experimented on the latest -mmotm where direct reclaim wouldn't
do writeback by itself anymore, but kick the flushers.

> > So personally I think it's a good idea to get an insight on the use of
> > congestion_wait() [patch 1] but I don't agree with changing its
> > behaviour just yet, or judging its usefulness solely on whether it
> > correctly waits for bdi congestion.
> >
>
> Unfortunately, I strongly suspect that some of the desktop stalls seen during
> IO (one of which involved no writes) were due to calling congestion_wait
> and waiting the full timeout where no writes are going on.

Oh, I am in full agreement here! Removing those congestion_wait() as
described above showed a reduction in peak latency. The dilemma is
only that it increased the overall walltime of the load.

And the scanning behaviour deteriorated, as in having increased
scanning pressure on other zones than the unpatched kernel did.

So I think very much that we need a fix. congestion_wait() causes
stalls and relying on random sleeps for the current reclaim behaviour
can not be the solution, at all.

I just don't think we can remove it based on the argument that it
doesn't do what it is supposed to do, when it does other things right
that it is not supposed to do ;-)

2010-08-27 09:20:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Fri, Aug 27, 2010 at 10:12:10AM +0800, Shaohua Li wrote:
> On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > nice if somebody had an idea.
> > >
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> If this is the case, we already have queue congested.

Not necessarily. The fact that, with the full series, we sometimes call
cond_resched(), indicating that there was no congestion when congestion_wait()
was called, proves that. We might have some IO on the queue but it's not
congested. Also, there is no guarantee that the congested queue is one we
care about. If we are reclaiming main memory and the congested queue is a
USB stick, we do not necessarily need to stall.

> removing
> congestion_wait() might cause regression but either your change or the
> congestion_wait_check() should not have the regression, as we do check
> if the bdi is congested.
>

What congestion_wait_check()? If there is no congestion and no writes,
congestion is the wrong event to sleep on.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 09:24:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > nice if somebody had an idea.
> > >
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> The workload was accessing a large sparse-file through mmap, so there
> wasn't much IO in the first place.
>

Then waiting on congestion was the totally wrong thing to do. We were
effectively calling sleep(HZ/10) and magically this was helping in some
undefined manner. Do you know *which* caller of congestion_wait() was
the most important to you?

> And I experimented on the latest -mmotm where direct reclaim wouldn't
> do writeback by itself anymore, but kick the flushers.
>

What were the results? I'm preparing a full series incorporating a
number of patches in this area to see how they behave in aggregate.

> > > So personally I think it's a good idea to get an insight on the use of
> > > congestion_wait() [patch 1] but I don't agree with changing its
> > > behaviour just yet, or judging its usefulness solely on whether it
> > > correctly waits for bdi congestion.
> > >
> >
> > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > IO (one of which involved no writes) were due to calling congestion_wait
> > and waiting the full timeout where no writes are going on.
>
> Oh, I am in full agreement here! Removing those congestion_wait() as
> described above showed a reduction in peak latency. The dilemma is
> only that it increased the overall walltime of the load.
>

Do you know why? Leaving in random sleeps hardly seems to be the right
approach.

> And the scanning behaviour deteriorated, as in having increased
> scanning pressure on other zones than the unpatched kernel did.
>

Probably because it was scanning more but not finding what it needed.
There is a condition other than congestion it is having trouble with. In
some respects, I think if we change congestion_wait() as I propose,
we may see a case where CPU usage is higher because it's now
encountering the unspecified reclaim problem we have.

> So I think very much that we need a fix. congestion_wait() causes
> stalls and relying on random sleeps for the current reclaim behaviour
> can not be the solution, at all.
>
> I just don't think we can remove it based on the argument that it
> doesn't do what it is supposed to do, when it does other things right
> that it is not supposed to do ;-)
>

We are not removing it, we are just stopping it going to sleep for
stupid reasons. If we find that wall time is increasing as a result, we
have a path to figuring out what the real underlying problem is instead
of sweeping it under the rug.

congestion_wait() is causing other problems such as Christian's bug of
massive IO regressions because it was sleeping when it shouldn't.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 09:34:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 03:13:16PM +1000, Dave Chinner wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep.
>
> That, I think, is an invalid assumption. congestion_wait is used in
> some places as a backoff mechanism that waits for some IO work to be
> done, with congestion disappearing being an indication that progress
> has been made and so we can retry sooner than the entire timeout.
>

As it's write IO rather than some IO, I wonder if that's really the
right thing to do. However, I accept your (and others') point that
converting all congestion_wait() callers may be too much of a change.

> For example, if _xfs_buf_lookup_pages() fails to allocate page cache
> pages for a buffer, it will kick the xfsbufd to writeback dirty
> buffers (so they can be freed) and immediately enter
> congestion_wait(). If there isn't congestion when we enter
> congestion_wait(), we still want to give the xfsbufds a chance to
> clean some pages before we retry the allocation for the new buffer.
> Removing the congestion_wait() sleep behaviour will effectively
> _increase_ memory pressure with XFS on fast disk subsystems because
> it now won't backoff between failed allocation attempts...
>
> Perhaps a congestion_wait_iff_congested() variant is needed for the
> VM? I can certainly see how it benefits the VM from a latency
> perspective, but it is the opposite behaviour that is expected in
> other places...
>

I've added a wait_iff_congested() and updated a few of the VM callers. I
changed a fairly minimal number, just what appeared to be the obvious ones
to change.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 09:34:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 09:11:06AM +0800, Wu Fengguang wrote:
> On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > > CPU longer than its quota but otherwise will not sleep.
> > > > > >
> > > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > > progress.
> > > > > >
> > > > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > > > ---
> > > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > > index a49167f..6abe860 100644
> > > > > > --- a/mm/backing-dev.c
> > > > > > +++ b/mm/backing-dev.c
> > > > >
> > > > > Function's description should be changed since we don't wait for the next write any more.
> > > > >
> > > >
> > > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > > happening but maybe that side of things just broke somewhere in the
> > > > distant past. I lack context of how this is meant to work so maybe
> > > > someone will educate me.
> > >
> > > On every retired io request the congestion state on the bdi is checked
> > > and the congestion waitqueue woken up.
> > >
> > > So without congestion, we still only wait until the next write
> > > retires, but without any IO, we sleep the full timeout.
> > >
> > > Check __freed_requests() in block/blk-core.c.
> > >
> >
> > Seems reasonable. Still, if there is no write IO going on and no
> > congestion there seems to be no point going to sleep for the full
> > timeout. It still feels wrong.
>
> Yeah, the stupid sleeping feels wrong. However, there are ~20
> congestion_wait() callers spread randomly across the VM, FS and block drivers.
> Many of them may have been added by rule of thumb, but what if some of
> them happen to depend on the old stupid sleeping behavior? Obviously
> you've done extensive tests on the page reclaim paths, however that's
> far from enough to cover the wider changes made by this patch.
>
> We may have to do the conversions case by case. Converting to
> congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
> other waiting schemes.
>

I am taking this direction now.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 09:37:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs

On Fri, Aug 27, 2010 at 09:42:54AM +0800, Wu Fengguang wrote:
> On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > > ---
> > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > Function's description should be changed since we don't wait for the next write any more.
> > > >
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired io request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_requests() in block/blk-core.c.
>
> congestion_wait() is tightly related with pageout() and writeback,
> however it may have some intention for the no-IO case as well.
>
> - if write congested, maybe we are doing too much pageout(), so wait.
> it might also reduce some get_request_wait() stalls (the normal way
> is to explicitly check for congestion before doing write out).
>
> - if any write completes, it may free some PG_reclaim pages, so proceed.
> (when not congested)
>

For these cases, would it make sense for wait_iff_congested() to compare
nr_writeback to nr_inactive and decide to wait on congestion if more
than half the inactive list is in writeback?
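(A sketch of the heuristic being floated here, purely illustrative and with assumed names; it would let the VM-side wait depend on how much of the inactive list is under writeback rather than only on bdi congestion.)

static bool should_wait_on_congestion(struct zone *zone, int file)
{
	unsigned long inactive, writeback;

	inactive = zone_page_state(zone,
			file ? NR_INACTIVE_FILE : NR_INACTIVE_ANON);
	writeback = zone_page_state(zone, NR_WRITEBACK);

	/* wait only if more than half of the inactive list is in writeback */
	return writeback > inactive / 2;
}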

> - if no IO at all, the 100ms sleep might still prevent a page reclaimer
> from stealing lots of slices from a busy computing program that
> involves no page allocation at all.
>

I don't think this is a very strong argument because cond_resched() is
being called.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-27 09:38:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion

On Fri, Aug 27, 2010 at 09:21:47AM +0800, Wu Fengguang wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.
>

This is what I've done. I dropped the patch again and am using
wait_iff_congested(). I left the too_many_isolated() callers as
congestion_wait().

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-29 16:04:05

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

Hi, Hannes.

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).

Not exactly the same as your experiment, but I had a similar experience.
I ran an experiment on swapout. The system had lots of anon pages but
almost no file pages, and it had already started to swap out, meaning the
system had no free memory. In this situation I forked a new process which
mmaps some MB of pages and touches them, so the VM should swap out some MB
of pages on behalf of the process. I measured the time until all the pages
had been touched.

Sometimes it is fast and sometimes it is slow; the gap is almost a factor
of two. The interesting thing is that when it is fast, many of the pages
are reclaimed by kswapd. Ah, I used swap on a ramdisk and reserved the swap
pages by touching them before starting the experiment, so I would say it's
not a _flushd_ effect.
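(A rough sketch of the kind of test being described, not Minchan's actual program: map some anonymous memory under memory pressure and time how long it takes to touch every page.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

int main(int argc, char **argv)
{
	size_t size = (argc > 1 ? (size_t)atol(argv[1]) : 256) << 20; /* MB */
	struct timespec start, end;
	size_t i;
	char *p;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < size; i += 4096)	/* touch every page */
		p[i] = 1;
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("touched %zu MB in %.2f seconds\n", size >> 20,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}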

>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.

I just thought the cause is that direct reclaim reclaims only 32 pages at a
time while kswapd can reclaim many pages in a batch. But I didn't look into
it any further because I was busy. Does that make sense?

--
Kind regards,
Minchan Kim

2010-08-30 13:19:45

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > record that the wait was unnecessary.
> > > >
> > > > I am not convinced that unnecessary is the right word. On a workload
> > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > the VM regressing both in time and in reclaiming the right pages when
> > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > one in __alloc_pages_slowpath and the other one in
> > > > do_try_to_free_pages).
> > > >
> > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > while kswapd can make progress seemed to do a better job for that
> > > > load.
> > > >
> > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > nice if somebody had an idea.
> > > >
> > >
> > > There is a possibility that the behaviour in that case was due to flusher
> > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > in an inefficient manner. So the stall is stupid but happens to work out
> > > well because flusher threads get the chance to do work.
> >
> > The workload was accessing a large sparse-file through mmap, so there
> > wasn't much IO in the first place.
> >
>
> Then waiting on congestion was the totally wrong thing to do. We were
> effectively calling sleep(HZ/10) and magically this was helping in some
> undefined manner. Do you know *which* caller of congestion_wait() was
> the most important to you?

Removing congestion_wait() in do_try_to_free_pages() definitely
worsens reclaim behaviour for this workload:

1. wallclock time of the testrun increases by 11%

2. the scanners do a worse job and go for the wrong zone:

-pgalloc_dma 79597
-pgalloc_dma32 134465902
+pgalloc_dma 297089
+pgalloc_dma32 134247237

-pgsteal_dma 77501
-pgsteal_dma32 133939446
+pgsteal_dma 294998
+pgsteal_dma32 133722312

-pgscan_kswapd_dma 145897
-pgscan_kswapd_dma32 266141381
+pgscan_kswapd_dma 287981
+pgscan_kswapd_dma32 186647637

-pgscan_direct_dma 9666
-pgscan_direct_dma32 1758655
+pgscan_direct_dma 302495
+pgscan_direct_dma32 80947179

-pageoutrun 1768531
-allocstall 614
+pageoutrun 1927451
+allocstall 8566

I attached the full vmstat contents below. Also the test program,
which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
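(The attached mapped-file-stream.c is not reproduced here. As a rough illustration of that kind of load, and only a guess at what the attachment does, a program might mmap a large sparse file and stream through it read-only, filling the page cache with clean pages that never need writing back. The argument meanings below are assumptions.)

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t size = argc > 2 ? (off_t)atoll(argv[2]) : ((off_t)1 << 30);
	int fd = open("sparse.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
	volatile char sink = 0;
	char *map;
	off_t off;

	if (fd < 0 || ftruncate(fd, size)) {	/* sparse: no blocks allocated */
		perror("sparse file");
		return 1;
	}
	map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (off = 0; off < size; off += 4096)	/* fault in clean pages */
		sink += map[off];

	munmap(map, size);
	close(fd);
	unlink("sparse.tmp");
	(void)sink;
	return 0;
}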

> > > > So personally I think it's a good idea to get an insight on the use of
> > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > correctly waits for bdi congestion.
> > > >
> > >
> > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > IO (one of which involved no writes) were due to calling congestion_wait
> > > and waiting the full timeout where no writes are going on.
> >
> > Oh, I am in full agreement here! Removing those congestion_wait() as
> > described above showed a reduction in peak latency. The dilemma is
> > only that it increased the overall walltime of the load.
> >
>
> Do you know why? Leaving in random sleeps hardly seems to be the right
> approach.

I am still trying to find out what's going wrong.

> > And the scanning behaviour deteriorated, as in having increased
> > scanning pressure on other zones than the unpatched kernel did.
> >
>
> Probably because it was scanning more but not finding what it needed.
> There is a condition other than congestion it is having trouble with. In
> some respects, I think if we change congestion_wait() as I propose,
> we may see a case where CPU usage is higher because it's now
> encountering the unspecified reclaim problem we have.

Exactly.

> > So I think very much that we need a fix. congestion_wait() causes
> > stalls and relying on random sleeps for the current reclaim behaviour
> > can not be the solution, at all.
> >
> > I just don't think we can remove it based on the argument that it
> > doesn't do what it is supposed to do, when it does other things right
> > that it is not supposed to do ;-)
> >
>
> We are not removing it, we are just stopping it going to sleep for
> stupid reasons. If we find that wall time is increasing as a result, we
> have a path to figuring out what the real underlying problem is instead
> of sweeping it under the rug.

Well, for that testcase it is in effect the same as a removal as
there's never congestion.

But again: I agree with your changes per se, I just don't think they
should get merged as long as they knowingly catalyze a problem that
has yet to be identified.


Attachments:
(No filename) (5.21 kB)
mapped-file-stream.c (1.84 kB)
vmstat.a.2 (1.67 kB)
vmstat.b.2 (1.68 kB)

2010-08-31 15:02:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary

On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > > record that the wait was unnecessary.
> > > > >
> > > > > I am not convinced that unnecessary is the right word. On a workload
> > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > > the VM regressing both in time and in reclaiming the right pages when
> > > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > > one in __alloc_pages_slowpath and the other one in
> > > > > do_try_to_free_pages).
> > > > >
> > > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > > while kswapd can make progress seemed to do a better job for that
> > > > > load.
> > > > >
> > > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > > nice if somebody had an idea.
> > > > >
> > > >
> > > > There is a possibility that the behaviour in that case was due to flusher
> > > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > > in an inefficient manner. So the stall is stupid but happens to work out
> > > > well because flusher threads get the chance to do work.
> > >
> > > The workload was accessing a large sparse-file through mmap, so there
> > > wasn't much IO in the first place.
> > >
> >
> > Then waiting on congestion was the totally wrong thing to do. We were
> > effectively calling sleep(HZ/10) and magically this was helping in some
> > undefined manner. Do you know *which* caller of congestion_wait() was
> > the most important to you?
>
> Removing congestion_wait() in do_try_to_free_pages() definitely
> worsens reclaim behaviour for this workload:
>
> 1. wallclock time of the testrun increases by 11%
>
> 2. the scanners do a worse job and go for the wrong zone:
>
> -pgalloc_dma 79597
> -pgalloc_dma32 134465902
> +pgalloc_dma 297089
> +pgalloc_dma32 134247237
>
> -pgsteal_dma 77501
> -pgsteal_dma32 133939446
> +pgsteal_dma 294998
> +pgsteal_dma32 133722312
>
> -pgscan_kswapd_dma 145897
> -pgscan_kswapd_dma32 266141381
> +pgscan_kswapd_dma 287981
> +pgscan_kswapd_dma32 186647637
>
> -pgscan_direct_dma 9666
> -pgscan_direct_dma32 1758655
> +pgscan_direct_dma 302495
> +pgscan_direct_dma32 80947179
>
> -pageoutrun 1768531
> -allocstall 614
> +pageoutrun 1927451
> +allocstall 8566
>
> I attached the full vmstat contents below. Also the test program,
> which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>

Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
interesting is your workload.

> > > > > So personally I think it's a good idea to get an insight on the use of
> > > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > > correctly waits for bdi congestion.
> > > > >
> > > >
> > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > > IO (one of which involved no writes) were due to calling congestion_wait
> > > > and waiting the full timeout where no writes are going on.
> > >
> > > Oh, I am in full agreement here! Removing those congestion_wait() as
> > > described above showed a reduction in peak latency. The dilemma is
> > > only that it increased the overall walltime of the load.
> > >
> >
> > Do you know why? Leaving in random sleeps hardly seems to be
> > the right approach.
>
> I am still trying to find out what's going wrong.
>
> > > And the scanning behaviour deteriorated, as in having increased
> > > scanning pressure on other zones than the unpatched kernel did.
> > >
> >
> > Probably because it was scanning more but not finding what it needed.
> > There is a condition other than congestion it is having trouble with. In
> > some respects, I think if we change congestion_wait() as I propose,
> > we may see a case where CPU usage is higher because it's now
> > encountering the unspecified reclaim problem we have.
>
> Exactly.
>
> > > So I think very much that we need a fix. congestion_wait() causes
> > > stalls and relying on random sleeps for the current reclaim behaviour
> > > can not be the solution, at all.
> > >
> > > I just don't think we can remove it based on the argument that it
> > > doesn't do what it is supposed to do, when it does other things right
> > > that it is not supposed to do ;-)
> > >
> >
> > We are not removing it, we are just stopping it going to sleep for
> > stupid reasons. If we find that wall time is increasing as a result, we
> > have a path to figuring out what the real underlying problem is instead
> > of sweeping it under the rug.
>
> Well, for that testcase it is in effect the same as a removal as
> there's never congestion.
>
> But again: I agree with your changes per-se, I just don't think they
> should get merged as long as they knowingly catalyze a problem that
> has yet to be identified.

Ok, there was significant feedback that wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback. I
also integrated your workload into my testsuite (btw, because there is no
license on the test program, the script has to download it from a Google
archive. I might get back to you on licensing so it can be made a permanent
part of the suite).

These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of the different series posted recently. The nocongest column is an
unreleased patch that keeps congestion_wait() and adds a wait_iff_congested()
which only goes to sleep if there is real congestion or a lot of writeback
going on. Rather than worrying about the patch contents for now, let's consider
the results for just your workload.
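
To give a flavour of what nocongest is doing, here is an illustrative
sketch only - not the unreleased patch itself. The nr_bdi_congested
counter check and the fallback to cond_resched() are assumptions of mine
about the shape of the logic:

/*
 * Illustrative sketch only, not the real patch. The idea is to avoid a
 * full sleep when no BDI is congested and to only stall for real when
 * congestion (or heavy writeback) has been detected. nr_bdi_congested
 * is assumed to be a per-direction count of currently congested BDIs.
 */
long wait_iff_congested_sketch(int sync, long timeout)
{
	/* No congestion anywhere: yield instead of stalling */
	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return 0;
	}

	/* Real congestion: behave like plain congestion_wait() */
	return congestion_wait(sync, timeout);
}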

The report is in four parts. The first is the vmstat counter differences as
a result of running your test; whether those differences are good or bad is
open to interpretation. The second part is based on the vmscan tracepoints,
the third part is based on the congestion tracepoints and the final part
reports CPU usage and elapsed time.

MICRO
traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)


So, the allocstall count goes up of course because it is incremented
every time direct reclaim is entered, and nocongest only goes to
sleep when there is congestion or significant writeback. I don't see
this as necessarily bad.
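
As a reminder of what the counter actually measures (a sketch from memory,
not verbatim vmscan.c, and the helper names below are placeholders),
allocstall is a vm_event that ticks once per entry to direct reclaim:

/*
 * Sketch from memory, not verbatim kernel source. The counter is bumped
 * once per direct reclaim attempt, so sleeping less in congestion_wait()
 * means more attempts and a larger allocstall count, regardless of how
 * much work each attempt actually does.
 */
static unsigned long direct_reclaim_entry_sketch(struct zonelist *zonelist,
						 struct scan_control *sc)
{
	count_vm_event(ALLOCSTALL);		/* one tick per direct reclaim entry */
	return shrink_zones_sketch(zonelist, sc);	/* placeholder for the real scanning */
}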

Direct scanning rates go up a bit as you'd expect - again because we are
sleeping less. It's interesting that the number of pages reclaimed is reduced,
implying that despite the higher scanning rates there is less reclaim activity.

It's debatable whether this is good or not because higher scanning rates in
themselves are not bad, but fewer pages reclaimed seems positive, so let's
see what the rest of the reports look like.

FTrace Reclaim Statistics: vmscan
               traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
Direct reclaims 2722 70156 67554 87691
Direct reclaim pages scanned 23955333 40161426 40034132 45080524
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 2718040 17801688 17622777 18379572
Kswapd wakeups 24 1 1 1
Kswapd pages scanned 177738381 161597313 161729224 156704078
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
Time kswapd awake (seconds) 489.17 400.20 403.19 390.08

Total pages scanned 201693714 201758739 201763356 201784602
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%

Interesting: kswapd now stays awake (it was woken only once) even though
its total time awake is reduced, and it looks like the much larger number
of wakeup requests is what keeps it awake. Despite the higher scan
rates from direct reclaim, the time actually spent in direct reclaim is
significantly reduced.
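
The explosion in wake requests without extra wakeups is consistent with how
wakeup_kswapd() behaves - a loose sketch from memory, not verbatim
mm/vmscan.c: when kswapd is already awake nothing is sleeping on the node's
wait queue, so the request falls through without registering a new wakeup.

/* Loose sketch from memory of wakeup_kswapd(), not verbatim source */
void wakeup_kswapd_sketch(struct zone *zone, int order)
{
	pg_data_t *pgdat = zone->zone_pgdat;

	if (pgdat->kswapd_max_order < order)
		pgdat->kswapd_max_order = order;

	/* kswapd already awake and scanning: nothing more to do */
	if (!waitqueue_active(&pgdat->kswapd_wait))
		return;

	wake_up_interruptible(&pgdat->kswapd_wait);
}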

Scanning rates and the number of times we enter direct reclaim both go up,
but as the work finishes a lot faster, it would seem we are doing less work
overall.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 3664 0 0 0
Direct time congest waited 247636ms 0ms 0ms 0ms
Direct full congest waited 3081 0 0 0
Direct number conditional waited 0 47587 45659 58779
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 3081 0 0 0
KSwapd number congest waited 1448 949 909 981
KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
KSwapd full congest waited 1056 90 115 147
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 1056 90 115 147

"congest waited" refers to congestion_wait() and "conditional waited" to
wait_iff_congested(). Look at what happens to the congest waited time
for direct reclaim - it disappears and, despite the number of times
wait_iff_congested() was called, it never actually decided it needed to
sleep. kswapd still waits on congestion but the time it spends doing so is
reduced.
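
For reference, those figures come from wrapping the waits with the
tracepoints added earlier in the series. The shape of the instrumentation
is roughly as follows; trace_sleep_congested() is a placeholder name of
mine, not the tracepoint from patch 1:

/*
 * Sketch of the instrumentation shape only. The report distinguishes
 * the timeout a caller asked for from the time it actually waited,
 * which is how the "full congest waited" figures can be derived.
 */
long congestion_wait_instrumented(int sync, long timeout)
{
	unsigned long start = jiffies;
	long ret = congestion_wait(sync, timeout);

	/* requested vs actual wait, both in jiffies */
	trace_sleep_congested(timeout, jiffies - start);

	return ret;
}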

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53

This is plain old wallclock time. The same test completes 91 seconds faster.
Ordinarily at this point I would be preparing a full report for the series
including the other benchmarks, but first I'm interested in whether there is
a significantly different reading of the above report as to whether it is a
"good" or "bad" result.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab